CN116975301A - Text clustering method, text clustering device, electronic equipment and computer readable storage medium - Google Patents


Info

Publication number
CN116975301A
Authority
CN
China
Prior art keywords
text
word
features
feature
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311231947.9A
Other languages
Chinese (zh)
Inventor
刘晓滨
张伟
赵博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202311231947.9A priority Critical patent/CN116975301A/en
Publication of CN116975301A publication Critical patent/CN116975301A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 — Information retrieval of unstructured textual data
    • G06F16/35 — Clustering; Classification
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 — Handling natural language data
    • G06F40/20 — Natural language analysis
    • G06F40/279 — Recognition of textual entities
    • G06F40/284 — Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the invention disclose a text clustering method, a text clustering device, an electronic device, and a computer-readable storage medium. After a plurality of program feedback texts of a target application are obtained, text segmentation is performed on the text content of the program feedback texts to obtain at least one text word corresponding to each program feedback text, and feature extraction is performed on the text words to obtain word features. Association weights of each text word within its program feedback text are then determined based on the word features, and the word features are transformed according to the association weights to obtain the target text feature of each program feedback text. Finally, the program feedback texts are clustered based on the target text features to obtain the text category of each program feedback text. This scheme can improve the accuracy of text clustering.

Description

Text clustering method, text clustering device, electronic equipment and computer readable storage medium
Technical Field
The present invention relates to the field of data processing, and in particular to a text clustering method, a text clustering device, an electronic device, and a computer-readable storage medium.
Background
In recent years, with the rapid development of internet technology, application programs have come into increasingly wide use. Errors and even abnormal exits are unavoidable while an application program runs, and the application feeds back a series of texts (program feedback texts) for program developers to reference when improving the program. Because an application often runs on a large number of terminal devices, massive amounts of text are fed back, and the texts need to be clustered before they can be analyzed. For program feedback text, the current text clustering approach usually extracts keywords from the text through a series of manually designed rules and treats texts with identical keywords as belonging to the same category (i.e., corresponding to the same problem).
During research and practice with the current technology, the inventors found that under complex conditions such as different program compilation modes, running environments, installation paths, or system versions, the text an application feeds back for the same problem or error often differs, so the keywords extracted by rules also differ. As a result, the accuracy of text clustering is low.
Disclosure of Invention
The embodiment of the invention provides a text clustering method, a text clustering device, electronic equipment and a computer readable storage medium, which can improve the accuracy of text clustering.
A text clustering method, comprising:
acquiring a plurality of program feedback texts of a target application, wherein the program feedback texts comprise at least one line of text content;
text word segmentation is carried out on the text content to obtain at least one text word corresponding to each program feedback text, and feature extraction is carried out on the text word to obtain word features of the text word;
determining an association weight of each text word in the program feedback text based on the word characteristics, wherein the association weight indicates the context information of the text word in the program feedback text;
performing feature transformation on the word features according to the association weights to obtain a target text feature of each program feedback text;
and clustering the multiple program feedback texts based on the target text characteristics to obtain the text category of each program feedback text.
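The claimed steps can be illustrated end to end with a deliberately simplified sketch. The tokenizer, the word-set representation, and the overlap threshold below are made-up stand-ins, not the patent's actual implementation (which uses learned word features and attention-based transformation):

```python
# Minimal sketch of the claimed pipeline with toy stand-in components.
# tokenize() and the 0.5 overlap threshold are hypothetical illustrations.

def tokenize(text):
    # Step 2: split a line of feedback text into text words.
    return text.replace("_", " ").replace(".", " ").split()

def cluster_feedback_texts(texts):
    # Steps 2-5 collapsed: represent each text by its word set and group
    # texts whose word sets overlap heavily (a crude proxy for clustering
    # on learned target text features).
    clusters = []  # list of (representative word set, member indices)
    labels = []
    for i, text in enumerate(texts):
        words = set(tokenize(text))
        for cid, (rep, members) in enumerate(clusters):
            overlap = len(words & rep) / max(len(words | rep), 1)
            if overlap > 0.5:  # assumed similarity threshold
                members.append(i)
                labels.append(cid)
                break
        else:
            clusters.append((words, [i]))
            labels.append(len(clusters) - 1)
    return labels

texts = [
    "0 libsystem_kernel.dylib __pthread_kill+8",
    "0 libsystem_kernel.dylib __pthread_kill+12",
    "1 libsystem_c.dylib _abort+100",
]
print(cluster_feedback_texts(texts))  # [0, 0, 1]
```

The first two stack lines differ only in an offset, so they land in one category, while the third describes a different call link and forms its own.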
Correspondingly, an embodiment of the present invention provides a text clustering device, including:
the system comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a plurality of program feedback texts of a target application, and the program feedback texts comprise at least one line of text content;
The extraction unit is used for carrying out text word segmentation on the text content to obtain at least one text word corresponding to each program feedback text, and carrying out feature extraction on the text word to obtain word features of the text word;
a determining unit, configured to determine, based on the word feature, an association weight of each text word in the program feedback text, where the association weight indicates context information of the text word in the program feedback text;
the feature transformation unit is used for performing feature transformation on the word features according to the association weights to obtain a target text feature of each program feedback text;
and the clustering unit is used for clustering the multiple program feedback texts based on the target text characteristics to obtain the text category of each program feedback text.
In some embodiments, the extracting unit may be specifically configured to obtain feature mapping information corresponding to the target application, where the feature mapping information is used to map the text word into a word feature; and carrying out feature mapping on the text words based on the feature mapping information to obtain word features of each text word.
In some embodiments, the extracting unit may be specifically configured to obtain a preset word vector set corresponding to the feature mapping information; and screening word vectors corresponding to each text word from the preset word vector set based on the feature mapping information, and taking the word vectors as word features of the text words.
In some embodiments, the determining unit may be specifically configured to perform position encoding on the text word to obtain a position feature of the text word; fusing the position features of the text words with the corresponding word features to obtain current word features; and extracting associated features from the current word features, and determining the associated weight of each text word in the program feedback text based on the associated features.
In some embodiments, the determining unit may be specifically configured to calculate, based on the query feature and the key feature, a relevance value between the text words in the program feedback text, where the relevance value indicates a degree of relevance between text words in different lines or in the same line; normalizing the relevance value to obtain the attention weight of each text word in the program feedback text, and taking the attention weight as the associated weight of the text word.
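The relevance-value-then-normalize computation described above matches standard scaled dot-product attention. The sketch below assumes that form; the 3-word, 4-dimensional features and the random projection matrices are made-up illustrations of what a learned model would supply:

```python
# Scaled dot-product attention sketch of the association-weight step.
# X, W_q, and W_k are toy stand-ins for learned quantities.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))    # current word features, one row per text word
W_q = rng.normal(size=(4, 4))  # query projection (assumed learned)
W_k = rng.normal(size=(4, 4))  # key projection (assumed learned)

Q, K = X @ W_q, X @ W_k
scores = Q @ K.T / np.sqrt(K.shape[1])    # relevance values between words
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)  # softmax

# Each row of `weights` holds one word's association weights over all words
# in the feedback text, and each row sums to 1 after normalization.
print(weights.shape)  # (3, 3)
```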
In some embodiments, the feature transformation unit may be specifically configured to weight the value feature based on the association weight to obtain a target word feature of the text word; and fusing the target word characteristics of the text words in the program feedback texts to obtain the target text characteristics of each program feedback text.
In some embodiments, the feature transformation unit may be specifically configured to weight the value feature based on the association weight to obtain a weighted value feature of the text word; performing feature conversion on the weighted value features, and taking the converted value features as current word features of the text words; and returning to the step of extracting the related features from the current word features until the preset feature conversion times are reached, and obtaining the target word features of the text words.
In some embodiments, the feature transformation unit may be specifically configured to screen at least one target word feature corresponding to each program feedback text from the target word features to obtain a word feature set corresponding to each program feedback text; and calculating the feature mean value of the word features in the word feature set to obtain the target text features of the program feedback text.
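The feature-mean variant above is ordinary mean pooling over the word feature set. A small sketch with toy values:

```python
# Mean pooling of target word features into one target text feature.
# The 3x2 word features are illustrative values only.
import numpy as np

word_features = np.array([   # target word features of one feedback text
    [1.0, 2.0],
    [3.0, 4.0],
    [5.0, 6.0],
])
text_feature = word_features.mean(axis=0)
print(text_feature)  # [3. 4.]
```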
In some embodiments, the feature transformation unit may be specifically configured to screen at least one target word feature corresponding to each program feedback text from the target word features to obtain a word feature set corresponding to each program feedback text; acquiring a text position of each text word in the program feedback text, and screening word characteristics corresponding to a preset text position from the word characteristic set based on the text position; and taking the word characteristic corresponding to the preset text position as the target text characteristic of the program feedback text.
In some embodiments, the clustering unit may be specifically configured to obtain density distribution information of the target text feature in the current feature space; calculating feature distances among the target text features under the current feature space; and clustering the program feedback texts based on the density distribution information and the feature distance to obtain the text category of each program feedback text.
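The density-based variant above can be sketched as grouping features that are mutually reachable within an assumed distance. Real density clustering (e.g. DBSCAN) also enforces a minimum neighbour count; this toy version omits it for brevity, and the `eps` value and points are illustrative:

```python
# Minimal density-style grouping: features within eps of a cluster member
# are absorbed into that cluster by neighbourhood expansion.
import numpy as np

def density_cluster(features, eps=1.0):
    n = len(features)
    labels = [-1] * n
    current = 0
    for i in range(n):
        if labels[i] != -1:
            continue
        labels[i] = current
        frontier = [i]
        while frontier:  # expand all density-reachable neighbours
            j = frontier.pop()
            for k in range(n):
                if labels[k] == -1 and np.linalg.norm(features[j] - features[k]) <= eps:
                    labels[k] = current
                    frontier.append(k)
        current += 1
    return labels

pts = np.array([[0.0, 0.0], [0.5, 0.0], [5.0, 5.0]])
print(density_cluster(pts))  # [0, 0, 1]
```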
In some embodiments, the clustering unit may be specifically configured to calculate a feature distance between the target text feature and a cluster center of at least one preset category, and update the cluster center based on the feature distance; taking the updated clustering center as the clustering center, and returning to execute the step of calculating the feature distance between the target text feature and the clustering center of at least one preset category until reaching a preset stopping condition, so as to obtain a target category corresponding to each target text feature; and taking the target category as the text category of the program feedback text corresponding to the target text characteristic.
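The iterate-until-stable step above is the classic k-means loop: assign each target text feature to its nearest cluster centre, update the centres, and stop when they no longer move. The features and initial centres below are toy values:

```python
# Bare k-means-style loop matching the iterative clustering step.
import numpy as np

def kmeans(features, centers, max_iter=100):
    centers = centers.astype(float).copy()
    for _ in range(max_iter):
        # feature distance of every target text feature to every centre
        d = np.linalg.norm(features[:, None, :] - centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        new_centers = np.array([
            features[assign == c].mean(axis=0) if np.any(assign == c) else centers[c]
            for c in range(len(centers))
        ])
        if np.allclose(new_centers, centers):  # preset stopping condition
            break
        centers = new_centers
    return assign, centers

feats = np.array([[0.0, 0.0], [0.2, 0.0], [4.0, 4.0], [4.2, 4.0]])
init = np.array([[0.0, 0.1], [4.0, 4.1]])
assign, centers = kmeans(feats, init)
print(assign)  # [0 0 1 1]
```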
In some embodiments, the clustering unit may be specifically configured to count, in a plurality of program feedback texts, a number of texts of the program feedback text corresponding to each text category; and sorting the text categories according to the text quantity, and screening target text categories from the text categories based on sorting results.
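Counting texts per category and ranking categories by size reduces to a frequency count. The category names below are illustrative:

```python
# Count program feedback texts per text category and rank by text number.
from collections import Counter

labels = ["crash_a", "crash_a", "crash_b", "crash_a", "crash_c", "crash_b"]
counts = Counter(labels)
ranked = counts.most_common()  # categories sorted by descending text count
top_category = ranked[0][0]    # target category screened from the ranking
print(ranked)
print(top_category)  # crash_a
```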
In some embodiments, the extracting unit may be specifically configured to perform content detection on the text content to obtain a content detection result of the text content; based on the content detection result, adjusting the text content to obtain target text content; and segmenting the target text content to obtain at least one text word corresponding to each program feedback text.
In addition, an embodiment of the present application provides an electronic device, including a processor and a memory; the memory stores an application program, and the processor is configured to run the application program in the memory to perform the text clustering method provided by the embodiments of the present application.
In addition, the embodiment of the application also provides a computer readable storage medium, which stores a plurality of instructions, wherein the instructions are suitable for being loaded by a processor to execute the steps in any text clustering method provided by the embodiment of the application.
In addition, the embodiment of the application also provides a computer program product, which comprises a computer program or instructions, and the computer program or instructions realize the steps in the text clustering method provided by the embodiment of the application when being executed by a processor.
After a plurality of program feedback texts of a target application are obtained, text segmentation is performed on the text content of the program feedback texts to obtain at least one text word corresponding to each program feedback text, and feature extraction is performed on the text words to obtain word features. Association weights of each text word within its program feedback text are then determined based on the word features, the word features are transformed according to the association weights to obtain the target text feature of each program feedback text, and the program feedback texts are clustered based on the target text features to obtain the text category of each program feedback text. Because the association weights are determined on top of word features that capture the full text of the program feedback text, context information of the text words is introduced, so the target text feature representing each program feedback text can be extracted more accurately. As a result, different program feedback texts produced for the same problem (category) due to differences in running environment and the like can still be grouped together, while differences between similar program feedback texts that belong to different problems can be distinguished, thereby improving the accuracy of text clustering.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic view of a text clustering method according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart of a text clustering method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of converting text words into word vectors provided by an embodiment of the present invention;
FIG. 4 is a schematic diagram of feature transformation of word features using a transducer model according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of clustering target text features according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of a clustering result of target text features provided by an embodiment of the present invention;
FIG. 7 is a schematic flow chart of text clustering for error text provided by an embodiment of the present invention;
FIG. 8 is another schematic flow chart of a text clustering method according to an embodiment of the present invention;
Fig. 9 is a schematic structural diagram of a text clustering device according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to fall within the scope of the application.
The embodiment of the application provides a text clustering method, a text clustering device, electronic equipment and a computer readable storage medium. The text clustering device can be integrated in electronic equipment, and the electronic equipment can be a server or a terminal and other equipment.
The server may be an independent physical server, a server cluster or distributed system formed by multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (Content Delivery Network, CDN) services, big data, and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited herein.
For example, referring to fig. 1, taking the text clustering device integrated in an electronic device as an example, the electronic device may perform text segmentation on text content in a program feedback text after obtaining a plurality of program feedback texts of a target application, obtain at least one text word corresponding to each program feedback text, perform feature extraction on the text word to obtain word features of the text word, then determine association weights of each text word in the program feedback text based on the word features, perform feature transformation on the word features according to the association weights to obtain target text features of each program feedback text, then cluster a plurality of program feedback texts based on the target text features to obtain text types of each program feedback text, and further improve accuracy of text clustering.
The text clustering method provided by the embodiments of the present application relates to the natural language processing (Natural Language Processing, NLP) direction of artificial intelligence (Artificial Intelligence, AI). According to the embodiments of the present application, a plurality of program feedback texts of a target application can be obtained, word segmentation and feature extraction are performed on the text content of the program feedback texts, the extracted word features are then fused and transformed through a large language model to obtain the target text feature of each program feedback text, and the program feedback texts are clustered based on the target text features.
Wherein artificial intelligence is the intelligence of simulating, extending and expanding a person using a digital computer or a machine controlled by a digital computer, sensing the environment, obtaining knowledge, and using knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, pre-training model technologies, operation/interaction systems, mechatronics, and the like. The pre-training model is also called a large model and a basic model, and can be widely applied to all large-direction downstream tasks of artificial intelligence after fine adjustment. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers in natural language. Natural language processing involves natural language, i.e., the language people use in daily life, so it is closely related to linguistics as well as to computer science and mathematics. The pre-training model, an important technique for model training in the artificial intelligence domain, developed from the large language model (Large Language Model) in the NLP field. Through fine-tuning, large language models can be widely applied to downstream tasks. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.
It should be understood that the embodiments of the present application involve data such as program feedback text or other text. When the following embodiments are applied to specific products or technologies, user permission or consent is required, and the collection, use, and processing of the data must comply with the relevant laws, regulations, and standards of the relevant countries and regions.
The following will describe in detail. The following description of the embodiments is not intended to limit the preferred embodiments.
The embodiment will be described from the perspective of a text clustering device, which may be integrated in an electronic device, where the electronic device may be a server or a device such as a terminal; the terminal may include a tablet computer, a notebook computer, a personal computer (PC, personal Computer), a wearable device, a virtual reality device, or other devices such as an intelligent device capable of text clustering.
A text clustering method, comprising:
acquiring a plurality of program feedback texts of a target application, wherein the program feedback texts comprise at least one line of text content, performing text word segmentation on the text content to obtain at least one text word corresponding to each program feedback text, performing feature extraction on the text word to obtain word features of the text word, determining association weights of each text word in the program feedback texts based on the word features, indicating context information of the text word in the program feedback texts, performing feature transformation on the word features according to the association weights to obtain target text features of each program feedback text, and clustering the plurality of program feedback texts based on the target text features to obtain text types of each program feedback text.
As shown in fig. 2, the specific flow of the text clustering method is as follows:
101. and acquiring a plurality of program feedback texts of the target application.
The target application may be any application that feeds back text when an unexpected condition or an abnormal exit occurs while it runs. The program feedback text is information, fed back by the application in text form when an unexpected condition occurs or the application exits with an error, about the function call link that produced the error; for example, it may include the file names, function names, linked library names, or compilation positions of the calls involved, so the program feedback text may also be called error report text or error stack text. A program feedback text may include at least one line of text content, and the text content may include various types of information about the function call link that produced the error.
The method for obtaining the multiple program feedback texts of the target application may be various, and specifically may be as follows:
For example, a plurality of program feedback texts of the target application uploaded by a terminal or client may be acquired directly, or a plurality of program feedback texts of the target application may be extracted from a network or a text database, or the target application may be run and the program feedback texts returned by its server received. Alternatively, when the number of program feedback texts exceeds a preset threshold or the texts occupy a large amount of storage, a text clustering request for the target application may be received, where the request carries the storage addresses of the program feedback texts, and the program feedback texts are acquired based on those storage addresses.
102. Text word segmentation is performed on the text content to obtain at least one text word corresponding to each program feedback text, and feature extraction is performed on the text words to obtain word features of the text words.
A text word is understood to mean a word or phrase consisting of at least one text character in the text content.
The text content may be segmented in various ways, and the method specifically includes the following steps:
for example, content detection may be performed on the text content to obtain a content detection result of the text content, based on the content detection result, the text content may be adjusted to obtain a target text content, word segmentation may be performed on the target text content to obtain at least one text word corresponding to each program feedback text, or text word segmentation may also be directly performed on all text contents in the program feedback text to obtain at least one text word corresponding to each program feedback text, and so on.
The text content may be detected in various ways, for example, noise detection may be performed in the text content, so as to obtain a content detection result of the text content, or an explanatory text may be detected in the text content, or the like.
After the text content is detected, the text content can be adjusted based on the content detection result, so that the target text content is obtained. The text content may be adjusted in various ways, for example, when the content detection result indicates that noise information exists in the text content, noise information, such as a system path, irrelevant to a program may be deleted from the text content, so as to obtain the target text content, or when the content detection result indicates that no explanatory text exists in the text content, an explanatory text corresponding to the text content may be added to the text content, so as to obtain the target text content, and so on.
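As a concrete illustration of the noise-deletion adjustment, the sketch below strips an absolute system path from a line of feedback text before segmentation. The regular expression and the sample line are assumptions for illustration, not the patent's actual rule:

```python
# Hypothetical content-adjustment step: delete path noise irrelevant to
# the program, keeping only the file name at the end of the path.
import re

def adjust_content(line):
    # drop path prefixes such as /usr/lib/system/ before the file name
    return re.sub(r"(/[\w.-]+)+/", "", line)

line = "2 libsystem_c.dylib loaded from /usr/lib/system/libsystem_c.dylib"
print(adjust_content(line))
# 2 libsystem_c.dylib loaded from libsystem_c.dylib
```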
After the text content is adjusted, the adjusted target text content can be segmented, so that at least one text word corresponding to each program feedback text is obtained. The method for word segmentation of the target text content may be various, for example, a maximum word segmentation matching algorithm, a shortest path word segmentation algorithm, a generated model word segmentation algorithm, a discriminant word segmentation algorithm, a neural network word segmentation algorithm or the like may be used for word segmentation of the target text content, so as to obtain at least one text word corresponding to each program feedback text, or other word segmentation algorithms capable of text word segmentation may be used for text word segmentation of the target text content, so as to obtain at least one text word corresponding to each program feedback text, and so on.
After text segmentation is performed on the program feedback text, the program feedback text can be expressed as T = [w1, w2, …, wn], where wi represents one word in the text and n represents the number of words in the entire text. For example, suppose the program feedback text includes the following:
0 libsystem_kernel.dylib __pthread_kill+8 // first line of text content
1 libsystem_c.dylib _abort+100 // second line of text content
2 libsystem_c.dylib _err // third line of text content
After text segmentation of the program feedback text, the program feedback text may be represented as a word list: T = ['0', 'lib', 'system', 'kernel', ……].
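A simple regex tokenizer produces a word list of the kind shown above. The exact segmentation rule is not specified in the patent (note it even splits "libsystem" into 'lib' and 'system'), so splitting on alphanumeric runs here is an assumption:

```python
# Hypothetical segmentation rule: extract runs of letters or digits as
# text words, discarding punctuation such as '_', '.', and '+'.
import re

def segment(text):
    return re.findall(r"[A-Za-z]+|\d+", text)

line = "0 libsystem_kernel.dylib __pthread_kill+8"
print(segment(line))
# ['0', 'libsystem', 'kernel', 'dylib', 'pthread', 'kill', '8']
```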
The text word segmentation method for all text contents in the program feedback text may be similar to the text word segmentation method for the target text contents, which is described in detail above, and will not be repeated here.
After text segmentation is performed on the text content, feature extraction can be performed on the at least one text word obtained from the segmented program feedback text to obtain the word features of the text words. Features may be extracted from the text words in various ways; for example, feature mapping information corresponding to the target application can be obtained, and feature mapping is performed on the text words based on the feature mapping information to obtain the word features of each text word.
The feature mapping information may be used to map text words into word features, and the types of feature mapping information may be various, for example, may include a word mapping table or other information that may map text words into word features, and so on. Taking feature mapping information as a word mapping table for example, each word (text word) in the word mapping table corresponds to a unique word vector, and the word vector corresponding to each text word can be identified through the word mapping table, so that the word feature of each text word is obtained. A pre-set word mapping table may be stored in the large language model. The feature mapping information may be used to map the text words in various ways, for example, a preset word vector set corresponding to the feature mapping information may be obtained, and based on the feature mapping information, a word vector corresponding to each text word is selected from the preset word vector set, and the word vector is used as a word feature of the text word.
Taking the feature mapping information as a word mapping table as an example, each text word (word) is converted into a corresponding word vector according to the word mapping table through the large language model: v_i = Embedding(w_i), where Embedding(·) denotes the word mapping table from text words to vectors, and v_i denotes the mapped word vector. After the text words in the program feedback text are mapped one by one into corresponding word vectors, the word vectors can be used as word features of the text words, and one program feedback text can be represented by a word-vector sequence: V = [v_1, v_2, …, v_n]. The process of converting text words into word vectors may be as shown in fig. 3.
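The word-mapping-table lookup v_i = Embedding(w_i) can be sketched as follows; the vocabulary, the embedding dimension, and the random matrix values are illustrative assumptions rather than the patent's actual mapping table:

```python
import numpy as np

# Hypothetical word mapping table: each known text word maps to a row
# index of an embedding matrix (values here are random placeholders).
vocab = {"0": 0, "lib": 1, "system": 2, "kernel": 3}
rng = np.random.default_rng(seed=0)
embedding_matrix = rng.normal(size=(len(vocab), 8))  # 8-dim word vectors

def embed(words):
    """Map each text word w_i to its word vector v_i = Embedding(w_i)."""
    return np.stack([embedding_matrix[vocab[w]] for w in words])

# Word-vector sequence V = [v_1, v_2, ..., v_n] for one feedback text.
V = embed(["0", "lib", "system", "kernel"])
```

A trained model would of course use learned embeddings; the shape of the result (one vector per text word) is the point being illustrated.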
103. Based on the word characteristics, an associated weight for each text word in the program feedback text is determined.
Wherein the associated weight indicates context information of the text word in the program feedback text, and the context information can be information of relevance between texts of different lines or the same line in one program feedback text.
The method for determining the association weight of each text word in the program feedback text based on the word characteristics can be various, and specifically can be as follows:
for example, the text word may be position-coded to obtain a position feature of the text word, the position feature of the text word is fused with the corresponding word feature to obtain a current word feature, associated features are extracted from the current word feature, and the associated weight of each text word in the program feedback text is determined based on the associated features.
The method for performing position coding on the text word may be various, for example, position information of the text word in a program feedback text may be obtained, and based on the position information, the text word is subjected to position coding, so as to obtain a position feature of the text word.
After the text word is subjected to position coding, the position feature and the corresponding word feature can be fused, so that the current word feature is obtained. The method of fusing the position feature and the corresponding word feature may be various, for example, the position feature and the corresponding word feature may be directly spliced to obtain the current word feature of the text word, or the fusion weight may be obtained, the position feature and the word feature may be weighted based on the fusion weight, the weighted position feature and the weighted word feature may be fused to obtain the current word feature of the text word, and so on.
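One concrete way to realize the position coding and the fusion step is sketched below; the sinusoidal encoding and fusion-by-addition are common choices assumed for illustration (the text above also allows splicing or weighted fusion):

```python
import numpy as np

def positional_encoding(n_positions, dim):
    """Sinusoidal position features: each position gets a distinct vector.
    This is one common choice; the scheme only requires some position code."""
    pos = np.arange(n_positions)[:, None]          # (n, 1)
    i = np.arange(dim)[None, :]                    # (1, dim)
    angles = pos / np.power(10000.0, (2 * (i // 2)) / dim)
    # Even dimensions use sin, odd dimensions use cos.
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

# Hypothetical word features for 5 text words, 8 dimensions each.
word_features = np.random.default_rng(1).normal(size=(5, 8))
# Fuse position features with word features by element-wise addition.
current_word_features = word_features + positional_encoding(5, 8)
```

Direct splicing (concatenation) would instead double the feature dimension; addition keeps it unchanged, which is why it is the usual default.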
After the position features of the text words are fused with the corresponding word features, the associated features can be extracted from the fused current word features. The method for extracting the association feature from the current word feature may be various, for example, feature conversion may be performed on the current word feature by using a query matrix, a key matrix and a value matrix in a preset weight matrix, to obtain a query feature corresponding to the query matrix, a key feature corresponding to the key matrix and a value feature corresponding to the value matrix, and the query feature (Q), the key feature (K) and the value feature (V) are used as the association feature.
It should be noted that the number of the associated features may be one or more, and one associated feature may include one query feature, one key feature, and one value feature.
After the associated features are extracted, the associated weight of each text word in the program feedback text is determined based on the associated features. There may be various ways to determine the associated weight of each text word in the program feedback text; for example, relevance values between the text words in the program feedback text may be calculated based on the query features and the key features, and the relevance values may be normalized to obtain the attention weight of each text word in the program feedback text, where the attention weight is used as the associated weight of the text word.
Wherein the relevance value indicates the degree of relevance between text words of different lines or the same line. The method for calculating the relevance values between the text words in the program feedback text can be various. For example, a target text word can be screened out of the program feedback text, and the products between the query feature of the target text word and the key features of the text words other than the target text word are calculated respectively, so as to obtain the relevance values of the target text word; the step of screening a target text word out of the program feedback text is then repeated until every text word in the program feedback text has served as the target text word, so as to obtain the relevance value corresponding to each text word in the program feedback text.
After the relevance value corresponding to each text word in the program feedback text is calculated, the relevance values may be normalized to obtain the attention weight of each text word in the program feedback text. There are various ways to normalize the relevance values; for example, the relevance values may be scaled to obtain target relevance values, and the target relevance values are converted into probability distribution values by using softmax (a normalized exponential function), so as to obtain the attention weight of each text word.
After normalizing the correlation value, the attention weight obtained after normalization can be used as the correlation weight of the text word.
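Steps 103's pipeline (query/key/value conversion, dot-product relevance values, softmax normalization) can be sketched as scaled dot-product attention; the matrix sizes and random weights are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax: converts scores to probability rows.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
n, d = 4, 8                       # 4 text words, 8-dim current word features
X = rng.normal(size=(n, d))       # current word features
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))  # preset weight matrices

Q, K, V = X @ Wq, X @ Wk, X @ Wv           # query / key / value features
scores = Q @ K.T / np.sqrt(d)              # relevance values (scaled products)
attn_weights = softmax(scores, axis=-1)    # attention weights = association weights
```

Each row of `attn_weights` sums to 1 and gives one text word's attention distribution over all text words in the feedback text.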
104. And carrying out feature transformation on the word features according to the associated weights to obtain target text features of the feedback text of each program.
For example, the value features in the associated features may be weighted based on the associated weights to obtain target word features of the text words, and the target word features of the text words in the program feedback text may be fused to obtain target text features of the program feedback text.
The method for weighting the value features in the associated features based on the associated weights may be various; for example, the value features may be weighted based on the associated weights to obtain weighted value features of the text word, feature conversion is performed on the weighted value features, the converted value features are used as the current word features of the text word, and the step of extracting the associated features from the current word features is performed again until the number of feature conversions reaches a preset number, so as to obtain the target word features of the text word.
The method for weighting the median feature of the associated feature may be various based on the associated weight, for example, when the number of the associated features is one, the associated weight may be used to weight the value feature in the associated feature, so as to obtain a weighted value feature of the text word, when the number of the associated features is multiple, the associated weight may be used to weight the value feature in the associated features, so as to obtain multiple initial weighted value features, and the multiple initial weighted value features may be fused to obtain a weighted value feature of the text word.
Wherein an attention network (Attention) can be adopted to extract the associated features from the current word features of the text word, determine the associated weight of each text word in the program feedback text based on the associated features, and weight the value features in the associated features based on the associated weights, so as to obtain the weighted value features of the text word. The network structure of the attention network may be various and may include, for example, a self-attention network (self-Attention), a multi-head attention network, or other types of attention networks, and the like.
After the value features in the associated features are weighted, feature conversion may be performed on the weighted value features. The weighted value features may be feature-converted in various ways; for example, a feedforward neural network (Feed Forward Neural Network) may be used to perform feature conversion on the weighted value features to obtain converted value features, or other conversion networks capable of performing feature conversion on the weighted value features may be used to obtain the converted value features, and so on.
After feature conversion of the weighted value features, the converted value features can be used as the current word features of the text word. Then, the step of extracting the associated features from the current word features is returned to and performed until the preset number of conversions is reached, so that the target word features of the text word can be obtained.
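The loop just described (weight the value features, feed-forward convert, re-extract, repeat for a preset number of rounds) can be sketched as stacked attention-plus-feed-forward layers; the ReLU feed-forward form, the matrix shapes, and the depth of two layers are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_ffn_layer(X, Wq, Wk, Wv, W1, W2):
    """One conversion round: weight value features by the association
    weights, then feature-convert with a two-matrix feed-forward network."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    A = softmax(Q @ K.T / np.sqrt(X.shape[1]), axis=-1)  # association weights
    weighted = A @ V                                     # weighted value features
    return np.maximum(weighted @ W1, 0) @ W2             # feed-forward (ReLU)

rng = np.random.default_rng(3)
n, d = 4, 8
X = rng.normal(size=(n, d))                 # current word features
# Preset number of conversions = 2 layers, each with its own matrices.
params = [tuple(rng.normal(size=(d, d)) * 0.1 for _ in range(5)) for _ in range(2)]

for layer in params:
    X = attention_ffn_layer(X, *layer)      # output feeds the next round
target_word_features = X
```

A real Transformer layer additionally uses residual connections and layer normalization, omitted here for brevity.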
The preset number of conversions can be a preset number of times that feature conversion is performed on the weighted value features. Taking feature conversion of the weighted value features by the feedforward neural network as an example, the preset number of conversions can also indicate the number or layer number of feedforward neural networks. A Transformer (a text processing network) model may include a feedforward neural network and an attention network, and the preset number of conversions may be understood as the number or layer number of Transformer networks. Therefore, one or more layers of the Transformer model can be adopted to extract the features of the text words, the associated weight of each text word in the program feedback text is determined based on the extracted word features, and the word features are subjected to feature transformation according to the associated weights, so as to obtain the target word features of the text words.
The process of converting the word features by using one or more layers of the Transformer model to obtain the target word features of the text words can be as shown in fig. 4. Taking the word features v_1 to v_n of the text words in the program feedback text as an example, the word features can be converted into f_1 to f_n by using the Transformer model.
After the value features are weighted based on the associated weights to obtain the target word features of the text words, the target word features of the text words in the program feedback texts can be fused, so as to obtain the target text feature of each program feedback text. The method of fusing the target word features of the text words in a program feedback text may be various. For example, at least one target word feature corresponding to each program feedback text may be screened out from the target word features to obtain a word feature set corresponding to each program feedback text, and the feature mean of the word features in the word feature set is calculated to obtain the target text feature of the program feedback text. Alternatively, at least one target word feature corresponding to each program feedback text may be screened out from the target word features to obtain a word feature set corresponding to each program feedback text, the text position of each text word in the program feedback text is obtained, the word feature corresponding to a preset text position is screened out from the word feature set based on the text positions, and the word feature corresponding to the preset text position is used as the target text feature of the program feedback text, and so on.
The means for calculating the feature mean of the word features in the word feature set may be various; for example, the feature value of each word feature in the word feature set may be obtained, the mean of the feature values is calculated, and the feature-value mean is used as the target text feature of the program feedback text corresponding to the word feature set, as shown in formula (1):

X = (1/n) · Σ_{i=1}^{n} f_i    (1)

wherein X is the target text feature of the program feedback text, n is the number of text words in the program feedback text, i indexes the i-th text word, and f_i is the target word feature of the i-th text word.
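Formula (1), the mean pooling of target word features into a single target text feature, is a one-line operation; the feature values below are illustrative:

```python
import numpy as np

# Hypothetical target word features f_1..f_4 for one feedback text (3-dim).
target_word_features = np.arange(12, dtype=float).reshape(4, 3)

# Formula (1): X = (1/n) * sum_i f_i, the mean over the word axis.
target_text_feature = target_word_features.mean(axis=0)
print(target_text_feature)  # [4.5 5.5 6.5]
```

The result has the same dimension as one word feature, regardless of how many text words the feedback text contains, which is what makes the texts comparable for clustering.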
Wherein, for N program feedback texts (t_1, t_2, t_3, …, t_N), the above process may be adopted to sequentially extract the corresponding target text features, which may be sequentially expressed as: X_1, X_2, X_3, …, X_N.
Where text position may be understood as the position of a text word in the program feedback text. The preset text position may be a preset text position, for example, may include a first text word, a last text word, at least one text word in the middle, or any other text position. Taking a preset text position as a first text word as an example, the word characteristics corresponding to the first text word can be screened out from the word characteristic set, and the word characteristics corresponding to the first text word are used as target text characteristics of the corresponding program feedback text. Taking a preset text position as a last text word as an example, the word characteristic corresponding to the last text word can be screened out from the word characteristic set, and the word characteristic corresponding to the last text word is used as a target text characteristic of a corresponding program feedback text, and the like.
105. And clustering the multiple program feedback texts based on the target text characteristics to obtain the text category of each program feedback text.
The text category can be understood as the category of the clustering cluster corresponding to the clustered text fed back by the program.
The clustering method for the multiple program feedback texts based on the target text features may be various, and specifically may be as follows:
for example, density distribution information of the target text features in the current feature space may be obtained, the feature distances between the target text features are calculated in the current feature space, and the program feedback texts are clustered based on the density distribution information and the feature distances to obtain the text category of each program feedback text; or the feature distance between each target text feature and the clustering center of at least one preset category may be calculated, the clustering center is updated based on the feature distances, the updated clustering center is used as the clustering center, and the step of calculating the feature distance between the target text features and the clustering center of the at least one preset category is returned to and performed until a preset stop condition is reached, so as to obtain the text category corresponding to each program feedback text; or the plurality of program feedback texts may be clustered based on the target text features by using other clustering algorithms, so as to obtain the text category of each program feedback text, and so on.
The density distribution information may be information about the density distribution of the target text features in the corresponding current feature space. Based on the density distribution information and the feature distances, the program feedback texts can be clustered in various ways; for example, a DBSCAN (a density-based clustering algorithm) interface can be adopted to classify, according to the density distribution information, target text features that are close to one another and relatively far from other target text features into one class, so as to obtain a clustering result corresponding to each target text feature, and the text category of the program feedback text corresponding to each target text feature can be determined based on the clustering result.
In the clustering process of the target text features, as shown in fig. 5, the same cluster includes target text features classified into the same class; before clustering, none of the target text features has category information, and after clustering, target text features that are relatively close to one another are classified into the same class (i.e., the target text features within the dashed circles are classified into one class). After the target text features are clustered, the program feedback texts corresponding to the target text features in the same cluster can be regarded as program feedback texts of the same text category. Taking the program feedback text as the error-reporting text as an example, the error-reporting texts corresponding to the target text features in the same cluster can be regarded as the same error-reporting problem. The clustering result of the target text features can be shown in fig. 6: each block represents a target text feature, blocks of the same style represent target text features clustered into the same cluster, and blocks of different styles represent the different clusters obtained by clustering. As can be seen in fig. 6, target text features that are close to one another are divided into the same cluster after clustering, so that the corresponding program feedback texts are regarded as the same text category.
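The density-based branch can be sketched with scikit-learn's DBSCAN implementation; the toy 2-D target text features and the `eps`/`min_samples` values are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical target text features: two tight groups, far apart,
# standing in for the feature vectors of two error categories.
features = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],   # feedback texts of one category
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],   # feedback texts of another category
])

# Points within eps of >= min_samples neighbors form one density cluster.
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(features)
```

`labels` assigns each feedback text a cluster index (its text category); DBSCAN marks outliers with `-1`, which here does not occur because every point is dense.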
The clustering center may be the feature center of the cluster corresponding to a preset category. For this text clustering manner, a K-means (a clustering algorithm) algorithm may be used to calculate the feature distance between each target text feature and the clustering center of at least one preset category, update the clustering center based on the feature distances, and take the updated clustering center as the clustering center; the step of calculating the feature distance between the target text features and the clustering center of the at least one preset category is returned to and performed until a preset stop condition is reached, so as to obtain a clustering result corresponding to each target text feature, and the text category of the program feedback text corresponding to each target text feature is determined based on the clustering result.
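The K-means branch (assign to nearest center, update centers, repeat until a stop condition) can be sketched directly; the data points, the iteration cap, and the random initialization are illustrative assumptions:

```python
import numpy as np

def kmeans(X, k, iters=10, seed=0):
    """Minimal K-means sketch: each target text feature is assigned to its
    nearest clustering center, then centers are updated, for a fixed number
    of rounds (standing in for the 'preset stop condition')."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Feature distance between every feature and every center.
        d = np.linalg.norm(X[:, None] - centers[None, :], axis=2)
        labels = d.argmin(axis=1)
        # Update each center to the mean of its assigned features;
        # keep the old center if a cluster happens to be empty.
        centers = np.stack([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
    return labels, centers

# Hypothetical target text features forming two obvious groups.
X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
labels, centers = kmeans(X, k=2)
```

Unlike DBSCAN, K-means requires the number of preset categories k in advance, which is why the document presents both as alternatives.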
After clustering the plurality of program feedback texts based on the target text features, the text quantity of the program feedback text corresponding to each text category can be counted in the plurality of program feedback texts, the text categories are ordered according to the text quantity, and the target text category is screened out from the text categories based on the ordering result.
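The counting-and-ranking step just described is a frequency sort over the clustering result; the label list below is an illustrative assumption:

```python
from collections import Counter

# Hypothetical clustering result: one text-category label per feedback text.
labels = [0, 1, 0, 2, 0, 1]

counts = Counter(labels)                 # text quantity per text category
ranked = counts.most_common()            # categories ordered by text quantity
target_category, target_count = ranked[0]  # the high-frequency category
```

`ranked` then drives the prioritization described next: the category with the largest text quantity is the high-frequency problem to handle first.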
Taking the program feedback text as the error reporting text as an example, the text types of the error reporting text are ordered, so that the high-frequency problem (target text type) can be obtained, and the high-frequency text is processed or optimized preferentially, and the like.
Taking the program feedback text as the error-reporting text as an example, the scheme can adopt a large language model to perform cluster analysis on the error-reporting texts, and the main steps can include feature extraction and feature clustering of the error-reporting texts, as shown in fig. 7. According to the scheme, the features of the error-reporting text are extracted by using the large language model; the model can acquire the full text information of the error-reporting text and, through a multi-layer attention mechanism, comprehensively learn the interrelationships among different lines in the error-reporting text, so that the context information of the error-reporting text is comprehensively extracted and a feature vector (target text feature) capable of representing the error-reporting text is obtained. In this way, the situation that the same problem feeds back different error-reporting texts due to differences in running environment and the like can be handled, and at the same time, the differences among similar error-reporting texts belonging to different problems can be distinguished.
Taking the program feedback text as the error-reporting text as an example, the scheme is verified using error-reporting texts produced by programs running online. The verification set contains 3,549 error-reporting texts from two platforms, iOS (a system platform) and Android (a system platform). The error-reporting texts are labeled by category through manual annotation. In the verification process, the accuracy of similar sample pairs after clustering and the F1-score (a classification evaluation metric) are adopted as evaluation indexes. In addition, two different publicly available large language models, StarEncoder (a large language model) and CodeGeex2 (a large language model), are adopted to extract the text features, and the DBSCAN clustering algorithm is used to cluster the extracted features. Compared with the traditional rule-based method, the improvement achieved by the scheme can be shown in Table 1:
TABLE 1
Compared with the rule-based clustering method, the present scheme can cluster the error-reporting texts more accurately, and the F1-score is improved by 31.4% when the CodeGeex2 large language model is used. Therefore, the invention can provide more accurate error-report statistics for program developers, thereby further improving the efficiency of analyzing and solving program error problems.
The application scenario of the text clustering method for the program feedback text in the scheme can be various, and the application scenario specifically can be as follows:
(1) Analyzing and processing feedback information of various running programs in a program management platform;
(2) In the error information analysis task of various software, thereby assisting developers in analyzing and optimizing the software;
(3) Establishing a development auxiliary document, retrieving by extracting the characteristics of the error-reporting text information, thereby returning similar error-reporting conditions and corresponding solutions, assisting an algorithm development flow, and the like.
It should be noted that, the text clustering method may also be applied to feature extraction and cluster analysis of general texts (texts other than program feedback texts), to implement summary arrangement of similar texts, and so on.
As can be seen from the above, after obtaining a plurality of program feedback texts of a target application, the embodiment of the application performs text word segmentation on the text content in the program feedback texts to obtain at least one text word corresponding to each program feedback text, performs feature extraction on the text words to obtain word features of the text words, then determines the associated weight of each text word in the program feedback text based on the word features, performs feature transformation on the word features according to the associated weights to obtain the target text feature of each program feedback text, and then clusters the plurality of program feedback texts based on the target text features to obtain the text category of each program feedback text. According to the method and the device, on the basis of acquiring word features over the full text information of the program feedback text, the associated weight of each text word in the program feedback text is determined, so that the context information of the text words in the program feedback text is introduced and the target text feature representing the program feedback text can be extracted more accurately; further, the situation that the same problem (category) feeds back different program feedback texts due to differences in running environment and the like can be handled, and at the same time, the differences among similar program feedback texts belonging to different problems can be distinguished, so that the accuracy of text clustering can be improved.
According to the method described in the above embodiments, examples are described in further detail below.
In this embodiment, the text clustering device is specifically integrated in an electronic device, the electronic device is a server, a program feedback text is a fault reporting text, and feature mapping information is illustrated as a word mapping table.
As shown in fig. 8, a text clustering method specifically includes the following steps:
201. the server obtains a plurality of error reporting texts of the target application.
For example, the server may directly obtain a plurality of error reporting texts of the target application uploaded by the terminal or the client, or may extract a plurality of error reporting texts of the target application from the network or the text database, or may run the target application, and receive a plurality of error reporting texts returned by the server of the target application, or may further receive a text clustering request for the target application when the number of error reporting texts exceeds a preset number threshold or the content space occupied by the error reporting texts is large, where the text clustering request carries a storage address of the error reporting texts, obtain a plurality of error reporting texts of the target application based on the storage address, and so on.
202. The server performs text word segmentation on the text content to obtain at least one text word corresponding to each error text.
For example, the server may perform noise detection in the text content to obtain a content detection result of the text content, or may also detect explanatory text in the text content, or the like.
When the content detection result indicates that noise information exists in the text content, the server can delete noise information such as a system path and the like which are irrelevant to programs in the text content, so as to obtain target text content, or when the content detection result indicates that explanatory text does not exist in the text content, the server can add the explanatory text corresponding to the text content in the text content, so as to obtain the target text content, and the like.
The server may perform word segmentation on the target text content or the text content by using a maximum-matching word segmentation algorithm, a shortest-path word segmentation algorithm, a generative-model word segmentation algorithm, a discriminative word segmentation algorithm, a neural network word segmentation algorithm, or the like, so as to obtain at least one text word corresponding to each error text, or may perform text word segmentation on the target text content or the text content by using other word segmentation algorithms capable of performing text word segmentation, so as to obtain at least one text word corresponding to each error text, and so on.
203. And the server performs feature extraction on the text words to obtain word features of the text words.
For example, the server may obtain a word mapping table corresponding to the target application in the large language model, obtain a preset word vector set corresponding to the word mapping table, screen a word vector corresponding to each text word from the preset word vector set based on the word mapping table, and use the word vector as a word feature of the text word.
204. The server determines an associated weight for each text word in the error-prone text based on the word characteristics.
For example, the server may obtain location information of the text word in the error text, and perform location encoding on the text word based on the location information, so as to obtain location characteristics of the text word.
The server may directly splice the position feature and the corresponding word feature to obtain the current word feature of the text word, or may also obtain a fusion weight, respectively weight the position feature and the word feature based on the fusion weight, fuse the weighted position feature and the weighted word feature to obtain the current word feature of the text word, and so on.
The server can respectively perform feature conversion on the current word features through a query matrix, a key matrix and a value matrix in a preset weight matrix to obtain query features corresponding to the query matrix, key features corresponding to the key matrix and value features corresponding to the value matrix, and takes the query features (Q), the key features (K) and the value features (V) as associated features.
The server can screen a target text word out of the error text, and calculate the products between the query feature of the target text word and the key features of the text words other than the target text word respectively, so as to obtain the relevance values of the target text word; the step of screening a target text word out of the error text is then repeated until every text word in the error text has served as the target text word, so as to obtain the relevance value corresponding to each text word in the error text.
The server may normalize the correlation values to obtain normalized target correlation values, and convert the target correlation values into probability distribution values using softmax (normalized exponential function), thereby obtaining the attention weight of each text word. And taking the attention weight obtained after normalization as the association weight of the text word.
205. The server weights the value features in the associated features based on the associated weights to obtain the target word features of the text words.
For example, when the number of associated features is one, the server may use the associated weights to weight the value features in the associated features, thereby obtaining weighted value features of the text word, when the number of associated features is multiple, use the associated weights to weight the value features in the multiple associated features, respectively, obtain multiple initial weighted value features, fuse the multiple initial weighted value features, obtain weighted value features of the text word, and so on.
The server may employ a feed forward neural network (Feed Forward Neural Network) to perform feature conversion on the weighted value features to obtain converted value features, or may employ other conversion networks that may perform feature conversion on the weighted value features to obtain converted value features, and so on.
And the server takes the converted value characteristic as the current word characteristic of the text word, and returns to execute the step of extracting the associated characteristic from the current word characteristic until the preset characteristic conversion times are reached, so as to obtain the target word characteristic of the text word.
206. And the server fuses the target word characteristics of the text words in the error text to obtain the target text characteristics of the error text.
For example, the server may screen at least one target word feature corresponding to each error-reporting text from the target word features, obtain a word feature set corresponding to each error-reporting text, obtain a feature value of each word feature in the word feature set, calculate an average value of the feature values, and use the average value of the feature values as a target text feature of the error-reporting text corresponding to the word feature set, as shown in formula (1), or may screen at least one target word feature corresponding to each error-reporting text from the target word features, obtain a word feature set corresponding to each error-reporting text, obtain a text position of each text word in the error-reporting text, screen a word feature corresponding to a preset text position from the word feature set based on the text position, use the word feature corresponding to the preset text position as a target text feature of the error-reporting text, and so on.
207. And the server clusters a plurality of error reporting texts based on the target text characteristics to obtain the text category of each error reporting text.
For example, the server may obtain density distribution information of the target text features in the current feature space, calculate the feature distances between the target text features in the current feature space, and use a DBSCAN (a density-based clustering algorithm) interface to classify, according to the density distribution information, target text features that are close to one another and relatively far from other target text features into one class, so as to obtain a clustering result corresponding to each target text feature, and determine the text category of the error text corresponding to each target text feature based on the clustering result; or the server may use a K-means (a clustering algorithm) algorithm to calculate the feature distance between each target text feature and the clustering center of at least one preset category, update the clustering center based on the feature distances, use the updated clustering center as the clustering center, and return to the step of calculating the feature distance between the target text features and the clustering center of the at least one preset category until a preset stop condition is reached, so as to obtain a clustering result corresponding to each target text feature, and determine the text category of the error text corresponding to each target text feature based on the clustering result; or the server may cluster the plurality of error texts based on the target text features by using other clustering algorithms, so as to obtain the text category of each error text, and so on.
In some embodiments, after clustering the multiple error-reporting texts based on the target text features, the server may further count the number of text of the error-reporting text corresponding to each text category in the multiple error-reporting texts, sort the text categories according to the number of text, and screen the target text category from the text categories based on the sorting result.
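The counting-and-screening step described above can be illustrated with `collections.Counter`; the text IDs and category labels below are made up for the example.

```python
from collections import Counter

# hypothetical clustering result: error-reporting text id -> text category
text_categories = {"t1": 3, "t2": 0, "t3": 3, "t4": 1, "t5": 3, "t6": 0}

# count the number of error-reporting texts corresponding to each category
counts = Counter(text_categories.values())

# sort categories by text count and screen out, e.g., the top-2 target categories
top_categories = [cat for cat, _ in counts.most_common(2)]
print(top_categories)  # [3, 0]
```

Screening by frequency like this surfaces the most commonly reported problems first, which is the practical point of ranking the categories.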
As can be seen from the foregoing, after obtaining a plurality of error-reporting texts of a target application, the server in this embodiment performs text word segmentation on the text content in the error-reporting texts to obtain at least one text word corresponding to each error-reporting text, performs feature extraction on the text words to obtain word features of the text words, determines the association weight of each text word in the error-reporting text based on the word features, performs feature transformation on the word features according to the association weights to obtain the target text feature of each error-reporting text, and then clusters the plurality of error-reporting texts based on the target text features to obtain the text category of each error-reporting text. On the basis of acquiring word features covering the full text information of the error-reporting text, this scheme introduces the context information of the text words by determining the association weight of each text word in the error-reporting text, so that target text features representing the error-reporting text can be extracted more accurately; as a result, error-reporting texts that differ only because of differences in the running environment and the like can still be grouped into the same problem (category), while similar error-reporting texts belonging to different problems can be distinguished, thereby improving the accuracy of text clustering.
In order to better implement the above method, the embodiment of the present invention further provides a text clustering device, where the text clustering device may be integrated in an electronic device, such as a server or a terminal, where the terminal may include a tablet computer, a notebook computer, and/or a personal computer.
For example, as shown in fig. 9, the text clustering apparatus may include an acquisition unit 301, an extraction unit 302, a determination unit 303, a feature transformation unit 304, and a clustering unit 305, as follows:
(1) An acquisition unit 301;
an acquisition unit 301, configured to acquire a plurality of program feedback texts of the target application, where each program feedback text includes at least one line of text content.
For example, the acquisition unit 301 may specifically be configured to directly acquire a plurality of program feedback texts fed back by the target application during running, or to extract a plurality of program feedback texts of the target application from a network or a text database, or the like.
(2) An extraction unit 302;
the extracting unit 302 is configured to perform text word segmentation on the text content to obtain at least one text word corresponding to each program feedback text, and perform feature extraction on the text words to obtain word features of the text words.
For example, the extracting unit 302 may be specifically configured to perform text word segmentation on the text content to obtain at least one text word corresponding to each program feedback text, obtain feature mapping information corresponding to the target application, and perform feature mapping on the text words based on the feature mapping information to obtain the word features of each text word.
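Screening the word vector of each text word from a preset word-vector set can be sketched as a simple table lookup. The vocabulary, vector values, and `map_words` helper below are hypothetical illustrations of the feature-mapping idea, not the patent's actual mapping information.

```python
import numpy as np

# hypothetical feature mapping information: text word -> row index
vocab = {"null": 0, "pointer": 1, "exception": 2}

# hypothetical preset word-vector set, one row per word
word_vectors = np.array([[0.1, 0.2],
                         [0.3, 0.4],
                         [0.5, 0.6]])

def map_words(words):
    """Screen the word vector of each text word from the preset set
    and use it as the word feature of that text word."""
    return np.stack([word_vectors[vocab[w]] for w in words])

feats = map_words(["null", "exception"])
print(feats.shape)  # (2, 2): two text words, each with a 2-d word feature
```

In practice the vector table would come from a pretrained embedding matched to the target application, but the lookup mechanics are the same.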
(3) A determination unit 303;
a determining unit 303, configured to determine, based on the word characteristics, an association weight of each text word in the program feedback text, where the association weight indicates context information of the text word in the program feedback text.
For example, the determining unit 303 may be specifically configured to perform position encoding on the text words to obtain the position features of the text words, fuse the position feature of each text word with the corresponding word feature to obtain current word features, extract associated features from the current word features, and determine the association weight of each text word in the program feedback text based on the associated features.
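A minimal numpy sketch of the position-feature fusion and association-weight computation, in the style of scaled dot-product attention (query and key features, relevance values, then normalisation). The weight matrices, dimensions, and random inputs are illustrative assumptions; the patent does not prescribe these exact matrices.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def association_weights(word_feats, pos_feats, wq, wk):
    """Fuse position features with word features, extract query/key
    associated features, and normalise the relevance values into
    association weights."""
    current = word_feats + pos_feats          # fuse position and word features
    q, k = current @ wq, current @ wk         # query / key associated features
    scores = q @ k.T / np.sqrt(q.shape[-1])   # relevance values between text words
    return softmax(scores)                    # normalised association weights

rng = np.random.default_rng(1)
n, d = 4, 8   # 4 text words, 8-dimensional features
w = association_weights(rng.normal(size=(n, d)), rng.normal(size=(n, d)),
                        rng.normal(size=(d, d)), rng.normal(size=(d, d)))
print(w.shape)  # (4, 4); each row sums to 1
```

Row `i` of the result holds the association weight of every text word with respect to word `i`, which is exactly the context information later used to weight the value features.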
(4) A feature transformation unit 304;
and the feature transformation unit 304 is configured to perform feature transformation on the word features according to the association weights, so as to obtain the target text feature of each program feedback text.
For example, the feature transformation unit 304 may be specifically configured to weight the value feature in the associated feature based on the associated weight to obtain a target word feature of the text word, and fuse the target word feature of the text word in the program feedback text to obtain a target text feature of the program feedback text.
(5) A clustering unit 305;
and the clustering unit 305 is configured to cluster the multiple program feedback texts based on the target text features, so as to obtain a text category of each program feedback text.
For example, the clustering unit 305 may be specifically configured to obtain density distribution information of the target text feature in the current feature space, calculate feature distances between the target text features in the current feature space, cluster the program feedback text based on the density distribution information and the feature distances, and obtain a text category of each program feedback text, or calculate feature distances between the target text feature and a cluster center of at least one preset category, update the cluster center based on the feature distances, take the updated cluster center as the cluster center, and return to perform the step of calculating the feature distances between the target text feature and the cluster center of at least one preset category until a preset stop condition is reached, and obtain a target category corresponding to each target text.
In the implementation, each unit may be implemented as an independent entity, or may be implemented as the same entity or several entities in any combination, and the implementation of each unit may be referred to the foregoing method embodiment, which is not described herein again.
As can be seen from the above, after obtaining a plurality of program feedback texts of a target application, the embodiment of the application performs text word segmentation on the text content in the program feedback texts to obtain at least one text word corresponding to each program feedback text, performs feature extraction on the text words to obtain word features of the text words, then determines the association weight of each text word in the program feedback text based on the word features, performs feature transformation on the word features according to the association weights to obtain the target text feature of each program feedback text, and then clusters the plurality of program feedback texts based on the target text features to obtain the text category of each program feedback text. On the basis of acquiring word features covering the full text information of the program feedback text, this scheme introduces the context information of the text words by determining the association weight of each text word in the program feedback text, so that target text features representing the program feedback text can be extracted more accurately; as a result, program feedback texts that differ only because of differences in the running environment and the like can still be grouped into the same problem (category), while similar program feedback texts belonging to different problems can be distinguished, thereby improving the accuracy of text clustering.
The embodiment of the invention also provides an electronic device, as shown in fig. 10, which shows a schematic structural diagram of the electronic device according to the embodiment of the invention, specifically:
the electronic device may include one or more processing cores 'processors 401, one or more computer-readable storage media's memory 402, power supply 403, and input unit 404, among other components. It will be appreciated by those skilled in the art that the electronic device structure shown in fig. 10 is not limiting of the electronic device and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components. Wherein:
the processor 401 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 402, and calling data stored in the memory 402. Optionally, processor 401 may include one or more processing cores; preferably, the processor 401 may integrate an application processor and a modem processor, wherein the application processor mainly processes an operating system, a user interface, an application program, etc., and the modem processor mainly processes wireless communication. It will be appreciated that the modem processor described above may not be integrated into the processor 401.
The memory 402 may be used to store software programs and modules, and the processor 401 executes various functional applications and data processing by running the software programs and modules stored in the memory 402. The memory 402 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to the use of the electronic device, etc. In addition, the memory 402 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 402 may also include a memory controller to provide the processor 401 with access to the memory 402.
The electronic device further comprises a power supply 403 for supplying power to the various components, preferably the power supply 403 may be logically connected to the processor 401 by a power management system, so that functions of managing charging, discharging, and power consumption are performed by the power management system. The power supply 403 may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
The electronic device may further comprise an input unit 404, which input unit 404 may be used for receiving input digital or character information and generating keyboard, mouse, joystick, optical or trackball signal inputs in connection with user settings and function control.
Although not shown, the electronic device may further include a display unit or the like, which is not described herein. In particular, in this embodiment, the processor 401 in the electronic device loads executable files corresponding to the processes of one or more application programs into the memory 402 according to the following instructions, and the processor 401 executes the application programs stored in the memory 402, so as to implement various functions as follows:
acquiring a plurality of program feedback texts of a target application, wherein the program feedback texts comprise at least one line of text content, performing text word segmentation on the text content to obtain at least one text word corresponding to each program feedback text, performing feature extraction on the text word to obtain word features of the text word, determining association weights of each text word in the program feedback texts based on the word features, indicating context information of the text word in the program feedback texts, performing feature transformation on the word features according to the association weights to obtain target text features of each program feedback text, and clustering the plurality of program feedback texts based on the target text features to obtain text types of each program feedback text.
For example, the electronic device may directly obtain a plurality of program feedback texts fed back by the target application during running, or extract a plurality of program feedback texts of the target application from a network or a text database; perform text word segmentation on the text content to obtain at least one text word corresponding to each program feedback text, obtain feature mapping information corresponding to the target application, and perform feature mapping on the text words based on the feature mapping information to obtain the word features of each text word; perform position encoding on the text words to obtain the position features of the text words, fuse the position feature of each text word with the corresponding word feature to obtain current word features, extract associated features from the current word features, and determine the association weight of each text word in the program feedback text based on the associated features; weight the value features in the associated features based on the association weights to obtain the target word features of the text words, and fuse the target word features of the text words in each program feedback text to obtain the target text feature of the program feedback text; obtain density distribution information of the target text features in the current feature space, calculate the feature distances between the target text features in the current feature space, and cluster the program feedback texts based on the density distribution information and the feature distances to obtain the text category of each program feedback text; or calculate the feature distance between each target text feature and the clustering center of at least one preset category, update the clustering centers based on the feature distances, take the updated clustering centers as the clustering centers, and return to the step of calculating the feature distance between each target text feature and the clustering center of the at least one preset category until a preset stop condition is reached, so as to obtain the target category corresponding to each target text feature, and the like.
The specific implementation of each operation may be referred to the previous embodiments, and will not be described herein.
As can be seen from the above, after obtaining a plurality of program feedback texts of a target application, the embodiment of the invention performs text word segmentation on the text content in the program feedback texts to obtain at least one text word corresponding to each program feedback text, performs feature extraction on the text words to obtain word features of the text words, then determines the association weight of each text word in the program feedback text based on the word features, performs feature transformation on the word features according to the association weights to obtain the target text feature of each program feedback text, and then clusters the plurality of program feedback texts based on the target text features to obtain the text category of each program feedback text. On the basis of acquiring word features covering the full text information of the program feedback text, this scheme introduces the context information of the text words by determining the association weight of each text word in the program feedback text, so that target text features representing the program feedback text can be extracted more accurately; as a result, program feedback texts that differ only because of differences in the running environment and the like can still be grouped into the same problem (category), while similar program feedback texts belonging to different problems can be distinguished, thereby improving the accuracy of text clustering.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present invention provide a computer readable storage medium having stored therein a plurality of instructions capable of being loaded by a processor to perform the steps of any of the text clustering methods provided by the embodiments of the present invention. For example, the instructions may perform the steps of:
acquiring a plurality of program feedback texts of a target application, wherein the program feedback texts comprise at least one line of text content, performing text word segmentation on the text content to obtain at least one text word corresponding to each program feedback text, performing feature extraction on the text word to obtain word features of the text word, determining association weights of each text word in the program feedback texts based on the word features, indicating context information of the text word in the program feedback texts, performing feature transformation on the word features according to the association weights to obtain target text features of each program feedback text, and clustering the plurality of program feedback texts based on the target text features to obtain text types of each program feedback text.
For example, acquiring a plurality of program feedback texts fed back by the target application during running, or extracting a plurality of program feedback texts of the target application from a network or a text database; performing text word segmentation on the text content to obtain at least one text word corresponding to each program feedback text, obtaining feature mapping information corresponding to the target application, and performing feature mapping on the text words based on the feature mapping information to obtain the word features of each text word; performing position encoding on the text words to obtain the position features of the text words, fusing the position feature of each text word with the corresponding word feature to obtain current word features, extracting associated features from the current word features, and determining the association weight of each text word in the program feedback text based on the associated features; weighting the value features in the associated features based on the association weights to obtain the target word features of the text words, and fusing the target word features of the text words in each program feedback text to obtain the target text feature of the program feedback text; obtaining density distribution information of the target text features in the current feature space, calculating the feature distances between the target text features in the current feature space, and clustering the program feedback texts based on the density distribution information and the feature distances to obtain the text category of each program feedback text; or calculating the feature distance between each target text feature and the clustering center of at least one preset category, updating the clustering centers based on the feature distances, taking the updated clustering centers as the clustering centers, and returning to the step of calculating the feature distance between each target text feature and the clustering center of the at least one preset category until a preset stop condition is reached, so as to obtain the target category corresponding to each target text feature, and the like.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
Wherein the computer-readable storage medium may comprise: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic disk or optical disk, and the like.
Because the instructions stored in the computer-readable storage medium can execute the steps in any text clustering method provided by the embodiments of the present application, they can achieve the beneficial effects achievable by any text clustering method provided by the embodiments of the present application, which are detailed in the previous embodiments and are not described herein again.
Wherein, according to an aspect of the application, a computer program product or a computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of an electronic device reads the computer instructions from the computer-readable storage medium and executes them, causing the electronic device to perform the methods provided in the various alternative implementations of the text clustering aspect or the error-reporting-text clustering aspect described above.
The text clustering method, the text clustering device, the electronic device, and the computer-readable storage medium provided by the embodiments of the present invention have been described in detail above, and specific examples have been applied herein to explain the principle and implementation of the present invention; the description of the above embodiments is only intended to help understand the method and core idea of the present invention. Meanwhile, since those skilled in the art may make variations in the specific embodiments and the application scope according to the idea of the present invention, the content of this description should not be construed as limiting the present invention.

Claims (17)

1. A text clustering method, comprising:
acquiring a plurality of program feedback texts of a target application, wherein the program feedback texts comprise at least one line of text content;
text word segmentation is carried out on the text content to obtain at least one text word corresponding to each program feedback text, and feature extraction is carried out on the text word to obtain word features of the text word;
determining an association weight of each text word in the program feedback text based on the word characteristics, wherein the association weight indicates the context information of the text word in the program feedback text;
according to the association weight, carrying out feature transformation on the word features to obtain the target text feature of each program feedback text;
and clustering the multiple program feedback texts based on the target text characteristics to obtain the text category of each program feedback text.
2. The text clustering method according to claim 1, wherein the feature extraction of the text words to obtain word features of the text words includes:
acquiring feature mapping information corresponding to the target application, wherein the feature mapping information is used for mapping the text word into word features;
and carrying out feature mapping on the text words based on the feature mapping information to obtain word features of each text word.
3. The text clustering method according to claim 2, wherein the feature mapping is performed on the text words based on the feature mapping information to obtain word features of each text word, and the method comprises:
acquiring a preset word vector set corresponding to the feature mapping information;
and screening word vectors corresponding to each text word from the preset word vector set based on the feature mapping information, and taking the word vectors as word features of the text words.
4. The text clustering method of claim 1, wherein the determining the associated weight for each text word in the program feedback text based on the word characteristics comprises:
performing position coding on the text word to obtain the position characteristic of the text word;
fusing the position features of the text words with the corresponding word features to obtain current word features;
and extracting associated features from the current word features, and determining the associated weight of each text word in the program feedback text based on the associated features.
5. The text clustering method of claim 4, wherein the associated features include query features and key features, and wherein the determining the associated weight for each text word in the program feedback text based on the associated features comprises:
calculating a relevance value between the text words in the program feedback text based on the query feature and the key feature, wherein the relevance value indicates the relevance degree between the text words in different lines or the same line;
normalizing the relevance value to obtain the attention weight of each text word in the program feedback text, and taking the attention weight as the associated weight of the text word.
6. The text clustering method of claim 4, wherein the associated features further comprise value features, the feature transformation is performed on the word features according to the associated weights to obtain target text features of each program feedback text, and the method comprises:
weighting the value features based on the associated weights to obtain target word features of the text words;
and fusing the target word characteristics of the text words in the program feedback texts to obtain the target text characteristics of each program feedback text.
7. The text clustering method of claim 6, wherein the weighting the value feature based on the association weight to obtain a target word feature of the text word comprises:
weighting the value characteristic based on the association weight to obtain a weighted value characteristic of the text word;
performing feature conversion on the weighted value features, and taking the converted value features as current word features of the text words;
and returning to the step of extracting the associated features from the current word features until a preset number of feature conversions is reached, so as to obtain the target word features of the text words.
8. The text clustering method according to claim 6, wherein the fusing the target word features of the text words in the program feedback text to obtain the target text features of each program feedback text includes:
screening at least one target word feature corresponding to each program feedback text from the target word features to obtain a word feature set corresponding to each program feedback text;
and calculating the feature mean value of the word features in the word feature set to obtain the target text features of the program feedback text.
9. The text clustering method according to claim 6, wherein the fusing the target word features of the text words in the program feedback text to obtain the target text features of each program feedback text includes:
screening at least one target word feature corresponding to each program feedback text from the target word features to obtain a word feature set corresponding to each program feedback text;
acquiring a text position of each text word in the program feedback text, and screening word characteristics corresponding to a preset text position from the word characteristic set based on the text position;
and taking the word feature corresponding to the preset text position as the target text feature of the program feedback text.
10. The text clustering method according to claim 1, wherein clustering the plurality of program feedback texts based on the target text features to obtain a text category of each program feedback text comprises:
acquiring density distribution information of the target text features in a current feature space;
calculating feature distances among the target text features under the current feature space;
and clustering the program feedback texts based on the density distribution information and the feature distance to obtain the text category of each program feedback text.
11. The text clustering method according to claim 1, wherein clustering the plurality of program feedback texts based on the target text features to obtain a text category of each program feedback text comprises:
calculating the feature distance between the target text feature and at least one clustering center of a preset category, and updating the clustering center based on the feature distance;
taking the updated clustering center as the clustering center, and returning to execute the step of calculating the feature distance between the target text feature and the clustering center of at least one preset category until reaching a preset stopping condition, so as to obtain a target category corresponding to each target text feature;
and taking the target category as the text category of the program feedback text corresponding to the target text feature.
12. The text clustering method according to claim 1, wherein the clustering the plurality of program feedback texts based on the target text features, after obtaining the text category of each program feedback text, further comprises:
counting the text quantity of the program feedback text corresponding to each text category in a plurality of program feedback texts;
and sorting the text categories according to the text quantity, and screening target text categories from the text categories based on sorting results.
13. The text clustering method according to claim 1, wherein the text segmentation is performed on the text content to obtain at least one text word corresponding to each program feedback text, and the text clustering method comprises the following steps:
performing content detection on the text content to obtain a content detection result of the text content;
based on the content detection result, adjusting the text content to obtain target text content;
and segmenting the target text content to obtain at least one text word corresponding to each program feedback text.
14. A text clustering device, comprising:
the system comprises an acquisition unit, a processing unit and a processing unit, wherein the acquisition unit is used for acquiring a plurality of program feedback texts of a target application, and the program feedback texts comprise at least one line of text content;
the extraction unit is used for carrying out text word segmentation on the text content to obtain at least one text word corresponding to each program feedback text, and carrying out feature extraction on the text word to obtain word features of the text word;
a determining unit, configured to determine, based on the word feature, an association weight of each text word in the program feedback text, where the association weight indicates context information of the text word in the program feedback text;
the feature transformation unit is used for carrying out feature transformation on the word features according to the association weight so as to obtain target text features of the feedback text of each program;
and the clustering unit is used for clustering the multiple program feedback texts based on the target text characteristics to obtain the text category of each program feedback text.
15. An electronic device comprising a processor and a memory, the memory storing an application, the processor configured to run the application in the memory to perform the steps in the text clustering method of any one of claims 1 to 13.
16. A computer program product comprising computer programs/instructions which, when executed by a processor, implement the steps of the text clustering method of any one of claims 1 to 13.
17. A computer readable storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps in the text clustering method of any one of claims 1 to 13.
CN202311231947.9A 2023-09-22 2023-09-22 Text clustering method, text clustering device, electronic equipment and computer readable storage medium Pending CN116975301A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311231947.9A CN116975301A (en) 2023-09-22 2023-09-22 Text clustering method, text clustering device, electronic equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN116975301A true CN116975301A (en) 2023-10-31

Family

ID=88471642

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311231947.9A Pending CN116975301A (en) 2023-09-22 2023-09-22 Text clustering method, text clustering device, electronic equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116975301A (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150095017A1 (en) * 2013-09-27 2015-04-02 Google Inc. System and method for learning word embeddings using neural language models
US9519871B1 (en) * 2015-12-21 2016-12-13 International Business Machines Corporation Contextual text adaptation
CN109101483A (en) * 2018-07-04 2018-12-28 浙江大学 A kind of wrong identification method for electric inspection process text
CN110008339A (en) * 2019-03-22 2019-07-12 武汉大学 A kind of profound memory network model and its classification method for target emotional semantic classification
CN111079409A (en) * 2019-12-16 2020-04-28 东北大学秦皇岛分校 Emotion classification method by using context and aspect memory information
CN111694935A (en) * 2020-04-26 2020-09-22 平安科技(深圳)有限公司 Multi-turn question and answer emotion determining method and device, computer equipment and storage medium
CN111832248A (en) * 2020-07-27 2020-10-27 科大讯飞股份有限公司 Text normalization method and device, electronic equipment and storage medium
CN112306787A (en) * 2019-07-24 2021-02-02 阿里巴巴集团控股有限公司 Error log processing method and device, electronic equipment and intelligent sound box
CN112579778A (en) * 2020-12-23 2021-03-30 重庆邮电大学 Aspect-level emotion classification method based on multi-level feature attention
CN112989054A (en) * 2021-04-26 2021-06-18 腾讯科技(深圳)有限公司 Text processing method and device
CN114416926A (en) * 2021-07-13 2022-04-29 北京金山数字娱乐科技有限公司 Keyword matching method and device, computing equipment and computer readable storage medium
CN115062143A (en) * 2022-05-20 2022-09-16 青岛海尔电冰箱有限公司 Voice recognition and classification method, device, equipment, refrigerator and storage medium

Similar Documents

Publication Publication Date Title
US20230039734A1 (en) Systems and methods of data augmentation for pre-trained embeddings
CN116561542B (en) Model optimization training system, method and related device
CN110502677B (en) Equipment identification method, device and equipment, and storage medium
CN110705255A (en) Method and device for detecting association relation between sentences
CN114372475A (en) Network public opinion emotion analysis method and system based on RoBERTA model
CN116361788A (en) Binary software vulnerability prediction method based on machine learning
CN115878750A (en) Information processing method, device, equipment and computer readable storage medium
CN113642727A (en) Training method of neural network model and processing method and device of multimedia information
CN116975711A (en) Multi-view data classification method and related equipment
CN116842936A (en) Keyword recognition method, keyword recognition device, electronic equipment and computer readable storage medium
CN116975301A (en) Text clustering method, text clustering device, electronic equipment and computer readable storage medium
CN115712719A (en) Data processing method, data processing device, computer readable storage medium and computer equipment
CN116415624A (en) Model training method and device, and content recommendation method and device
CN113705253A (en) Machine translation model performance detection method and related equipment
CN113704422A (en) Text recommendation method and device, computer equipment and storage medium
CN113821632A (en) Content classification method and device, electronic equipment and computer-readable storage medium
CN116992031B (en) Data processing method, device, electronic equipment, storage medium and program product
CN114707633B (en) Feature extraction method, device, electronic equipment and storage medium
US12032605B2 (en) Searchable data structure for electronic documents
US11727215B2 (en) Searchable data structure for electronic documents
US20230153335A1 (en) Searchable data structure for electronic documents
CN117216374A (en) Content recommendation method, content recommendation device, computer readable storage medium and computer equipment
US11934794B1 (en) Systems and methods for algorithmically orchestrating conversational dialogue transitions within an automated conversational system
CN117436930A (en) Object loss prediction method and device, electronic equipment and storage medium
CN116510306A (en) Game information processing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination