CN113254640A - Work order data processing method and device, electronic equipment and storage medium

Info

Publication number: CN113254640A
Application number: CN202110582019.1A
Authority: CN
Inventors: 易存道
Original assignee: Beijing Baolande Software Co ltd
Current assignee: Beijing Baolande Software Co ltd
Priority date: 2021-05-27
Filing date: 2021-05-27
Publication date: 2021-08-13

Abstract

The invention provides a work order data processing method, a work order data processing device, electronic equipment and a storage medium, wherein the method comprises the following steps: acquiring text data to be processed, wherein the text data to be processed refers to work order text data to be clustered; performing word segmentation and vectorization processing on the text data to be processed to obtain vector data of the text data to be processed; performing clustering analysis processing on the vector data based on a locality sensitive hashing algorithm, and dividing the vector data into a plurality of candidate sets; calculating the distance between each vector data in the candidate set; and when the distance between the vector data is confirmed to belong to a preset threshold range, classifying the vector data into the same category. The work order data processing method provided by the invention can be applied to the clustering processing of mass work order data, the speed of work order data processing is improved, and the timeliness of work order data processing is ensured.

Description

Work order data processing method and device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of data processing technologies, and in particular, to a method and an apparatus for processing work order data, an electronic device, and a storage medium.

Background

With the continued adoption of artificial intelligence, demand for intelligent analysis and processing of work order data grows daily; such analysis and processing can greatly reduce the workload of customer service personnel and improve work efficiency.

At present, the basic requirement of intelligent work order processing is to cluster the work order data and thereby find hot complaints, hot problems, hot options and the like among large volumes of work order information; once clustering is complete, these hot items have been found. In the prior art, work order text data are often clustered using the tf-idf algorithm and cosine similarity. This approach suits small data volumes, but it performs poorly on large volumes: execution is time-consuming, timeliness is low, and the user experience is poor. In addition, work order content in the prior art is described without a uniform standard, which easily degrades the clustering result.

Disclosure of Invention

The invention provides a method and a device for processing work order data, electronic equipment and a storage medium, which are used for solving the technical problems in the prior art that massive work order data cannot be processed well and that work order content is described without a unified standard, so as to improve the processing speed of massive work order data.

In a first aspect, the present invention provides a method for processing work order data, including:

acquiring text data to be processed, wherein the text data to be processed refers to work order text data to be clustered;

performing word segmentation and vectorization processing on the text data to be processed to obtain vector data of the text data to be processed;

performing clustering analysis processing on the vector data based on a locality sensitive hashing algorithm, and dividing the vector data into a plurality of candidate sets;

calculating the distance between each vector data in the candidate set;

and when the distance between the vector data is confirmed to belong to a preset threshold range, classifying the vector data into the same category.

According to the method for processing the work order data provided by the invention, the method further comprises the following steps:

and confirming the sequence of each category according to the quantity of the obtained vector data in each category.

According to the work order data processing method provided by the invention, the text data to be processed is obtained from a domain lexicon, wherein the domain lexicon is formed by data spliced from work order titles and solutions.

According to the method for processing the work order data, provided by the invention, the clustering analysis processing is carried out on the vector data based on the locality sensitive hashing algorithm, and the method comprises the following steps:

mapping high-dimensional vector data in the vector data based on a hash function to obtain low-dimensional vector data;

and bucketing the low-dimensional vector data to obtain the plurality of candidate sets.

According to the processing method of the work order data provided by the invention, before the word segmentation processing is carried out on the text data to be processed, the processing method comprises the following steps:

and removing the special characters in the text data to be processed, replacing the characters influencing the clustering effect with special marks, and acquiring the cleaned text data to be processed.

According to the processing method of the work order data provided by the invention, the word segmentation processing is carried out on the text data to be processed, and the word segmentation processing comprises the following steps:

and performing word segmentation on the cleaned text data to be processed based on a word segmentation tool to obtain each word segmentation of the text data to be processed.

According to the method for processing work order data provided by the invention, the vectorization processing of the text data to be processed comprises the following steps:

vectorizing the text data to be processed after word segmentation based on a word frequency-inverse document frequency mode to obtain the vector data.

In a second aspect, the present invention provides a processing apparatus for work order data, including:

the acquisition module is used for acquiring text data to be processed, wherein the text data to be processed refers to work order text data to be clustered;

the processing module is used for cleaning, word segmentation and vectorization processing on the text data to be processed to obtain vector data of the text data to be processed;

the clustering module is used for carrying out clustering analysis processing on the vector data based on a locality sensitive hashing algorithm and dividing the vector data into a plurality of candidate sets;

a calculation module for calculating the distance between each vector data in the candidate set;

and the classification module is used for classifying the vector data into the same category when the distance between the vector data is confirmed to belong to a preset threshold range.

In a third aspect, the present invention also provides an electronic device comprising: a processor, a memory, and a bus, wherein,

the processor and the memory are communicated with each other through the bus;

the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of the above.

In a fourth aspect, the invention also provides a non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of the above.

The method provided by the invention obtains text data to be processed, performs word segmentation and vectorization processing on the text data to obtain vector data, performs cluster analysis processing on the vector data based on a locality sensitive hashing algorithm to divide the vector data into a plurality of candidate sets, calculates the distance between each vector data in the candidate sets, and classifies the vector data into the same category when the distance between them is confirmed to belong to a preset threshold range. The work order data processing method provided by the invention can be applied to the clustering of massive work order data, improving the speed of work order data processing and ensuring its timeliness.

Drawings

In order to more clearly illustrate the technical solutions of the present invention or the prior art, the drawings needed for the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.

FIG. 1 is a schematic flow chart of a work order data processing method according to the present invention;

FIG. 2 is a schematic structural diagram of a work order data processing apparatus according to the present invention;

FIG. 3 is a schematic structural diagram of an electronic device provided in the present invention.

Detailed Description

In order to make the solution of the embodiment of the present invention easier to understand and better reflect the difference from the existing work order data clustering process, a basic work order clustering method in the prior art is first briefly described below.

In the prior art, clustering of work order text data is generally implemented with the tf-idf algorithm and cosine similarity, an approach suited to small data volumes. In many fields, however, the number of complaint work orders is not limited to under a million: the user base is huge, for example hundreds of millions of users, so even if few complaint work orders arrive each day, the accumulated complaint work order text data is likely to become very large over time. On the one hand, the large data volume directly increases the amount of computation, and the accompanying growth in the dimensionality of the text data increases it further, which harms execution efficiency and reduces timeliness. On the other hand, the prior art often uses the data of the work order content field, and the records in that field are not standardized, so the clustering of work order text data is unsatisfactory. The invention provides a method for processing work order data to solve these technical problems in the prior art.

Fig. 1 is a schematic flow chart of a work order data processing method provided by the present invention. As shown in fig. 1, the work order data processing method provided by the present invention includes:

step 101: acquiring text data to be processed, wherein the text data to be processed refers to work order text data to be clustered;

step 102: performing word segmentation and vectorization processing on the text data to be processed to obtain vector data of the text data to be processed;

step 103: performing clustering analysis processing on the vector data based on a locality sensitive hashing algorithm, and dividing the vector data into a plurality of candidate sets;

step 104: calculating the distance between each vector data in the candidate set;

step 105: and when the distance between the vector data is confirmed to belong to a preset threshold range, classifying the vector data into the same category.

Specifically, Locality-Sensitive Hashing (LSH) is a fast approximate nearest neighbor search technique for massive high-dimensional data. The basic idea of LSH is that two data points that are adjacent in the original data space remain adjacent in the new data space with high probability after the same mapping or projection transformation, whereas non-adjacent data points are, with high probability, still not adjacent after that transformation.

In step 101, the text data to be processed refers to work order text data to be clustered; for example, it may be complaint work order text data to be clustered. Clustering refers to dividing a set of data units into several subsets called clusters or categories, such that the data within each category are similar to one another; the division rests on the principle that like things group together.

In step 102, performing word segmentation and vectorization on the text data to be processed to obtain vector data of the text data to be processed, where word segmentation refers to splitting the text data into a plurality of words, and performing vectorization on each word based on a vectorization processing manner to obtain vector data corresponding to the text data to be processed.

In step 104, the distance between each vector data in the candidate set is calculated. In this embodiment, the distance between vectors is preferably the Jaccard distance, which is an index for measuring the difference between two sets. Other distance calculation methods may be used according to actual requirements, and the method is not specifically limited herein.
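As an illustration of the distance named above, the following is a minimal sketch of the Jaccard distance, 1 - |A ∩ B| / |A ∪ B|. The function name `jaccard_distance` and the choice to represent each work order by its set of segmented words are assumptions of this sketch; the source does not state how a vector is converted to a set.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two collections, each treated as a set."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Assumption of this sketch: two empty sets count as identical.
        return 0.0
    return 1.0 - len(set_a & set_b) / len(union)


# Hypothetical segmented work orders (illustrative data only).
doc_1 = ["phone", "cannot", "connect", "network"]
doc_2 = ["phone", "cannot", "make", "calls"]

# 2 shared words out of 6 distinct words, so the distance is 1 - 2/6.
distance = jaccard_distance(doc_1, doc_2)
print(round(distance, 4))
```

A distance of 0 means the two word sets are identical and a distance of 1 means they share no word.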

In step 105, when it is determined that the distance between each vector data in the candidate set belongs to a preset threshold range, each vector data is classified into the same category, where the preset threshold range may be set to 0.7-0.8, and may be specifically set according to actual needs, and is not specifically limited herein.
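Steps 104 and 105 can be sketched together as follows. This is only one possible reading: pairs inside a candidate set whose distance lies in a range [low, high] are merged into one category with a small union-find structure. The function names, the union-find approach, and the example bounds `low=0.0, high=0.7` are assumptions of this sketch; the source gives 0.7-0.8 as an example setting without specifying how the range is applied.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """Jaccard distance between two sets of segmented words."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return 0.0 if not union else 1.0 - len(set_a & set_b) / len(union)


def group_candidate_set(items, low, high):
    """Merge items whose pairwise distance lies in [low, high] (assumed rule)."""
    parent = list(range(len(items)))

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(items)), 2):
        if low <= jaccard_distance(items[i], items[j]) <= high:
            parent[find(i)] = find(j)

    groups = {}
    for index in range(len(items)):
        groups.setdefault(find(index), []).append(index)
    return sorted(groups.values())


# Hypothetical candidate set produced by the bucketing step.
candidate_set = [
    {"phone", "cannot", "connect", "network"},
    {"phone", "cannot", "connect", "wifi"},
    {"invoice", "amount", "wrong"},
]
categories = group_candidate_set(candidate_set, low=0.0, high=0.7)
print(categories)
```

Only pairs inside one candidate set are compared, which is where the reduction in computation comes from.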

In the embodiment, text data to be processed is obtained, word segmentation and vectorization are carried out on the text data to be processed to obtain vector data, clustering analysis is carried out on the vector data based on a locality sensitive hashing algorithm, and the vector data is divided into a plurality of candidate sets; and calculating the distance between the vector data in the candidate set, and classifying the vector data with the distance within a preset threshold range into the same category.

In the embodiment of the invention, the massive data is divided into a plurality of candidate sets based on the locality sensitive hashing algorithm, and then the distance calculation is carried out on the vector data in each candidate set, so that the calculation complexity is reduced, the work order data processing speed is increased, and the accuracy and the timeliness of the work order data are ensured.

In another embodiment of the present invention, the order of the respective categories is confirmed according to the number of vector data in the acquired respective categories.

Specifically, the order of the categories is determined according to the number of vector data in each category; for example, the category with the largest number of vector data is determined as the first category, the category with the second-largest number as the second category, and so on.

In this embodiment, hot complaints can be determined according to the order of the categories, where a hot complaint refers to a problem that customers complain about relatively frequently and that occurs often. There may be a plurality of hot complaints, for example, a mobile phone being unable to make or receive calls, a mobile phone being unable to connect to the network, or the network repeatedly disconnecting. Hot problems, hot options and the like may also be determined according to the order of the categories, and are not specifically limited herein.

In the embodiment of the invention, the hot complaints in the corresponding fields are confirmed according to the quantity of the vector data in each acquired category, and the categories with more data quantity obtained through statistics are returned to the front end to be used as the hot complaint display, so that customer service staff can be helped to find out corresponding solutions in time according to the problems of customers, and the working efficiency is improved.
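The ordering of categories by size described in this embodiment can be sketched as below. The function name `rank_categories`, the dictionary layout, and the category labels are hypothetical and chosen only for illustration.

```python
def rank_categories(categories):
    """Return (category_name, size) pairs, largest category first."""
    sizes = [(name, len(members)) for name, members in categories.items()]
    # sorted() is stable, so categories of equal size keep their input order.
    return sorted(sizes, key=lambda pair: pair[1], reverse=True)


# Hypothetical clustering result: category label -> indices of its work orders.
categories = {
    "billing error": [3, 8],
    "cannot connect to network": [0, 1, 4, 7, 9],
    "dropped calls": [2, 5, 6],
}
ranking = rank_categories(categories)
for position, (name, size) in enumerate(ranking, start=1):
    print(position, name, size)
```

The first entries of `ranking` would be the ones returned to the front end as hot complaints.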

In another embodiment of the present invention, the text data to be processed is obtained from a domain lexicon, wherein the domain lexicon is formed by data spliced by work order titles and solutions.

Specifically, in order to realize standardization of work order data, two field data of a "work order title" and a "solution" in complaint work order text data are spliced, and the spliced data form a field lexicon. The expression of the two field data of the work order title and the solution tends to be standardized, keywords for expressing the theme are not lost, and the information of the corresponding work order data can be accurately represented.

In this embodiment, the text data to be processed is obtained from a domain lexicon, where the domain lexicon is formed by data obtained by splicing two field data, i.e., a work order title and a solution.

For example, complaint work order text data (assume 2000 records) is obtained, roughly in the following form: { work order title: [customer complaint: poor signal], work order content: [user A called in to complain that the mobile phone cannot make calls or receive messages, which has seriously affected the user's work and life, and hopes the matter will be followed up and resolved, address: ], solution: [after verification, the area is currently under maintenance, and the customer was informed to wait for the maintenance to complete] }. Analysis of the work order text data shows that the 'work order content' field carries much information irrelevant to the complaint itself, which easily degrades the clustering of the work order text data.

According to the method, the data content obtained by splicing the two fields of the work order title and the solution is adopted, all the data obtained after splicing processing form the field word stock, the method can ensure that the keywords of the work order theme cannot be lost, the work order title and the solution are accurately expressed, and the information of the complaint work order can be accurately represented.
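The splicing of the two fields can be sketched as follows. The key names `title`, `content` and `solution`, the function name, and the single-space separator are assumptions of this sketch; the source only states that the work order title and the solution are spliced.

```python
def build_domain_lexicon(work_orders):
    """Splice each work order's title and solution; ignore all other fields."""
    lexicon = []
    for order in work_orders:
        # "title" and "solution" are hypothetical key names for the two fields.
        spliced = f"{order['title']} {order['solution']}".strip()
        lexicon.append(spliced)
    return lexicon


# Hypothetical record modelled on the example work order above.
work_orders = [
    {
        "title": "customer complaint: poor signal",
        "content": "user A called in, cannot make calls or receive messages",
        "solution": "area under maintenance, customer informed to wait",
    },
]
lexicon = build_domain_lexicon(work_orders)
print(lexicon[0])
```

The free-text `content` field never reaches the lexicon, so information unrelated to the complaint does not influence the clustering.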

In another embodiment of the present invention, performing the cluster analysis processing on the vector data based on the locality sensitive hashing algorithm includes:

mapping high-dimensional vector data in the vector data based on a hash function to obtain low-dimensional vector data;

and bucketing the low-dimensional vector data to obtain the plurality of candidate sets.

Specifically, a hash function refers to a function that maps the keys of elements in a hash table to the storage locations of those elements.

In this embodiment, high-dimensional vector data in the vector data is mapped based on a hash function to obtain low-dimensional vector data, and the low-dimensional vector data is bucketed to obtain a plurality of candidate sets. This reduces the amount of computation caused by the dimensionality of the text vectors and increases the speed of processing the work order data. Specific examples follow.

For example, for high-dimensional vector data, a special hash function is used to map two data with high similarity into the same hash value with high probability, and map two data with low similarity into the same hash value with very low probability, so as to realize the conversion from the high-dimensional data to the low-dimensional data.

The low-dimensional vector data are then bucketed based on LSH, which reduces the raw amount of computation. For example, if the data consist of 10 pieces of vector data, computing the similarity between every pair by brute force may take 10 × 10 calculations, and LSH is used to reduce this amount. Specifically, if each vector data has 100 dimensions, the dimensions are divided, for example, into 5 parts of 20 dimensions each, numbered 1, 2, 3, 4 and 5, and the parts are compared in that order: if the features of two vector data in part No. 1 are identical, the two vector data are placed in the same bucket; otherwise part No. 2 is examined, and so on until part No. 5 has been examined. Each bucket represents one candidate set, and two vector data that fall into the same bucket can be considered correlated.

In the embodiment, after the buckets are divided, the candidate sets are divided, and when the distance between each pair of vector data is calculated subsequently, only the distance between each pair of vector data in each candidate set needs to be calculated, so that the calculation amount is greatly reduced.
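The bucketing described above can be sketched as follows. The sketch assumes that the low-dimensional vectors are integer-valued signatures, so that "identical features in a part" is a meaningful test, and it registers every vector under all of its parts instead of stopping at the first matching part; both are simplifications of this sketch. The function name `band_buckets` is hypothetical.

```python
from collections import defaultdict


def band_buckets(signatures, bands):
    """Bucket signatures that agree exactly on at least one part (band)."""
    length = len(signatures[0])
    if length % bands != 0:
        raise ValueError("signature length must be divisible by bands")
    rows = length // bands
    buckets = defaultdict(list)
    for index, signature in enumerate(signatures):
        for band in range(bands):
            chunk = tuple(signature[band * rows:(band + 1) * rows])
            # The bucket key combines the part number with that part's values.
            buckets[(band, chunk)].append(index)
    # Only buckets holding two or more vectors yield candidate pairs.
    return [members for members in buckets.values() if len(members) > 1]


# Toy signatures of 6 dimensions split into 3 parts of 2 dimensions each.
signatures = [
    [1, 2, 3, 4, 5, 6],
    [1, 2, 9, 9, 8, 8],  # agrees with the first signature on part 0
    [7, 7, 7, 7, 7, 7],  # agrees with no other signature on any part
]
candidate_sets = band_buckets(signatures, bands=3)
print(candidate_sets)
```

With 100 dimensions and 5 parts, as in the text, the same function would be called with `bands=5`, giving parts of 20 dimensions each.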

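In another embodiment of the present invention, before performing word segmentation processing on the text data to be processed, the method includes:

removing the special characters in the text data to be processed, replacing the characters influencing the clustering effect with special marks, and acquiring the cleaned text data to be processed.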

Specifically, the data cleaning means that some characters which do not meet the requirements of subsequent word segmentation processing are removed or replaced by other methods, so that the subsequent word segmentation is ensured to be smoothly performed.

In this embodiment, special characters such as punctuation marks and spaces are removed from the text data to be processed, and characters that influence the clustering effect are replaced with special marks to obtain the cleaned text data to be processed; for example, numbers, dates or longer non-Chinese character strings in the text data to be processed are replaced with special words such as number, date, ordernumber, jobnumber and the like.

In the embodiment, some text data which cannot be subjected to word segmentation directly is cleaned, and the condition of word segmentation is met after cleaning, so that smooth processing of work order data clustering is ensured.
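A minimal sketch of such a cleaning step, using Python's standard `re` module, is given below. The concrete patterns, the 8-character cutoff for "longer" strings, the placeholder spellings, and the function name `clean_text` are assumptions of this sketch. It also collapses whitespace to single spaces instead of deleting it, so that the placeholders stay separated in the English example; for Chinese text the spaces could simply be removed.

```python
import re

# Order matters: dates and long strings must be replaced before bare numbers.
DATE = re.compile(r"\d{4}[-/.]\d{1,2}[-/.]\d{1,2}")
LONG_STRING = re.compile(r"[A-Za-z0-9]{8,}")  # e.g. order or job numbers
NUMBER = re.compile(r"\d+")
PUNCTUATION = re.compile(r"[^\w\s]")


def clean_text(text):
    """Replace noisy tokens with placeholders and strip punctuation."""
    text = DATE.sub(" date ", text)
    # In Chinese text, long Latin/digit runs are typically identifiers.
    text = LONG_STRING.sub(" ordernumber ", text)
    text = NUMBER.sub(" number ", text)
    text = PUNCTUATION.sub(" ", text)
    # Collapse runs of whitespace (an assumption; see the note above).
    return " ".join(text.split())


raw = "Order WO20210527001 failed on 2021-05-27, retried 3 times!!"
cleaned = clean_text(raw)
print(cleaned)
```

Replacing dates and identifiers with a shared placeholder keeps two work orders that differ only in such details from looking dissimilar.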

In an embodiment of the present invention, performing word segmentation processing on the text data to be processed includes:

performing word segmentation on the cleaned text data to be processed based on a word segmentation tool to obtain each word segmentation of the text data to be processed.

Specifically, a word segmentation tool refers to a tool for performing word segmentation processing on text data. There are many such tools, for example the HanLP word segmenter, a Chinese word segmenter, the LTP word segmenter, the KCWS word segmenter, and the like.

In this embodiment, preferably, the word segmentation is performed on the cleaned text data to be processed, so as to obtain each word segmentation of the text data to be processed, and after the word segmentation is performed on the text data to be processed, the work order text data can be better vectorized, so that the purpose of clustering the work order data is achieved.

In another embodiment of the present invention, the vectorizing processing of the text data to be processed includes:

vectorizing the text data to be processed after word segmentation based on a word frequency-inverse document frequency mode to obtain the vector data.

Specifically, in tf-idf (term frequency-inverse document frequency) vectorization, tf (term frequency) is the number of times a word appears in a document divided by the total number of words in that document; idf (inverse document frequency) measures the general importance of a word and equals the logarithm of the total number of documents divided by the number of documents containing the word; the two are multiplied, i.e., tf-idf = tf × idf.

In this embodiment, a word frequency-inverse document frequency mode is adopted to perform vectorization processing on text data to be processed, which has been subjected to word segmentation processing, so as to obtain vectorized text data. It should be noted that, in the present embodiment, the vectorization processing is performed by using a word frequency-inverse document frequency manner, which is not limited to this manner, and different vectorization processing manners may be selected according to actual needs, and are not limited in detail here.

For example, if the data set has 100 documents, of which 10 contain the word 'server', and the word 'server' appears 3 times in document A, which has 100 words, then the word frequency of 'server' in document A is 3/100 = 0.03 and the inverse document frequency is log(100/10) = 1 (taking the logarithm to base 10), so the tf-idf value of the word 'server' in document A is 0.03 × 1 = 0.03.
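The arithmetic of this example can be checked with a short sketch. The helper names `term_frequency` and `inverse_document_frequency` are hypothetical, and the base-10 logarithm is inferred from the example's result log(100/10) = 1; the source does not name the base.

```python
import math


def term_frequency(term_count, total_terms):
    """Occurrences of the word divided by the word count of the document."""
    return term_count / total_terms


def inverse_document_frequency(total_docs, docs_with_term):
    """Base-10 logarithm of total documents over documents with the word."""
    return math.log10(total_docs / docs_with_term)


# Figures from the example: 'server' appears 3 times in a 100-word document,
# and 10 of the 100 documents contain it.
tf = term_frequency(3, 100)
idf = inverse_document_frequency(100, 10)
tfidf = tf * idf
print(tf, idf, round(tfidf, 4))
```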

For another example, a text vector is constructed using tf-idf. Suppose data set b contains two documents: ['linux system is generally used for servers', 'windows system is generally used for individuals']. All documents in data set b are first segmented and the words counted to obtain the vocabulary [linux, system, generally, for, server, windows, person]. The vocabulary contains 7 words, i.e., its length is 7, and each constructed text vector must have the same length as the vocabulary. Taking the sentence 'linux system is generally used for servers' as an example, it is first segmented into [linux, system, generally, for, server], so its vector is [tfidf(linux), tfidf(system), tfidf(generally), tfidf(for), tfidf(server), 0, 0]; the vocabulary words 'windows' and 'person' do not occur in this sentence, so their positions are filled with 0 to keep the vector length consistent with the vocabulary. Likewise, the vector for the sentence 'windows system is generally used for individuals' is [0, tfidf(system), tfidf(generally), tfidf(for), 0, tfidf(windows), tfidf(person)].
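The construction of fixed-length text vectors can be sketched as follows, starting from already segmented documents. The function names are hypothetical, and the base-10, unsmoothed idf is an assumption carried over from the preceding numeric example. Note that under this definition a word occurring in every document, such as 'system', has idf = log(2/2) = 0, so its entry evaluates to 0 even though the word is present.

```python
import math


def build_vocabulary(documents):
    """Collect distinct words in order of first appearance."""
    vocabulary = []
    for tokens in documents:
        for token in tokens:
            if token not in vocabulary:
                vocabulary.append(token)
    return vocabulary


def tfidf_vectors(documents):
    """Return the vocabulary and one tf-idf vector per segmented document."""
    vocabulary = build_vocabulary(documents)
    total_docs = len(documents)
    doc_freq = {
        term: sum(term in tokens for tokens in documents)
        for term in vocabulary
    }
    vectors = []
    for tokens in documents:
        vector = []
        for term in vocabulary:
            tf = tokens.count(term) / len(tokens)  # 0 when the word is absent
            idf = math.log10(total_docs / doc_freq[term])
            vector.append(tf * idf)
        vectors.append(vector)
    return vocabulary, vectors


# The two segmented documents of data set b.
documents = [
    ["linux", "system", "generally", "for", "server"],
    ["windows", "system", "generally", "for", "person"],
]
vocabulary, vectors = tfidf_vectors(documents)
print(vocabulary)
print([round(value, 4) for value in vectors[0]])
```

Both vectors have length 7, matching the vocabulary, and the positions of words absent from a sentence hold 0.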

In the embodiment of the invention, the vectorization processing is carried out on the text data after the word segmentation processing to obtain the vector data, so that the method can be better applied to the clustering processing, and the processing speed of the work order data is improved.

Fig. 2 is a schematic structural diagram of a work order data processing apparatus according to the present invention. As shown in fig. 2, the work order data processing apparatus provided by the present invention includes:

the acquiring module 201 is configured to acquire text data to be processed, where the text data to be processed refers to work order text data to be clustered;

the processing module 202 is configured to perform word segmentation and vectorization on the to-be-processed text data to obtain vector data of the to-be-processed text data;

the clustering module 203 is configured to perform clustering analysis processing on the vector data based on a locality sensitive hashing algorithm, and divide the vector data into a plurality of candidate sets;

a calculating module 204, configured to calculate a distance between each vector data in the candidate set;

and the classifying module 205 is configured to classify the vector data into the same category when it is determined that the distance between the vector data belongs to a preset threshold range.

Specifically, the vectorization processing refers to performing word vector conversion on the text data to be processed after word segmentation to obtain a word vector of each word.

In the embodiment of the invention, the text data to be processed is obtained through the obtaining module, and the processing module carries out word segmentation and vectorization processing on the text data to be processed to obtain vector data; the clustering module carries out clustering analysis on the vector data based on a locality sensitive hashing algorithm, the vector data are divided into a plurality of candidate sets, the calculating module is used for calculating the distance between the vector data in each candidate set, and the classifying module is used for classifying the vector data into the same category when the distance between the vector data belongs to a preset threshold range. The device can be applied to massive work order data, reduces the complexity of calculation, improves the processing speed of the work order data, and ensures the timeliness of work order data processing.

Since the principle of the apparatus according to the embodiment of the present invention is the same as that of the method according to the above embodiment, further details are not repeated herein.

Fig. 3 is a schematic structural diagram of an electronic device according to an embodiment of the present invention. As shown in fig. 3, the present invention provides an electronic device, including: a processor 301, a memory 302, and a bus 303;

wherein, the processor 301 and the memory 302 complete the communication with each other through the bus 303;

processor 301 is configured to call program instructions in memory 302 to perform the methods provided by the various method embodiments described above, including, for example: acquiring text data to be processed, wherein the text data to be processed refers to work order text data to be clustered; performing word segmentation and vectorization processing on the text data to be processed to obtain vector data of the text data to be processed; performing clustering analysis processing on the vector data based on a locality sensitive hashing algorithm, and dividing the vector data into a plurality of candidate sets; calculating the distance between each vector data in the candidate set; and when the distance between the vector data is confirmed to belong to a preset threshold range, classifying the vector data into the same category.

The present embodiments provide a non-transitory computer-readable storage medium storing computer instructions that cause the computer to perform the methods provided by the above method embodiments, for example, including: acquiring text data to be processed, wherein the text data to be processed refers to work order text data to be clustered; performing word segmentation and vectorization processing on the text data to be processed to obtain vector data of the text data to be processed; performing clustering analysis processing on the vector data based on a locality sensitive hashing algorithm, and dividing the vector data into a plurality of candidate sets; calculating the distance between each vector data in the candidate set; and when the distance between the vector data is confirmed to belong to a preset threshold range, classifying the vector data into the same category.

Those of ordinary skill in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

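1. A method for processing work order data is characterized by comprising the following steps:

acquiring text data to be processed, wherein the text data to be processed refers to work order text data to be clustered;

performing word segmentation and vectorization processing on the text data to be processed to obtain vector data of the text data to be processed;

performing clustering analysis processing on the vector data based on a locality sensitive hashing algorithm, and dividing the vector data into a plurality of candidate sets;

calculating the distance between each vector data in the candidate set;

and when the distance between the vector data is confirmed to belong to a preset threshold range, classifying the vector data into the same category.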

2. The method of processing work order data as claimed in claim 1, further comprising:

and confirming the sequence of each category according to the quantity of the acquired vector data in each category.

3. The method for processing work order data according to claim 1, wherein the text data to be processed is obtained from a domain lexicon, wherein the domain lexicon is composed of data spliced from work order titles and solutions.

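4. The method for processing work order data according to claim 1, wherein the cluster analysis processing of the vector data based on the locality sensitive hashing algorithm comprises:

mapping high-dimensional vector data in the vector data based on a hash function to obtain low-dimensional vector data;

and bucketing the low-dimensional vector data to obtain the plurality of candidate sets.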

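5. The method for processing work order data according to claim 1, wherein before performing word segmentation processing on the text data to be processed, the method comprises:

removing the special characters in the text data to be processed, replacing the characters influencing the clustering effect with special marks, and acquiring the cleaned text data to be processed.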

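6. The method for processing work order data according to claim 5, wherein the performing word segmentation processing on the text data to be processed comprises:

performing word segmentation on the cleaned text data to be processed based on a word segmentation tool to obtain each word segmentation of the text data to be processed.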

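7. The method for processing work order data according to claim 6, wherein the vectorizing the text data to be processed includes:

vectorizing the text data to be processed after word segmentation based on a word frequency-inverse document frequency mode to obtain the vector data.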

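8. A work order data processing apparatus, comprising:

the acquisition module is used for acquiring text data to be processed, wherein the text data to be processed refers to work order text data to be clustered;

the processing module is used for performing word segmentation and vectorization processing on the text data to be processed to obtain vector data of the text data to be processed;

the clustering module is used for carrying out clustering analysis processing on the vector data based on a locality sensitive hashing algorithm and dividing the vector data into a plurality of candidate sets;

a calculation module for calculating the distance between each vector data in the candidate set;

and the classification module is used for classifying the vector data into the same category when the distance between the vector data is confirmed to belong to a preset threshold range.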

9. An electronic device, comprising: a processor, a memory, and a bus, wherein,

the processor and the memory are communicated with each other through the bus;

the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1 to 7.

10. A non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of any one of claims 1-7.