CN106528768A - Consultation hotspot analysis method and device - Google Patents

Consultation hotspot analysis method and device

Info

Publication number
CN106528768A
Authority
CN
China
Prior art keywords
document
consulting
word
center
initial clustering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610974447.8A
Other languages
Chinese (zh)
Inventor
陈雨泽
刘建
赵加奎
朱平飞
欧阳红
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
Beijing China Power Information Technology Co Ltd
Original Assignee
State Grid Corp of China SGCC
State Grid Information and Telecommunication Co Ltd
Beijing China Power Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Corp of China SGCC, State Grid Information and Telecommunication Co Ltd, Beijing China Power Information Technology Co Ltd filed Critical State Grid Corp of China SGCC
Priority to CN201610974447.8A priority Critical patent/CN106528768A/en
Publication of CN106528768A publication Critical patent/CN106528768A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a consultation hotspot analysis method and device. The method comprises: extracting k consultation documents from a plurality of consultation documents and taking them as the initial clustering centers of k document classes; calculating the similarity between each of the other consultation documents and each document class; classifying each consultation document into the document class with the maximum similarity, so that the obtained consultation documents are classified automatically; and extracting keywords from the consultation documents of a class to obtain statistics for that class and the consultation hotspots corresponding to those documents. Compared with the prior art, in which consultation hotspots can be obtained and analyzed only after the consultation documents have been classified and counted manually, the method and device improve the efficiency of classifying consultation documents.

Description

Consultation hotspot analysis method and device
Technical field
The present invention relates to the field of intelligent technology, and in particular to a consultation hotspot analysis method and device.
Background
In the prior art, the State Grid customer service center first obtains the customer's consultation content through channels such as the 95598 customer service hotline, a mobile app, and text messages, and then enters the consultation content to generate a consultation document. After the consultation document is generated, service personnel such as operators classify it according to the customer's actual consultation category and import the result into a database. From the data in the database, statistics are generated on the consultation content, the consultation categories, and the number of consultation documents in each category. Consultation hotspots are then obtained from these statistics and analyzed.
However, as customers' consultation content becomes increasingly varied and the number of consultation documents keeps growing, relying solely on manual work to classify a large number of consultation documents, compile statistics on the classified documents, derive consultation hotspots, and analyze them makes efficient hotspot analysis impossible.
Summary of the invention
The present invention provides a consultation hotspot analysis method and device, to solve the problem that the prior art cannot analyze consultation hotspots efficiently, comprehensively, and in a timely manner.
To achieve the above object, the technical solution of the present invention is as follows:
The invention provides a consultation hotspot analysis method, comprising:
obtaining a plurality of consultation documents;
extracting k consultation documents from the plurality of consultation documents, and taking the k consultation documents respectively as the initial clustering centers of k document classes, where k is a positive integer;
calculating the similarity between each of the other consultation documents in the plurality of consultation documents, apart from the k consultation documents, and each initial clustering center;
for each of the other consultation documents, apart from the k consultation documents, obtaining, among its similarities, the initial clustering center corresponding to the maximum similarity;
classifying each of the other consultation documents, apart from the k consultation documents, into the document class of the initial clustering center corresponding to its maximum similarity;
extracting the keywords of each consultation document in a document class to obtain the consultation hotspot corresponding to that document class;
analyzing the consultation hotspot.
Preferably, calculating the similarity between each of the other consultation documents, apart from the k consultation documents, and each initial clustering center comprises:
performing word segmentation on each consultation document to obtain a plurality of consultation words for each consultation document;
extracting keywords from the consultation words to obtain the keywords of each consultation document;
calculating, according to the keywords, the similarity between each of the other consultation documents, apart from the k consultation documents, and each initial clustering center.
Preferably, performing word segmentation on each consultation document to obtain a plurality of consultation words for each consultation document comprises:
performing atom segmentation on the original character string contained in each consultation document to obtain an atom segmentation result;
performing N-shortest-path rough segmentation on the atom segmentation result to obtain N word segmentation results, the N word segmentation results being stored in the form of a binary word segmentation table, where the words contained in each word segmentation result are connected to one another;
calculating the first distance of every path that exists between the word at one end of the binary word segmentation table and the word at the other end of the binary word segmentation table;
taking the words contained in the path with the minimum first distance as the consultation words.
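The segmentation steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: it keeps only the single lowest-cost path (the N = 1 case of N-shortest-path rough segmentation), treats each character as an atom, and uses a hypothetical `lexicon` of word costs plus a hypothetical `unknown_cost` for atoms missing from the lexicon. The function name `segment` is also an assumption.

```python
def segment(text, lexicon, unknown_cost=10.0):
    """Return (words, cost) for the lowest-cost segmentation of `text`.

    Atoms are single characters. An edge from position i to j exists when
    text[i:j] is a lexicon word, or when j == i + 1 (single-atom fallback).
    `lexicon` maps a word to its cost; all costs here are illustrative.
    """
    n = len(text)
    max_len = max((len(w) for w in lexicon), default=1)
    best = [float("inf")] * (n + 1)  # best[i]: minimum cost to reach position i
    back = [0] * (n + 1)             # back[i]: start index of the last word on that path
    best[0] = 0.0
    for i in range(n):
        for j in range(i + 1, min(n, i + max_len) + 1):
            word = text[i:j]
            if word in lexicon:
                cost = lexicon[word]
            elif j == i + 1:
                cost = unknown_cost  # unknown single atom
            else:
                continue
            if best[i] + cost < best[j]:
                best[j] = best[i] + cost
                back[j] = i
    # Walk the back-pointers from the end to recover the words on the best path.
    words, j = [], n
    while j > 0:
        i = back[j]
        words.append(text[i:j])
        j = i
    return words[::-1], best[n]


words, cost = segment("billquery", {"bill": 1.0, "query": 1.0})
# words == ["bill", "query"], cost == 2.0
```

A Latin-alphabet string is used only to keep the example readable; for Chinese text each character would be one atom, and the edge costs would typically come from corpus statistics rather than hand-picked values.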
Preferably, extracting keywords from the consultation words to obtain the keywords of each consultation document comprises:
counting the number of times each consultation word occurs in the consultation document;
normalizing the number of times each consultation word occurs in the consultation document to obtain the term frequency of each consultation word;
counting, in a corpus, the number of documents that contain each consultation word;
calculating the inverse document frequency of each consultation word from the total number of documents in the corpus and the number of documents that contain the consultation word;
multiplying the term frequency of each consultation word by its inverse document frequency to obtain the frequency calculation result of each consultation word;
selecting, as the keywords of the consultation document, the consultation words whose frequency calculation result is greater than a preset threshold.
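The keyword extraction steps above correspond to a TF-IDF calculation, which might look as follows. The source does not state the exact normalization or inverse document frequency formula, so this sketch assumes term frequency = count / document length and IDF = log(N / (1 + df)); the function name `tfidf_keywords` and its parameters are hypothetical, and the corpus is assumed to be non-empty.

```python
import math
from collections import Counter


def tfidf_keywords(doc_words, corpus, threshold):
    """Return {word: score} for words whose TF-IDF score exceeds `threshold`.

    doc_words: consultation words of one document (after segmentation).
    corpus:    non-empty list of documents, each a collection of words.
    """
    counts = Counter(doc_words)
    total = len(doc_words)
    scores = {}
    for word, count in counts.items():
        tf = count / total                           # normalized occurrence count
        df = sum(1 for d in corpus if word in d)     # documents containing the word
        idf = math.log(len(corpus) / (1 + df))       # assumed IDF formula
        scores[word] = tf * idf
    return {w: s for w, s in scores.items() if s > threshold}


corpus = [["bill", "query"], ["outage", "report"],
          ["bill", "payment"], ["meter", "reading"]]
keywords = tfidf_keywords(["outage", "outage", "bill"], corpus, threshold=0.2)
# Only "outage" passes the threshold: it is frequent in the document and rare in the corpus.
```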
Preferably, calculating, according to the keywords, the similarity between each of the other consultation documents, apart from the k consultation documents, and each initial clustering center comprises:
obtaining all the keywords contained in the two consultation documents whose similarity is to be calculated, where one of the two consultation documents is one of the k consultation documents serving as initial clustering centers, and the other is a consultation document other than the k consultation documents;
counting the number of times each of these keywords occurs in the consultation document serving as an initial clustering center;
obtaining, from those counts, the term frequency vector of all the keywords in the consultation document serving as an initial clustering center;
counting the number of times each of these keywords occurs in the consultation document other than the k consultation documents;
obtaining, from those counts, the term frequency vector of all the keywords in the consultation document other than the k consultation documents;
calculating, using the law of cosines, the cosine of the angle between the term frequency vector of the consultation document serving as an initial clustering center and the term frequency vector of the consultation document other than the k consultation documents;
where the cosine of the angle represents the similarity between the consultation document serving as an initial clustering center and the consultation document other than the k consultation documents.
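A short Python sketch of this similarity calculation follows. It builds the two term frequency vectors over the union of the keywords of both documents and returns the cosine of the angle between them; the function name `cosine_similarity` and the sample keywords are illustrative assumptions.

```python
import math
from collections import Counter


def cosine_similarity(keywords_a, keywords_b):
    """Cosine of the angle between the term frequency vectors of two documents."""
    vocab = sorted(set(keywords_a) | set(keywords_b))  # all keywords of both documents
    count_a, count_b = Counter(keywords_a), Counter(keywords_b)
    vec_a = [count_a[w] for w in vocab]                # term frequency vector of document A
    vec_b = [count_b[w] for w in vocab]                # term frequency vector of document B
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm = math.sqrt(sum(x * x for x in vec_a)) * math.sqrt(sum(y * y for y in vec_b))
    return dot / norm if norm else 0.0


sim = cosine_similarity(["bill", "bill", "query"], ["bill", "query", "query"])
# Vectors are [2, 1] and [1, 2], so sim is 4 / 5 = 0.8 (up to floating-point rounding).
```

Because term counts are non-negative, the result lies between 0 (no shared keywords) and 1 (identical keyword proportions).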
Preferably, obtaining, among the similarities of each of the other consultation documents, apart from the k consultation documents, the initial clustering center corresponding to the maximum similarity comprises:
calculating, using the similarity, the second distance from each of the other consultation documents, apart from the k consultation documents, to each initial clustering center;
obtaining the initial clustering center corresponding to the minimum second distance;
where the smaller the second distance, the greater the similarity.
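The source does not give a formula for deriving the second distance from the similarity. One common choice, assumed here purely for illustration, is distance = 1 - similarity, which satisfies the stated property that a smaller distance means a greater similarity. The function name `nearest_center` is hypothetical.

```python
def nearest_center(similarities):
    """Index of the clustering center with the minimum distance.

    similarities[c] is the similarity between one document and center c.
    Distance is assumed to be 1 - similarity.
    """
    distances = [1.0 - s for s in similarities]
    return min(range(len(distances)), key=distances.__getitem__)


center = nearest_center([0.2, 0.9, 0.5])
# center == 1: the center with the highest similarity has the smallest distance.
```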
Preferably, classifying each of the other consultation documents, apart from the k consultation documents, into the document class of the initial clustering center corresponding to the maximum similarity comprises:
assigning each of the other consultation documents, apart from the k consultation documents, to the document class of the initial clustering center corresponding to its minimum second distance;
judging whether every consultation document other than those selected as initial clustering centers has been assigned to one of the k document classes;
if so, recalculating the center of each of the k document classes to obtain the centers of k first clusters;
comparing whether the centers of the k first clusters are the same as the k initial clustering centers;
if they differ, taking the centers of the k first clusters as the new cluster centers of the k document classes;
calculating, using the similarity, the third distance from each of the plurality of consultation documents to each new cluster center;
assigning each of the plurality of consultation documents to the document class of the new cluster center corresponding to its minimum third distance;
judging whether all of the plurality of consultation documents have been assigned to the k document classes;
if so, recalculating the center of each of the k document classes to obtain the centers of k second clusters;
comparing whether the centers of the k second clusters are the same as the centers of the previous clustering;
if they differ, taking the centers of the k second clusters as the new cluster centers of the k document classes;
returning to the step of calculating, using the similarity, the third distance from each of the plurality of consultation documents to each new cluster center.
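The iteration described above follows the general shape of k-means clustering and can be sketched as below. Several details are assumptions rather than statements from the source: documents are represented as term frequency dictionaries, the first k documents serve as the initial centers, a class center is recalculated as the mean vector of its members, distance is taken as 1 - cosine similarity, and a hypothetical `max_iter` bounds the loop. For brevity the sketch assigns all documents in every pass, including the first.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse term frequency vectors (dicts)."""
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of sparse vectors."""
    total = {}
    for vec in vectors:
        for term, value in vec.items():
            total[term] = total.get(term, 0.0) + value
    return {term: value / len(vectors) for term, value in total.items()}


def cluster(docs, k, max_iter=100):
    """Assign each document to one of k classes; return (labels, centers)."""
    centers = [dict(d) for d in docs[:k]]          # initial clustering centers
    labels = []
    for _ in range(max_iter):
        # Assign every document to the center with the minimum distance.
        labels = [min(range(k), key=lambda c: 1.0 - cosine(d, centers[c]))
                  for d in docs]
        # Recalculate the center of each document class.
        new_centers = []
        for c in range(k):
            members = [d for d, label in zip(docs, labels) if label == c]
            new_centers.append(mean_vector(members) if members else centers[c])
        if new_centers == centers:                 # centers unchanged: stop
            break
        centers = new_centers
    return labels, centers


docs = [{"bill": 2, "query": 1}, {"outage": 3},
        {"bill": 1, "query": 2}, {"outage": 1, "report": 1}]
labels, centers = cluster(docs, k=2)
# labels == [0, 1, 0, 1]: billing documents in one class, outage documents in the other.
```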
The present invention also provides a consultation hotspot analysis device, comprising:
a first acquisition unit, configured to obtain a plurality of consultation documents;
a first extraction unit, configured to extract k consultation documents from the plurality of consultation documents and take the k consultation documents respectively as the initial clustering centers of k document classes, where k is a positive integer;
a first calculation unit, configured to calculate the similarity between each of the other consultation documents in the plurality of consultation documents, apart from the k consultation documents, and each initial clustering center;
a second acquisition unit, configured to obtain, among the similarities of each of the other consultation documents, apart from the k consultation documents, the initial clustering center corresponding to the maximum similarity;
a first classification unit, configured to classify each of the other consultation documents, apart from the k consultation documents, into the document class of the initial clustering center corresponding to its maximum similarity;
a second extraction unit, configured to extract the keywords of each consultation document in a document class to obtain the consultation hotspot corresponding to that document class;
an analysis unit, configured to analyze the consultation hotspot.
Preferably, the device further comprises:
a word segmentation unit, configured to perform word segmentation on each consultation document to obtain a plurality of consultation words for each consultation document;
a keyword extraction unit, configured to extract keywords from the consultation words to obtain the keywords of each consultation document;
the first calculation unit being configured to calculate, according to the keywords, the similarity between each of the other consultation documents, apart from the k consultation documents, and each initial clustering center.
Preferably, the word segmentation unit comprises:
an atom segmentation unit, configured to perform atom segmentation on the original character string contained in each consultation document to obtain an atom segmentation result;
a shortest-path rough segmentation unit, configured to perform N-shortest-path rough segmentation on the atom segmentation result to obtain N word segmentation results;
a second calculation unit, configured to calculate the first distance of every path that exists between the word at one end of the binary word segmentation table and the word at the other end of the binary word segmentation table;
a determination unit, configured to determine the minimum first distance among the first distances and take the words contained in the path with the minimum first distance as the consultation words.
Preferably, the keyword extraction unit comprises:
a first counting unit, configured to count the number of times each consultation word occurs in the consultation document;
a normalization unit, configured to normalize the number of times each consultation word occurs in the consultation document to obtain the term frequency of each consultation word;
a second counting unit, configured to count, in a corpus, the number of documents that contain each consultation word;
an inverse document frequency calculation unit, configured to calculate the inverse document frequency of each consultation word from the total number of documents in the corpus and the number of documents that contain the consultation word;
a frequency calculation unit, configured to multiply the term frequency of each consultation word by its inverse document frequency to obtain the frequency calculation result of each consultation word;
a selection unit, configured to select, as the keywords of the consultation document, the consultation words whose frequency calculation result is greater than a preset threshold.
Preferably, the first calculation unit comprises:
a third acquisition unit, configured to obtain all the keywords contained in the two consultation documents whose similarity is to be calculated, where one of the two consultation documents is one of the k consultation documents serving as initial clustering centers, and the other is a consultation document other than the k consultation documents;
a third calculation unit, configured to count the number of times each of these keywords occurs in the consultation document serving as an initial clustering center;
a term frequency vector unit, configured to obtain, from those counts, the term frequency vector of all the keywords in the consultation document serving as an initial clustering center;
a cosine calculation unit, configured to calculate, using the law of cosines, the cosine of the angle between the term frequency vector of all the keywords in the consultation document serving as an initial clustering center and the term frequency vector of all the keywords in the consultation document other than the k consultation documents;
where the cosine of the angle represents the similarity between the consultation document serving as an initial clustering center and the consultation document other than the k consultation documents.
Preferably, the second acquisition unit comprises:
a fourth calculation unit, configured to calculate, using the similarity, the second distance from each of the other consultation documents, apart from the k consultation documents, to each initial clustering center;
a fourth acquisition unit, configured to obtain the initial clustering center corresponding to the minimum second distance;
where the smaller the second distance, the greater the similarity.
Preferably, the first classification unit comprises:
a second classification unit, configured to assign each of the other consultation documents, apart from the k consultation documents, to the document class of the initial clustering center corresponding to its minimum second distance;
a judging unit, configured to judge whether every consultation document other than those selected as initial clustering centers has been assigned to one of the k document classes;
a fifth calculation unit, configured to recalculate, when the judging unit's result is yes, the center of each of the k document classes to obtain the centers of k first clusters;
a first comparison unit, configured to compare whether the centers of the k first clusters are the same as the k initial clustering centers;
a sixth calculation unit, configured to, when the first comparison unit finds that they differ, take the centers of the k first clusters as the new cluster centers of the k document classes and calculate, using the similarity, the third distance from each of the plurality of consultation documents to each new cluster center;
a third classification unit, configured to assign each of the plurality of consultation documents to the document class of the new cluster center corresponding to its minimum third distance;
a second judging unit, configured to judge whether all of the plurality of consultation documents have been assigned to the k document classes;
a seventh calculation unit, configured to recalculate, when the second judging unit's result is yes, the center of each of the k document classes to obtain the centers of k second clusters;
a second comparison unit, configured to compare whether the centers of the k second clusters are the same as the centers of the previous clustering;
the sixth calculation unit being further configured to, when the second comparison unit finds that they differ, take the centers of the k second clusters as the new cluster centers of the k document classes and calculate, using the similarity, the third distance from each of the plurality of consultation documents to each new cluster center.
As can be seen from the above technical solution, compared with the prior art, the present application extracts k consultation documents from a plurality of consultation documents and takes them respectively as the initial clustering centers of k document classes, then calculates the similarity between each of the other consultation documents and each document class and classifies each consultation document into the document class with the maximum similarity, so that the obtained consultation documents are classified automatically. Keywords are then extracted from the consultation documents of a class, which yields statistics for that class and the consultation hotspot corresponding to those documents. Compared with the prior-art approach, in which consultation hotspots can be obtained and analyzed only after the consultation documents have been classified and counted manually, this improves the efficiency of classifying consultation documents.
Brief description of the drawings
In order to explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are merely embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a consultation hotspot analysis method disclosed in an embodiment of the present invention;
Fig. 2 is a flowchart of another consultation hotspot analysis method disclosed in an embodiment of the present invention;
Fig. 3 is a schematic diagram of a binary word segmentation table disclosed in an embodiment of the present invention;
Fig. 4 is a flowchart of yet another consultation hotspot analysis method disclosed in an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a consultation hotspot analysis device disclosed in an embodiment of the present invention;
Fig. 6 is another schematic structural diagram of a consultation hotspot analysis device disclosed in an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Referring to Fig. 1, which shows a flowchart of a consultation hotspot analysis method provided by an embodiment of the present invention, the consultation hotspot analysis method includes:
S101: obtaining a plurality of consultation documents;
When a user sends consultation content to the customer service center by text message, e-mail, or fax, the message already exists in document form, so the document containing the consultation content is used directly as a consultation document and stored in a database.
When a user sends a voice consultation to the customer service center, the voice must be recognized by speech recognition technology and converted into text; the text is used as a consultation document and stored in the database.
For example, when a consultation conversation conducted by calling the customer service hotline produces a consultation voice signal, speech recognition technology, drawing on signal processing, pattern recognition, artificial intelligence, probability theory, information theory, and the mechanisms of speech production and hearing, converts the unstructured consultation voice information into a structured index, enabling knowledge mining and fast retrieval over large volumes of consultation voice. A plurality of consultation documents are obtained from the database containing consultation documents, and by processing them the consultation hotspots related to the obtained documents are derived.
S102: extracting k consultation documents from the plurality of consultation documents, and taking the k consultation documents respectively as the initial clustering centers of k document classes, where k is a positive integer;
The value of k can be chosen according to actual needs, specifically according to the number of consultation hotspots to be obtained. For example, if 4 consultation hotspots are to be obtained from the consultation documents, 4 consultation documents are extracted from them, i.e. k=4.
S103: calculating the similarity between each of the other consultation documents in the plurality of consultation documents, apart from the k consultation documents, and each initial clustering center;
If n consultation documents are obtained from the database and k of them are selected as the initial clustering centers of k document classes, where n >= k, then the similarity between each of the remaining n-k consultation documents and each initial clustering center must be calculated.
Specifically, when n=10, 10 consultation documents are obtained from the database, denoted n1, n2, ..., n10. Four consultation documents, n1, n2, n3, and n4, are extracted from the 10 and taken as the initial clustering centers of 4 document classes; the similarity between each of n5, ..., n10 and each of n1, n2, n3, n4 is then calculated.
S104: for each of the other consultation documents, apart from the k consultation documents, obtaining, among its similarities, the initial clustering center corresponding to the maximum similarity;
S105: classifying each of the other consultation documents, apart from the k consultation documents, into the document class of the initial clustering center corresponding to its maximum similarity;
Taking n5 as an example, if among the similarities between n5 and n1, n2, n3, n4 the similarity between n5 and n1 is the greatest, n5 is classified into the document class of the initial clustering center n1.
In the same way, n6, ..., n10 are each classified into the document class of one of the four initial clustering centers n1, n2, n3, n4.
S106: extracting the keywords of each consultation document in a document class to obtain the consultation hotspot corresponding to that document class;
Suppose the classification above yields the classes {n1, n5, n10}, {n2, n6}, {n3, n7}, and {n4, n8, n9}. Taking the document class containing n1 as an example, this class contains three consultation documents, n1, n5, and n10; the keywords of n1, n5, and n10 are extracted respectively to obtain the consultation hotspot corresponding to the document class of these three documents.
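The source does not spell out how the extracted keywords of a class are turned into a consultation hotspot. One plausible reading, assumed here only for illustration, is to count in how many documents of the class each keyword appears and report the most frequent ones. The function name `class_hotspots`, the parameter `top_n`, and the sample keywords are all hypothetical.

```python
from collections import Counter


def class_hotspots(doc_keywords, labels, top_n=3):
    """Return {class label: top keywords} from per-document keywords.

    doc_keywords[i] holds the keywords of document i and labels[i] its class.
    Keywords are ranked by the number of documents in the class containing
    them; ties are broken alphabetically so the result is deterministic.
    """
    by_class = {}
    for keywords, label in zip(doc_keywords, labels):
        by_class.setdefault(label, Counter()).update(set(keywords))
    hotspots = {}
    for label, counts in by_class.items():
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        hotspots[label] = [word for word, _ in ranked[:top_n]]
    return hotspots


hotspots = class_hotspots(
    [["outage", "repair"], ["outage", "schedule"], ["bill", "payment"]],
    labels=[0, 0, 1],
    top_n=1,
)
# hotspots == {0: ["outage"], 1: ["bill"]}
```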
S107, to it is described consulting focus be analyzed.
In the technical solution provided by this embodiment of the present invention, k consulting documents are extracted from a plurality of consulting documents and used as the centers of the initial clusters of k document categories. The similarity between each of the other consulting documents and each document category is then calculated, and each consulting document is classified into the document category corresponding to its maximum similarity, so that the acquired consulting documents are classified automatically. Keywords are extracted from each class of classified consulting documents, which yields statistical information about that class and thus the consultation hot issue corresponding to the consulting documents. Compared with the prior art, in which consulting documents must be classified and counted manually before the consultation hotspot can be obtained and analyzed, this improves the efficiency of classifying consulting documents.
As shown in Fig. 2, an embodiment of the present invention discloses a consultation hotspot analysis method. The method of this embodiment includes:
S201, acquiring a plurality of consulting documents;
S202, extracting k consulting documents from the plurality of consulting documents and using the k consulting documents as the centers of the initial clusters of k document categories, where k is a positive integer;
The operations of steps S201 and S202 of this embodiment are similar to those of steps S101 and S102 of the embodiment shown in Fig. 1, respectively, and are not repeated here.
S203, performing word segmentation on each consulting document to obtain a plurality of consulting words corresponding to each consulting document;
Optionally, word segmentation of each consulting document may be implemented with the Chinese lexical analysis system ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System).
Performing word segmentation on each consulting document with ICTCLAS includes:
S2031, performing atom segmentation on the original character string contained in each consulting document to obtain an atom segmentation result;
S2032, performing N-shortest-path rough segmentation on the atom segmentation result to obtain N word segmentation results, the N word segmentation results being stored in the form of a binary word segmentation table, wherein the words included in each word segmentation result are connected to one another;
S2033, calculating the first distance of every path that exists between the word at one end of the binary word segmentation table and the word at the other end of the binary word segmentation table;
S2034, using the words included in the path corresponding to the minimum first distance as consulting words.
In practical applications, word segmentation of each consulting document with ICTCLAS can be divided into 5 steps, that is, the segmentation of the original character string is divided into 5 steps: step 1 is atom segmentation, step 2 is N-shortest-path rough segmentation, step 3 is the binary word segmentation table, step 4 is the word segmentation result, and step 5 is part-of-speech tagging.
For example, take the original character string "计算机1946年诞生" ("the computer was born in 1946").
First, the atom segmentation of step 1 is performed. The atom segmentation result is: 计 / 算 / 机 / 1946 / 年 / 诞 / 生.
Next, the N-shortest-path rough segmentation of step 2 is performed, that is, N word segmentation results that contain the correct result are found from the atom segmentation result, where containing the correct result means consisting of words that are linguistically valid. For example, with N set to 2, 2-shortest-path rough segmentation yields 2 word segmentation results: (1) 计算机 / 1946年 / 诞生 (computer / year 1946 / born); (2) 计算机 / 1946 / 年 / 诞生 (computer / 1946 / year / born).
Then, the binary word segmentation table of step 3 is built. The binary word segmentation table corresponding to the above 2 word segmentation results can be represented as shown in Fig. 3.
The words included in each word segmentation result are connected to one another, which fixes the order of the words.
For example, the first word segmentation result contains 3 words: "computer", "year 1946" and "born". Because the words in a word segmentation result are connected, the result stored in the binary word segmentation table can only have "computer" as the 1st word, "year 1946" as the 2nd word and "born" as the 3rd word.
Next, the word segmentation result of step 4 is obtained: the first distance of every path that exists between the word at one end of the binary word segmentation table and the word at the other end is calculated, and the words included in the path with the minimum first distance are used as consulting words.
For example, the word at one end of the binary word segmentation table is "computer" and the word at the other end is "born". The first distance of the path corresponding to the first word segmentation result is 2, and the first distance of the path corresponding to the second word segmentation result is 3, so the first distance of the first result is clearly smaller. The three words in the path corresponding to the first word segmentation result, "computer", "year 1946" and "born", are therefore used as consulting words.
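The selection in step 4 can be illustrated with a small Python sketch. It does not reproduce ICTCLAS; it only assumes that the first distance of a path equals the number of hops between its first and last word, which matches the distances 2 and 3 in the example, and that the example string is 计算机1946年诞生. The function names are hypothetical.

```python
def first_distance(segmentation):
    """Number of hops from the first word to the last word of a path."""
    return len(segmentation) - 1


def pick_consulting_words(candidates):
    """Return the candidate segmentation with the minimum first distance."""
    return min(candidates, key=first_distance)


# The two rough segmentation results from the example (N = 2).
candidates = [
    ["计算机", "1946年", "诞生"],      # computer / year 1946 / born
    ["计算机", "1946", "年", "诞生"],  # computer / 1946 / year / born
]
consulting_words = pick_consulting_words(candidates)
print(consulting_words)
```

A real N-shortest-path segmenter would weight the edges by word probabilities; the unit edge weight here is an assumption made to keep the sketch short.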
In effect, the segmentation of the character string is complete once the word segmentation result of step 4 has been obtained. If the part of speech of the consulting words also needs to be determined, the part-of-speech tagging of step 5 can be performed, which tags each consulting word as a verb, an adjective, a noun and so on.
S204, extracting keywords from the consulting words to obtain the keywords corresponding to each consulting document;
Optionally, keywords may be extracted from the consulting words with TF-IDF (term frequency-inverse document frequency), a weighting technique commonly used in information retrieval and data mining, where "TF" denotes word frequency and "IDF" denotes inverse document frequency.
Extracting keywords from the consulting words with TF-IDF includes:
S2041, counting the number of times each consulting word occurs in the consulting document;
S2042, normalizing the number of times each consulting word occurs in the consulting document to obtain the word frequency of each consulting word;
The word frequency is obtained by counting the number of times a given consulting word occurs in the consulting document and normalizing that count. It is calculated as follows:
$$\mathrm{TF} = \frac{n}{m}$$
where n is the number of times the consulting word occurs in the consulting document, and m is the total number of consulting words in the consulting document.
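With n the occurrence count of a consulting word and m the total number of consulting words in the document, the normalization of S2041 and S2042 can be sketched as follows. The function name `term_frequencies` and the sample words are hypothetical.

```python
from collections import Counter


def term_frequencies(words):
    """Return n / m for every distinct word in a segmented document."""
    counts = Counter(words)   # n for each consulting word
    total = len(words)        # m, the total number of consulting words
    return {word: count / total for word, count in counts.items()}


# Hypothetical segmented consulting document.
words = ["large user", "direct purchase", "large user", "electricity fee"]
tf = term_frequencies(words)
print(tf)
```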
S2043, counting, for each consulting word, the number of documents in a corpus that include the consulting word;
S2044, calculating the inverse document frequency of each consulting word from the total number of documents in the corpus and the number of documents that include the consulting word;
In practical applications, the words with the highest word frequency in a consulting document are usually interference words such as "的" ("of") and "是" ("is"). It is therefore necessary to count how many documents in a corpus include a given consulting word, and to calculate the inverse document frequency of each consulting word from the total number of documents in the corpus and the number of documents that include the word.
The inverse document frequency is calculated as follows:
$$\mathrm{IDF} = \log\frac{y}{x + 1}$$
where y is the total number of documents in the corpus and x is the number of documents in the corpus that include the consulting word. The formula shows that the larger x is, that is, the more documents in the corpus include the consulting word, the lower the inverse document frequency and the less important the consulting word.
The denominator is x + 1 so that it never equals 0: when no document in the corpus includes the consulting word, there is no division by zero, and the inverse document frequency still has a concrete value.
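The role of the x + 1 denominator can be checked with a short sketch. It assumes the usual logarithmic form of the inverse document frequency with a natural logarithm; the base of the logarithm, the function name and the tiny corpus are assumptions made for illustration.

```python
import math


def inverse_document_frequency(word, corpus):
    """log(y / (x + 1)) for a corpus given as a list of word sets."""
    total = len(corpus)                                   # y
    containing = sum(1 for doc in corpus if word in doc)  # x
    return math.log(total / (containing + 1))


# Hypothetical corpus of four documents.
corpus = [
    {"user", "fee"},
    {"user"},
    {"user", "purchase"},
    {"user"},
]
print(inverse_document_frequency("fee", corpus))      # word in one document
print(inverse_document_frequency("missing", corpus))  # word in no document
print(inverse_document_frequency("user", corpus))     # word in every document
```

A word that appears in no document still gets a finite value, and a word that appears everywhere gets the lowest value of the three.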
S2045, multiplying the word frequency of each consulting word by the inverse document frequency of that consulting word to obtain the frequency calculation result of each consulting word;
S2046, selecting the consulting words whose frequency calculation result is greater than a preset threshold as the keywords of the consulting document.
When the word frequency of a consulting word is multiplied by its inverse document frequency, a larger product indicates that the consulting word is a keyword of the consulting document; otherwise it is not.
For example, the consulting words obtained from a consulting document include "large user", "direct purchase", "electricity fee" and "composition", and the results obtained with the TF-IDF algorithm are given in Table 1:
As Table 1 shows, although "的" occurs many times in the consulting document and has a large TF value, its TF-IDF value is 0, so "的" is filtered out when the keywords of the consulting document are extracted.
In Table 1, the frequency calculation result of each consulting word is its TF-IDF value. If the preset threshold is 0.05, the consulting words whose TF-IDF value exceeds 0.05 are "large user", "direct purchase" and "electricity fee", which are therefore the keywords of this consulting document.
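Steps S2045 and S2046 amount to a product followed by a threshold test, as in the sketch below. The TF and IDF numbers are hypothetical stand-ins, since the values of Table 1 are not reproduced here; only the threshold 0.05 and the resulting keywords follow the example.

```python
def select_keywords(tf, idf, threshold):
    """Keep the words whose TF * IDF product exceeds the threshold."""
    scores = {word: tf[word] * idf.get(word, 0.0) for word in tf}
    return {word for word, score in scores.items() if score > threshold}


# Hypothetical values, not the values of Table 1.
tf = {"large user": 0.10, "direct purchase": 0.09,
      "electricity fee": 0.08, "composition": 0.01, "的": 0.20}
idf = {"large user": 1.0, "direct purchase": 0.9,
       "electricity fee": 0.8, "composition": 0.5, "的": 0.0}

keywords = select_keywords(tf, idf, threshold=0.05)
print(sorted(keywords))
```

The frequent word "的" drops out because its inverse document frequency is 0 in this illustration, so its product is 0 regardless of how large its word frequency is.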
S205, calculating, according to the keywords, the similarity between each of the other consulting documents in the plurality of consulting documents other than the k consulting documents and the center of each initial cluster;
Optionally, the similarity between consulting documents is calculated by cosine similarity.
As shown in table 2:
Calculating the similarity between consulting documents by cosine similarity includes:
S2051, obtaining all keywords included in the two consulting documents whose similarity is to be calculated, where one of the two is one of the k consulting documents serving as initial cluster centers and the other is a consulting document other than the k consulting documents;
Taking Table 2 as an example, suppose consulting document 1 is one of the k consulting documents serving as initial cluster centers, and consulting document 2 and consulting document 3 are consulting documents other than the k consulting documents. The two consulting documents whose similarity is currently to be calculated are consulting document 1 and consulting document 2. The keywords of consulting document 1 are "large user", "direct purchase" and "electricity fee"; the keywords of consulting document 2 are "large user", "direct purchase" and "application". The two documents therefore include four keywords in total: "large user", "direct purchase", "electricity fee" and "application".
S2052, counting the number of times each of the keywords occurs in the consulting document that is one of the k consulting documents serving as initial cluster centers;
The keywords "large user", "direct purchase", "electricity fee" and "application" occur 30, 28, 31 and 2 times, respectively, in consulting document 1.
S2053, obtaining, from those counts, the word frequency vector of all the keywords in the consulting document that is one of the k consulting documents serving as initial cluster centers;
In consulting document 1, the word frequency vector for "large user", "direct purchase", "electricity fee" and "application" is [30, 28, 31, 2].
S2054, counting the number of times each of the keywords occurs in the consulting document other than the k consulting documents;
The keywords "large user", "direct purchase", "electricity fee" and "application" occur 31, 29, 3 and 30 times, respectively, in consulting document 2.
S2055, obtaining, from those counts, the word frequency vector of all the keywords in the consulting document other than the k consulting documents;
In consulting document 2, the word frequency vector for "large user", "direct purchase", "electricity fee" and "application" is [31, 29, 3, 30].
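Steps S2051 to S2055 can be sketched as building one shared keyword list and reading each document's counts against it. The helper names are hypothetical; the counts are the ones given in the example.

```python
def keyword_union(first, second):
    """Merge two keyword lists, keeping first-seen order and no duplicates."""
    merged = list(first)
    merged += [word for word in second if word not in first]
    return merged


def frequency_vector(keywords, counts):
    """Occurrence count of every keyword, 0 if the word never occurs."""
    return [counts.get(word, 0) for word in keywords]


keywords = keyword_union(
    ["large user", "direct purchase", "electricity fee"],  # document 1
    ["large user", "direct purchase", "application"],      # document 2
)
counts_1 = {"large user": 30, "direct purchase": 28,
            "electricity fee": 31, "application": 2}
counts_2 = {"large user": 31, "direct purchase": 29,
            "electricity fee": 3, "application": 30}

vector_1 = frequency_vector(keywords, counts_1)
vector_2 = frequency_vector(keywords, counts_2)
print(keywords, vector_1, vector_2)
```

Using the same keyword order for both documents is what makes the two vectors comparable element by element.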
S2056, using the law of cosines to calculate the cosine of the angle between the word frequency vector of all the keywords in the consulting document that is one of the k consulting documents serving as initial cluster centers and the word frequency vector of all the keywords in the consulting document other than the k consulting documents;
The cosine of the angle represents the similarity between the consulting document that is one of the k consulting documents serving as initial cluster centers and the consulting document other than the k consulting documents. It is calculated as:
$$\cos\theta = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$
where A and B are the word frequency vectors of the keywords of the two consulting documents whose similarity is to be calculated, $A_i$ is an element of the word frequency vector A, $B_i$ is an element of the word frequency vector B, and n is the dimension of the vectors.
The formula shows that the closer the cosine is to 1, the closer the angle is to 0 degrees, that is, the higher the similarity between the two consulting documents.
In Table 2, the similarity between consulting document 1 and consulting document 2 calculated with the law of cosines is 0.8, and the similarity between consulting document 1 and consulting document 3 is 0.1. Clearly, consulting document 1 and consulting document 2, whose keywords overlap in two places, have the higher similarity.
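The cosine of the angle between two word frequency vectors can be computed with a few lines of Python. This is a sketch rather than the patented implementation; returning 0.0 for an all-zero vector is an assumption. For the vectors [30, 28, 31, 2] and [31, 29, 3, 30] listed above it evaluates to roughly 0.71; the figures quoted for Table 2 may rest on values that are not reproduced in this text.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


similarity = cosine_similarity([30, 28, 31, 2], [31, 29, 3, 30])
print(round(similarity, 4))
```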
S206, for each of the other consulting documents in the plurality of consulting documents other than the k consulting documents, obtaining the center of the initial cluster corresponding to the maximum of its similarities;
S207, classifying each of the other consulting documents other than the k consulting documents into the document category in which the center of the initial cluster corresponding to the maximum similarity is located;
S208, extracting the keywords of each consulting document in the document category to obtain the consultation hotspot corresponding to the document category;
S209, analyzing the consultation hotspot.
The operations of steps S206 to S209 of this embodiment are similar to those of steps S104 to S107 of the embodiment shown in Fig. 1, respectively, and are not repeated here.
In the above embodiment, the present application extracts all the keywords from the two consulting documents whose similarity is to be calculated, obtains the word frequency vector of all the keywords in each consulting document, and calculates the similarity between the two consulting documents from the word frequency vectors with the law of cosines.
The word frequency vectors of consulting documents are generally sparse, that is, they contain only a few non-zero values. If the distance between two word frequency vectors is calculated directly, the many zero values in the two vectors match one another, so that two vectors that are actually dissimilar end up very close together, and the similarity between the two consulting documents is wrongly judged to be high. For this reason the similarity between two word frequency vectors is calculated with the law of cosines, which avoids the interference of zero values in the word frequency vectors. This improves the accuracy of judging the similarity between two word frequency vectors and, because similarity is the basis for classifying consulting documents, also improves the accuracy of consulting document classification.
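The effect of sparsity can be seen in a small experiment. The sketch assumes that "directly calculating the distance" means the Euclidean distance, and the three vectors over a 1000-keyword vocabulary are hypothetical.

```python
import math


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


size = 1000
doc_a = [0] * size
doc_a[0] = 1    # mentions only keyword 0, once
doc_b = [0] * size
doc_b[1] = 1    # mentions only keyword 1, once
doc_c = [0] * size
doc_c[0] = 10   # mentions only keyword 0, ten times

# The 998 matching zeros keep the unrelated pair (a, b) close in distance.
print(euclidean(doc_a, doc_b), euclidean(doc_a, doc_c))
# The cosine ignores the shared zeros and separates the two cases.
print(cosine(doc_a, doc_b), cosine(doc_a, doc_c))
```

Documents a and b share no keyword yet are closer in Euclidean distance than a and c, which use the same keyword; the cosine ranks the pairs the other way round.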
As shown in Fig. 4, an embodiment of the present invention discloses a consultation hotspot analysis method. The method of this embodiment includes:
S301, acquiring a plurality of consulting documents;
S302, extracting k consulting documents from the plurality of consulting documents and using the k consulting documents as the centers of the initial clusters of k document categories, where k is a positive integer;
S303, performing word segmentation on each consulting document to obtain a plurality of consulting words corresponding to each consulting document;
S304, extracting keywords from the consulting words to obtain the keywords corresponding to each consulting document;
S305, calculating, according to the keywords, the similarity between each of the other consulting documents in the plurality of consulting documents other than the k consulting documents and the center of each initial cluster;
The operations of steps S301 to S305 of this embodiment are similar to those of steps S201 to S205 of the embodiment shown in Fig. 2, respectively, and are not repeated here.
S306, using the similarity to calculate the second distance from each of the other consulting documents other than the k consulting documents to the center of each initial cluster;
S307, obtaining the center of the initial cluster corresponding to the minimum second distance, where the smaller the second distance, the greater the similarity;
S308, assigning each of the other consulting documents other than the k consulting documents to the document category in which the center of the initial cluster corresponding to the minimum second distance is located;
All the consulting documents in the selected plurality of consulting documents other than the k consulting documents are taken as samples, and each sample is classified into the document category closest to it, that is, the category for which the distance between the sample and the cluster center is smallest. The distance is calculated as:
$$C^{(i)} := \arg\min_{j} \cos\left(x^{(i)}, \mu_j\right)$$
where $C^{(i)}$ is the document category to which the i-th sample is assigned, $x^{(i)}$ is the i-th sample, and $\mu_j$ is the center of the j-th document category. Its meaning is that the i-th sample is assigned to the document category whose center has the minimum cosine-based distance to it, which by S307 is the center with the maximum similarity.
S309, judging whether every consulting document other than those selected as initial cluster centers has been assigned to one of the k document categories;
S3010, when the judgment result is yes, recalculating the center of each of the k document categories to obtain the centers of k first clusters;
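The center of the j-th document category is calculated as:
$$\mu_j := \frac{\sum_{i=1}^{m} 1\{C^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{C^{(i)} = j\}}$$
where $1\{C^{(i)} = j\}$ equals 1 when the i-th sample is assigned to the j-th document category and 0 otherwise, and m is the total number of samples.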
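The recalculation of S3010 can be sketched by taking each new center as the mean of the sample vectors assigned to its category, which is the usual K-Means update and is assumed here to be what the formula expresses. The function name and the sample data are hypothetical, and every category is assumed to contain at least one sample.

```python
def update_centers(samples, labels, k):
    """Mean vector of the samples assigned to each of the k categories."""
    dim = len(samples[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for vector, category in zip(samples, labels):
        counts[category] += 1               # how many samples fall in j
        for d in range(dim):
            sums[category][d] += vector[d]  # running sum of their vectors
    # Assumption: counts[j] > 0 for every category j.
    return [[value / counts[j] for value in sums[j]] for j in range(k)]


samples = [[1, 2], [3, 4], [10, 0]]
labels = [0, 0, 1]  # category index assigned to each sample
centers = update_centers(samples, labels, k=2)
print(centers)
```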
It can be understood that when the judgment result is no, some of the consulting documents other than those selected as initial cluster centers have not yet been assigned to any of the k initial cluster centers. In that case, execution returns to S308: each consulting document that has not yet been assigned is assigned to the document category of the initial cluster center corresponding to its minimum second distance.
S3011, comparing the centers of the k first clusters with the centers of the k initial clusters to determine whether they are the same;
If they are different, S3012 is performed.
S3012, using the centers of the k first clusters as the new cluster centers of the k document categories. If the centers of the k first clusters differ from the centers of the k initial clusters, then after the other consulting documents were assigned to the document categories of the initial cluster centers, the cluster centers of the k document categories changed and are no longer the centers of the k initial clusters. All consulting documents must then be clustered again, that is, reassigned.
S3013, using the similarity to calculate the third distance from each of the plurality of consulting documents to each new cluster center;
S3014, assigning each of the plurality of consulting documents to the document category in which the new cluster center corresponding to the minimum third distance is located;
All consulting documents are clustered again around the new cluster centers. Note that a new cluster center may no longer be a consulting document; it may consist of only one keyword or of several keywords.
S3015, judging whether all of the plurality of consulting documents have been assigned to the k document categories;
S3016, when the judgment result is yes, recalculating the center of each of the k document categories to obtain the centers of k second clusters;
S3017, comparing the centers of the k second clusters with the centers from the previous clustering to determine whether they are the same;
That is, the centers of the second clusters are compared with the centers of the first clusters.
If they are different, S3018 is performed.
S3018, using the centers of the k second clusters as the new cluster centers of the k document categories, and returning to S3013.
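The loop formed by S3013 to S3018 can be sketched as below. The sketch simplifies the procedure in two ways that are assumptions: every document, including the initial centers, is reassigned in each round, and a category that becomes empty keeps its previous center. The cap `max_rounds` and all names and sample vectors are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def assign(samples, centers):
    """Index of the most similar (that is, nearest) center for each sample."""
    return [max(range(len(centers)), key=lambda j: cosine(x, centers[j]))
            for x in samples]


def update(samples, labels, old_centers):
    centers = []
    for j, old in enumerate(old_centers):
        members = [x for x, c in zip(samples, labels) if c == j]
        if not members:
            centers.append(old)  # assumption: keep the center of an empty category
            continue
        centers.append([sum(col) / len(members) for col in zip(*members)])
    return centers


def cluster(samples, initial_centers, max_rounds=100):
    centers = [list(c) for c in initial_centers]
    labels = assign(samples, centers)
    for _ in range(max_rounds):
        new_centers = update(samples, labels, centers)
        if new_centers == centers:   # centers unchanged: clustering is stable
            break
        centers = new_centers        # otherwise recluster around the new centers
        labels = assign(samples, centers)
    return labels, centers


samples = [[1, 0], [2, 0.1], [0, 1], [0.1, 3]]
labels, centers = cluster(samples, initial_centers=[[1, 0], [0, 1]])
print(labels, centers)
```

The loop stops as soon as a recalculation leaves every center where it was, which corresponds to the comparison in S3017 finding the centers identical.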
S3019, extracting the keywords of each consulting document in the document category to obtain the consultation hotspot corresponding to the document category;
The keywords of each consulting document in a document category can be extracted with the TF-IDF algorithm to obtain the consultation hotspot corresponding to the document category.
In this embodiment, the TF-IDF algorithm described earlier extracts the keywords of a single consulting document, whereas S3019 extracts keywords from all the consulting documents in a whole document category and thereby obtains the consultation hotspot corresponding to that category.
For example, in Table 2 above, if consulting document 1 and consulting document 2 are classified into one document category, extracting the keywords of each consulting document in the category with the TF-IDF algorithm gives "large user direct purchase" as the consultation hotspot corresponding to this document category.
S3020, analyzing the consultation hotspot.
In this embodiment, k consulting documents are extracted from a plurality of consulting documents and used as the centers of the initial clusters of k document categories. The similarity between each of the other consulting documents and each document category is then calculated, and each consulting document is classified into the document category corresponding to its maximum similarity, so that the acquired consulting documents are classified automatically by the K-Means algorithm. Keywords are extracted from each class of classified consulting documents, which yields statistical information about that class and thus the consultation hot issue corresponding to the consulting documents. Compared with the prior art, in which consulting documents must be classified and counted manually before the consultation hotspot can be obtained and analyzed, this improves the efficiency of classifying consulting documents.
Moreover, in the present application the consultation hotspots are analyzed after the consulting documents have been classified automatically and keywords have been extracted to obtain the consultation hotspot of each document category. On this basis, the operators of the customer service center can be trained so that they have an overall grasp of the topics customers consult about. When an operator receives a customer's consultation, the operator can purposefully assign its content by hand to a document category and look up in the knowledge base the answer and handling for the hotspot of that category, and so quickly find a highly standardized answer to the customer's consultation. This not only allows customer consultations to be answered quickly, but also solves the problems in the prior art that operators' answers to customer consultations are not uniform, are highly subjective and are poorly standardized.
Corresponding to the consultation hotspot analysis method shown in Fig. 1, the present invention further provides a consultation hotspot analysis device, whose structure is shown schematically in Fig. 5. The consultation hotspot analysis device provided by this embodiment includes a first acquisition unit 11, a first extraction unit 12, a first calculation unit 13, a second acquisition unit 14, a first classification unit 15, a second extraction unit 16 and an analysis unit 17.
The first acquisition unit 11 is configured to acquire a plurality of consulting documents;
The first extraction unit 12 is configured to extract k consulting documents from the plurality of consulting documents and use the k consulting documents as the centers of the initial clusters of k document categories, where k is a positive integer;
The first calculation unit 13 is configured to calculate the similarity between each of the other consulting documents in the plurality of consulting documents other than the k consulting documents and the center of each initial cluster;
The second acquisition unit 14 is configured to obtain, for each of the other consulting documents in the plurality of consulting documents other than the k consulting documents, the center of the initial cluster corresponding to the maximum of its similarities;
The first classification unit 15 is configured to classify each of the other consulting documents other than the k consulting documents into the document category in which the center of the initial cluster corresponding to the maximum similarity is located;
The second extraction unit 16 is configured to extract the keywords of each consulting document in the document category to obtain the consultation hotspot corresponding to the document category;
The analysis unit 17 is configured to analyze the consultation hotspot.
This embodiment discloses a consultation hotspot analysis device. The first acquisition unit acquires a plurality of consulting documents; the first extraction unit extracts k consulting documents from them and uses the k consulting documents as the centers of the initial clusters of k document categories; the first calculation unit then calculates the similarity between each of the other consulting documents and each document category; and the first classification unit classifies each consulting document into the document category corresponding to the maximum similarity obtained by the second acquisition unit, so that the acquired consulting documents are classified automatically. The second extraction unit extracts keywords from each class of classified consulting documents, which yields statistical information about that class and thus the consultation hot issue corresponding to the consulting documents. Compared with the prior art, in which consulting documents must be classified and counted manually before the consultation hotspot can be obtained and analyzed, this improves the efficiency of classifying consulting documents.
Referring to Fig. 6, which shows another schematic structural diagram of a consultation hotspot analysis device provided by an embodiment of the present application, the device may, on the basis of Fig. 5, further include a word segmentation unit 21 and a keyword extraction unit 22.
The word segmentation unit 21 is configured to perform word segmentation on each consulting document to obtain a plurality of consulting words corresponding to each consulting document.
The word segmentation unit 21 includes an atom segmentation unit 31, a shortest-path rough segmentation unit 32, a second calculation unit 33 and a determination unit 34.
The atom segmentation unit 31 is configured to perform atom segmentation on the original character string contained in each consulting document to obtain an atom segmentation result;
The shortest-path rough segmentation unit 32 is configured to perform N-shortest-path rough segmentation on the atom segmentation result to obtain N word segmentation results;
The second calculation unit 33 is configured to calculate the first distance of every path that exists between the word at one end of the binary word segmentation table and the word at the other end of the binary word segmentation table;
The determination unit 34 is configured to determine the minimum first distance among the first distances and use the words included in the path corresponding to the minimum first distance as consulting words.
The keyword extraction unit 22 is configured to extract keywords from the consulting words to obtain the keywords corresponding to each consulting document.
The keyword extraction unit 22 includes a first statistics unit 41, a standardization unit 42, a second statistics unit 43, an inverse document frequency computing unit 44, a frequency computing unit 45 and a selection unit 46.
The first statistics unit 41 is configured to count the number of times each consultation word occurs in the consultation document;
The standardization unit 42 is configured to standardize the number of times each consultation word occurs in the consultation document to obtain the term frequency of each consultation word;
The second statistics unit 43 is configured to count, in a corpus, the number of documents that contain each consultation word;
The inverse document frequency computing unit 44 is configured to calculate the inverse document frequency of each consultation word from the total number of documents in the corpus and the number of those documents that contain the consultation word;
The frequency computing unit 45 is configured to multiply the term frequency of each consultation word by its inverse document frequency to obtain the frequency calculation result of each consultation word;
The selection unit 46 is configured to select, as keywords of the consultation document, the consultation words whose frequency calculation results exceed a preset threshold.
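The keyword extraction performed by units 41 to 46 resembles the familiar TF-IDF weighting, and a minimal Python sketch of it is given below for illustration. The text above does not state the exact formulas, so two common choices are assumed here: the count is standardized by dividing by the number of words in the document, and the inverse document frequency is log(N / (1 + n)), where N is the total number of corpus documents and n is the number that contain the word. The function names and the threshold value are likewise hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def term_frequency(words: List[str]) -> Dict[str, float]:
    """Units 41 and 42: count each word, then normalize by document length (assumed)."""
    counts = Counter(words)
    total = len(words)
    return {word: count / total for word, count in counts.items()}


def inverse_document_frequency(word: str, corpus: List[List[str]]) -> float:
    """Units 43 and 44: log(total documents / (1 + documents containing the word))."""
    containing = sum(1 for doc in corpus if word in doc)
    return math.log(len(corpus) / (1 + containing))


def extract_keywords(
    words: List[str], corpus: List[List[str]], threshold: float
) -> Dict[str, float]:
    """Units 45 and 46: multiply TF by IDF and keep the words above the threshold."""
    tf = term_frequency(words)
    scores = {w: tf[w] * inverse_document_frequency(w, corpus) for w in tf}
    return {w: s for w, s in scores.items() if s > threshold}
```

A word that occurs in most corpus documents receives an inverse document frequency near or below zero under this formula, so it is filtered out even if it is frequent in the consultation document itself.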
Optionally, in this embodiment, the first computing unit 13 includes a third acquisition unit 51, a third computing unit 52, a term frequency vector unit 53 and an included-angle cosine computing unit 54.
The third acquisition unit 51 is configured to obtain all keywords contained in the two consultation documents whose similarity is to be calculated; of these two documents, one is one of the k consultation documents serving as initial cluster centers, and the other is a consultation document other than the k consultation documents;
The third computing unit 52 is configured to count the number of times each of these keywords occurs in the consultation document serving as an initial cluster center;
The term frequency vector unit 53 is configured to obtain, from those counts, the term frequency vector of all the keywords in the consultation document serving as an initial cluster center;
The included-angle cosine computing unit 54 is configured to calculate, using the cosine law, the cosine of the angle between the term frequency vector of all the keywords in the consultation document serving as an initial cluster center and the term frequency vector of all the keywords in the consultation document other than the k consultation documents;
The cosine of the angle represents the similarity between the consultation document serving as an initial cluster center and the consultation document other than the k consultation documents.
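For illustration, the similarity calculation of units 51 to 54 can be sketched in Python as below. The sketch assumes that each consultation document is already represented by its list of keywords (with repeats), builds the two term frequency vectors over the union of both keyword lists, and returns the cosine of the angle between them. The function name cosine_similarity and the convention of returning 0.0 for an empty document are assumptions made for the example.

```python
import math
from collections import Counter
from typing import List


def cosine_similarity(keywords_a: List[str], keywords_b: List[str]) -> float:
    """Cosine of the angle between the term frequency vectors of two documents."""
    # Unit 51: all keywords contained in the two documents.
    vocabulary = sorted(set(keywords_a) | set(keywords_b))
    # Units 52 and 53: occurrence counts arranged as term frequency vectors.
    counts_a, counts_b = Counter(keywords_a), Counter(keywords_b)
    vec_a = [counts_a[w] for w in vocabulary]
    vec_b = [counts_b[w] for w in vocabulary]
    # Unit 54: dot product divided by the product of the vector lengths.
    dot = sum(x * y for x, y in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(x * x for x in vec_a))
    norm_b = math.sqrt(sum(y * y for y in vec_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumed convention for a document without keywords
    return dot / (norm_a * norm_b)
```

Because term counts are never negative, the value lies between 0 (no shared keywords) and 1 (identical keyword proportions).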
Optionally, in this embodiment, the second acquisition unit 14 includes a fourth computing unit 61 and a fourth acquisition unit 62;
The first classification unit 15 includes a second classification unit 63, a judging unit 64, a fifth computing unit 65, a first comparison unit 66, a sixth computing unit 67, a third classification unit 68, a second judging unit 69, a seventh computing unit 70 and a second comparison unit 71.
The fourth computing unit 61 is configured to calculate, using the similarity, the second distance from each of the other consultation documents (those other than the k consultation documents) to each initial cluster center;
The fourth acquisition unit 62 is configured to obtain the initial cluster center corresponding to the minimum second distance;
The smaller the second distance, the greater the similarity.
The second classification unit 63 is configured to assign each of the other consultation documents to the document category of the initial cluster center corresponding to its minimum second distance;
The judging unit 64 is configured to judge whether every consultation document other than those chosen as initial cluster centers has been assigned to one of the k document categories;
The fifth computing unit 65 is configured to recalculate, when the result of the judging unit is yes, the center of each of the k document categories to obtain k first cluster centers;
The first comparison unit 66 is configured to compare whether the k first cluster centers are the same as the k initial cluster centers;
The sixth computing unit 67 is configured to take, when the first comparison unit finds them different, the k first cluster centers as the new cluster centers of the k document categories, and to calculate, using the similarity, the third distance from each of the multiple consultation documents to each new cluster center;
The third classification unit 68 is configured to assign each of the multiple consultation documents to the document category of the new cluster center corresponding to its minimum third distance;
The second judging unit 69 is configured to judge whether all of the multiple consultation documents have been assigned to the k document categories;
The seventh computing unit 70 is configured to recalculate, when the result of the second judging unit is yes, the center of each of the k document categories to obtain k second cluster centers;
The second comparison unit 71 is configured to compare whether the k second cluster centers are the same as the centers of the previous clustering;
The sixth computing unit 67 is further configured to take, when the second comparison unit finds them different, the k second cluster centers as the new cluster centers of the k document categories, and to calculate, using the similarity, the third distance from each of the multiple consultation documents to each new cluster center.
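The iterative classification described for units 61 to 71 follows the general pattern of k-means clustering, and a minimal Python sketch of that pattern is given below for illustration. Several details are assumptions, since the text above does not fix them: documents are represented as keyword-count dictionaries, the first k documents serve as the initial cluster centers, the distance is taken as 1 minus the cosine similarity (so a smaller distance means a greater similarity), a category center is recalculated as the mean of its members' vectors, and max_iter is a safety cap. For brevity the sketch also assigns the k center documents themselves in the first pass. All names are hypothetical.

```python
import math
from typing import Dict, List

Vector = Dict[str, float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse keyword-count vectors."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors: List[Vector]) -> Vector:
    """Recalculate a category center as the mean of its members (assumed)."""
    total: Vector = {}
    for vec in vectors:
        for word, value in vec.items():
            total[word] = total.get(word, 0.0) + value
    return {word: value / len(vectors) for word, value in total.items()}


def cluster(docs: List[Vector], k: int, max_iter: int = 100) -> List[int]:
    """Return the category index of every document."""
    centers = [dict(doc) for doc in docs[:k]]  # initial cluster centers (assumed choice)
    labels = [0] * len(docs)
    for _ in range(max_iter):
        # Assign every document to the center at minimum distance (1 - similarity).
        labels = [
            min(range(k), key=lambda c: 1.0 - cosine(doc, centers[c]))
            for doc in docs
        ]
        # Recalculate the center of each category; keep the old center if it is empty.
        new_centers = []
        for c in range(k):
            members = [doc for doc, label in zip(docs, labels) if label == c]
            new_centers.append(mean_vector(members) if members else centers[c])
        if new_centers == centers:  # centers unchanged: the clustering has converged
            break
        centers = new_centers
    return labels
```

The loop stops when a recalculation leaves every center unchanged, which corresponds to the comparison units finding the new centers identical to those of the previous clustering.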
This embodiment discloses a consultation hotspot analysis device. The first acquisition unit obtains multiple consultation documents; the first extraction unit extracts k consultation documents from them and uses these k documents as the initial cluster centers of k document categories. The first computing unit then calculates the similarity between each of the other consultation documents and each document category, and the first classification unit assigns each consultation document to the document category corresponding to the maximum similarity obtained by the second acquisition unit. The acquired consultation documents are thus classified automatically. The second extraction unit extracts keywords from the consultation documents of a classified category, which yields statistical information about that category, namely the consultation hotspot corresponding to its documents. Compared with the prior art, in which consultation documents must be classified and counted manually before consultation hotspots can be obtained and analyzed, this improves the efficiency of classifying consultation documents.
Moreover, in the present application the consultation documents are classified automatically, keywords are extracted to obtain the consultation hotspots of each document category, and the hotspots are then analyzed. Operators of the customer service center can be trained on this basis so that they have an overall grasp of the questions customers ask. When an operator receives a customer's consultation, the operator can deliberately assign the content of the current consultation to one of the document categories, look up in the knowledge base the answer and reasoning for that category's hotspot, and thus quickly find a highly standardized answer to the current consultation. This not only allows customer consultations to be answered quickly, but also solves the prior-art problems that operators' answers to customer consultations lack uniformity, are highly subjective and are poorly standardized.
It should be noted that the embodiments in this specification are described in a progressive manner: each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments can be referred to one another. Because the device provided by an embodiment corresponds to the method provided by an embodiment, its description is relatively brief, and the relevant parts can be found in the description of the method.
Finally, it should also be noted that relational terms such as first and second are used herein only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of additional identical elements in the process, method, article or device that includes that element.
For convenience of description, the above device is described in terms of units divided by function. Of course, when the present application is implemented, the functions of the units may be realized in one or more pieces of software and/or hardware.
From the above description of the embodiments, those skilled in the art can clearly understand that the present application can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the essence of the technical solution of the present application, or the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform the methods described in the embodiments, or in certain parts of the embodiments, of the present application.
The data sharing method, system and mobile terminal provided by the present application have been described in detail above. Specific examples are used herein to explain the principles and implementations of the application, and the description of the above embodiments is only intended to help in understanding the method of the application and its core idea. At the same time, those of ordinary skill in the art may, in accordance with the idea of the application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the application.

Claims (14)

1. A consultation hotspot analysis method, characterized by comprising:
Obtaining multiple consultation documents;
Extracting k consultation documents from the multiple consultation documents and using the k consultation documents respectively as the initial cluster centers of k document categories, where k is a positive integer;
Calculating the similarity between each of the other consultation documents (those of the multiple consultation documents other than the k consultation documents) and each initial cluster center;
Obtaining, for each of the other consultation documents, the initial cluster center corresponding to the maximum of its similarities;
Classifying each of the other consultation documents into the document category of the initial cluster center corresponding to its maximum similarity;
Extracting the keywords of each consultation document in a document category to obtain the consultation hotspot corresponding to that document category;
Analyzing the consultation hotspot.
2. The method according to claim 1, characterized in that calculating the similarity between each of the other consultation documents and each initial cluster center comprises:
Performing word segmentation on each consultation document to obtain the multiple consultation words corresponding to each consultation document;
Extracting keywords from the consultation words of each consultation document to obtain the keywords corresponding to each consultation document;
Calculating, according to the keywords, the similarity between each of the other consultation documents and each initial cluster center.
3. The method according to claim 2, characterized in that performing word segmentation on each consultation document to obtain the multiple consultation words corresponding to each consultation document comprises:
Performing atomic segmentation on the original character string contained in each consultation document to obtain an atomic segmentation result;
Performing N-shortest-path rough segmentation on the atomic segmentation result to obtain N word segmentation results, the N word segmentation results being stored in the form of a binary word segmentation table, where the words contained in each word segmentation result are connected;
Calculating the first distance of every path that exists between the word at one end of the binary word segmentation table and the word at the other end of the table;
Taking the words contained in the path corresponding to the minimum first distance as the consultation words.
4. The method according to claim 2, characterized in that extracting keywords from the consultation words of each consultation document to obtain the keywords corresponding to each consultation document comprises:
Counting the number of times each consultation word occurs in the consultation document;
Standardizing the number of times each consultation word occurs in the consultation document to obtain the term frequency of each consultation word;
Counting, in a corpus, the number of documents that contain each consultation word;
Calculating the inverse document frequency of each consultation word from the total number of documents in the corpus and the number of those documents that contain the consultation word;
Multiplying the term frequency of each consultation word by its inverse document frequency to obtain the frequency calculation result of each consultation word;
Selecting, as keywords of the consultation document, the consultation words whose frequency calculation results exceed a preset threshold.
5. The method according to claim 2, characterized in that calculating, according to the keywords, the similarity between each of the other consultation documents and each initial cluster center comprises:
Obtaining all keywords contained in the two consultation documents whose similarity is to be calculated, where one of the two documents is one of the k consultation documents serving as initial cluster centers and the other is a consultation document other than the k consultation documents;
Counting the number of times each of these keywords occurs in the consultation document serving as an initial cluster center;
Obtaining, from those counts, the term frequency vector of all the keywords in the consultation document serving as an initial cluster center;
Counting the number of times each of these keywords occurs in the consultation document other than the k consultation documents;
Obtaining, from those counts, the term frequency vector of all the keywords in the consultation document other than the k consultation documents;
Calculating, using the cosine law, the cosine of the angle between the term frequency vector of all the keywords in the consultation document serving as an initial cluster center and the term frequency vector of all the keywords in the consultation document other than the k consultation documents;
Where the cosine of the angle represents the similarity between the consultation document serving as an initial cluster center and the consultation document other than the k consultation documents.
6. method according to claim 1 and 2, it is characterised in that except k official communication in the plurality of consulting document of the acquisition Ask in the similarity of other each the consulting documents outside document, the corresponding initial clustering of the similarity maximum Center include:
Described other each consulting documents in addition to k is seeked advice from document are calculated respectively at the beginning of each described using the similarity The second distance at the center of the cluster that begins;
Obtain the center of the corresponding initial clustering of the second distance minima;
Wherein, the second distance is less, and the similarity is bigger.
7. The method according to claim 6, characterized in that classifying each of the other consultation documents into the document category of the initial cluster center corresponding to its maximum similarity comprises:
Assigning each of the other consultation documents to the document category of the initial cluster center corresponding to its minimum second distance;
Judging whether every consultation document other than those chosen as initial cluster centers has been assigned to one of the k document categories;
If so, recalculating the center of each of the k document categories to obtain k first cluster centers;
Comparing whether the k first cluster centers are the same as the k initial cluster centers;
If they differ, taking the k first cluster centers as the new cluster centers of the k document categories;
Calculating, using the similarity, the third distance from each of the multiple consultation documents to each new cluster center;
Assigning each of the multiple consultation documents to the document category of the new cluster center corresponding to its minimum third distance;
Judging whether all of the multiple consultation documents have been assigned to the k document categories;
If so, recalculating the center of each of the k document categories to obtain k second cluster centers;
Comparing whether the k second cluster centers are the same as the centers of the previous clustering;
If they differ, taking the k second cluster centers as the new cluster centers of the k document categories;
Returning to the step of calculating, using the similarity, the third distance from each of the multiple consultation documents to each new cluster center.
8. A consultation hotspot analysis device, characterized by comprising:
A first acquisition unit, configured to obtain multiple consultation documents;
A first extraction unit, configured to extract k consultation documents from the multiple consultation documents and to use the k consultation documents respectively as the initial cluster centers of k document categories, where k is a positive integer;
A first computing unit, configured to calculate the similarity between each of the other consultation documents (those of the multiple consultation documents other than the k consultation documents) and each initial cluster center;
A second acquisition unit, configured to obtain, for each of the other consultation documents, the initial cluster center corresponding to the maximum of its similarities;
A first classification unit, configured to classify each of the other consultation documents into the document category of the initial cluster center corresponding to its maximum similarity;
A second extraction unit, configured to extract the keywords of each consultation document in a document category to obtain the consultation hotspot corresponding to that document category;
An analysis unit, configured to analyze the consultation hotspot.
9. The consultation hotspot analysis device according to claim 7, characterized by further comprising:
A word segmentation processing unit, configured to perform word segmentation on each consultation document to obtain the multiple consultation words corresponding to each consultation document;
A keyword extraction unit, configured to extract keywords from the consultation words of each consultation document to obtain the keywords corresponding to each consultation document;
The first computing unit being configured to calculate, according to the keywords, the similarity between each of the other consultation documents and each initial cluster center.
10. The consultation hotspot analysis device according to claim 8, characterized in that the word segmentation processing unit includes:
An atomic segmentation unit, configured to perform atomic segmentation on the original character string contained in each consultation document to obtain an atomic segmentation result;
A shortest-path rough segmentation unit, configured to perform N-shortest-path rough segmentation on the atomic segmentation result to obtain N word segmentation results;
A second computing unit, configured to calculate the first distance of every path that exists between the word at one end of the binary word segmentation table and the word at the other end of the table;
A determining unit, configured to determine the minimum first distance among the first distances and to take the words contained in the path corresponding to that minimum as the consultation words.
11. The consultation hotspot analysis device according to claim 8, characterized in that the keyword extraction unit includes:
A first statistics unit, configured to count the number of times each consultation word occurs in the consultation document;
A standardization unit, configured to standardize the number of times each consultation word occurs in the consultation document to obtain the term frequency of each consultation word;
A second statistics unit, configured to count, in a corpus, the number of documents that contain each consultation word;
An inverse document frequency computing unit, configured to calculate the inverse document frequency of each consultation word from the total number of documents in the corpus and the number of those documents that contain the consultation word;
A frequency computing unit, configured to multiply the term frequency of each consultation word by its inverse document frequency to obtain the frequency calculation result of each consultation word;
A selection unit, configured to select, as keywords of the consultation document, the consultation words whose frequency calculation results exceed a preset threshold.
12. The consultation hotspot analysis device according to claim 8, characterized in that the first computing unit includes:
A third acquisition unit, configured to obtain all keywords contained in the two consultation documents whose similarity is to be calculated, where one of the two documents is one of the k consultation documents serving as initial cluster centers and the other is a consultation document other than the k consultation documents;
A third computing unit, configured to count the number of times each of these keywords occurs in the consultation document serving as an initial cluster center;
A term frequency vector unit, configured to obtain, from those counts, the term frequency vector of all the keywords in the consultation document serving as an initial cluster center;
An included-angle cosine computing unit, configured to calculate, using the cosine law, the cosine of the angle between the term frequency vector of all the keywords in the consultation document serving as an initial cluster center and the term frequency vector of all the keywords in the consultation document other than the k consultation documents;
Where the cosine of the angle represents the similarity between the consultation document serving as an initial cluster center and the consultation document other than the k consultation documents.
13. The consultation hotspot analysis device according to claim 7 or 8, characterized in that the second acquisition unit includes:
A fourth computing unit, configured to calculate, using the similarity, the second distance from each of the other consultation documents (those other than the k consultation documents) to each initial cluster center;
A fourth acquisition unit, configured to obtain the initial cluster center corresponding to the minimum second distance;
Where the smaller the second distance, the greater the similarity.
14. The consultation hotspot analysis device according to claim 12, characterized in that the first classification unit includes:
A second classification unit, configured to assign each of the other consultation documents (those other than the k consultation documents) to the document category of the initial cluster center corresponding to its minimum second distance;
A judging unit, configured to judge whether every consultation document other than those chosen as initial cluster centers has been assigned to one of the k document categories;
A fifth computing unit, configured to recalculate, when the result of the judging unit is yes, the center of each of the k document categories to obtain k first cluster centers;
A first comparison unit, configured to compare whether the k first cluster centers are the same as the k initial cluster centers;
A sixth computing unit, configured to take, when the first comparison unit finds them different, the k first cluster centers as the new cluster centers of the k document categories, and to calculate, using the similarity, the third distance from each of the multiple consultation documents to each new cluster center;
A third classification unit, configured to assign each of the multiple consultation documents to the document category of the new cluster center corresponding to its minimum third distance;
A second judging unit, configured to judge whether all of the multiple consultation documents have been assigned to the k document categories;
A seventh computing unit, configured to recalculate, when the result of the second judging unit is yes, the center of each of the k document categories to obtain k second cluster centers;
A second comparison unit, configured to compare whether the k second cluster centers are the same as the centers of the previous clustering;
The sixth computing unit being further configured to take, when the second comparison unit finds them different, the k second cluster centers as the new cluster centers of the k document categories, and to calculate, using the similarity, the third distance from each of the multiple consultation documents to each new cluster center.
CN201610974447.8A 2016-11-04 2016-11-04 Consultation hotspot analysis method and device Pending CN106528768A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610974447.8A CN106528768A (en) 2016-11-04 2016-11-04 Consultation hotspot analysis method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610974447.8A CN106528768A (en) 2016-11-04 2016-11-04 Consultation hotspot analysis method and device

Publications (1)

Publication Number Publication Date
CN106528768A true CN106528768A (en) 2017-03-22

Family

ID=58349769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610974447.8A Pending CN106528768A (en) 2016-11-04 2016-11-04 Consultation hotspot analysis method and device

Country Status (1)

Country Link
CN (1) CN106528768A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107193872A (en) * 2017-04-14 2017-09-22 深圳前海微众银行股份有限公司 Question and answer data processing method and device
CN107315830A (en) * 2017-07-10 2017-11-03 深圳市视维科技股份有限公司 A kind of method and system of intellectual analysis document
CN108846429A (en) * 2018-05-31 2018-11-20 清华大学 Cyberspace resource automatic classification method and device based on unsupervised learning
CN110022242A (en) * 2018-12-13 2019-07-16 北京神州绿盟信息安全科技股份有限公司 A kind of keyword determines method and device
CN110110326A (en) * 2019-04-25 2019-08-09 西安交通大学 A kind of text cutting method based on subject information
CN110851562A (en) * 2019-08-19 2020-02-28 湖南正宇软件技术开发有限公司 Information acquisition method, system, equipment and storage medium
CN111311450A (en) * 2020-02-28 2020-06-19 重庆百事得大牛机器人有限公司 Big data management platform and method for legal consultation service

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101414300A (en) * 2008-11-28 2009-04-22 电子科技大学 Method for sorting and processing internet public opinion information
CN102147792A (en) * 2010-02-09 2011-08-10 中国科学院计算技术研究所 Customized knowledge intelligent system
CN103324665A (en) * 2013-05-14 2013-09-25 亿赞普(北京)科技有限公司 Hot spot information extraction method and device based on micro-blog
CN104239436A (en) * 2014-08-27 2014-12-24 南京邮电大学 Network hot event detection method based on text classification and clustering analysis
CN105183855A (en) * 2015-09-08 2015-12-23 浪潮(北京)电子信息产业有限公司 Information classification method and system

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107193872A (en) * 2017-04-14 2017-09-22 深圳前海微众银行股份有限公司 Question and answer data processing method and device
CN107315830A (en) * 2017-07-10 2017-11-03 深圳市视维科技股份有限公司 Method and system for intelligent document analysis
CN108846429A (en) * 2018-05-31 2018-11-20 清华大学 Cyberspace resource automatic classification method and device based on unsupervised learning
CN110022242A (en) * 2018-12-13 2019-07-16 北京神州绿盟信息安全科技股份有限公司 Keyword determination method and device
CN110022242B (en) * 2018-12-13 2020-12-25 北京神州绿盟信息安全科技股份有限公司 Keyword determination method and device
CN110110326A (en) * 2019-04-25 2019-08-09 西安交通大学 Text cutting method based on subject information
CN110110326B (en) * 2019-04-25 2020-10-27 西安交通大学 Text cutting method based on subject information
CN110851562A (en) * 2019-08-19 2020-02-28 湖南正宇软件技术开发有限公司 Information acquisition method, system, equipment and storage medium
CN111311450A (en) * 2020-02-28 2020-06-19 重庆百事得大牛机器人有限公司 Big data management platform and method for legal consultation service
CN111311450B (en) * 2020-02-28 2024-03-29 重庆百事得大牛机器人有限公司 Big data management platform and method for legal consultation service

Similar Documents

Publication Publication Date Title
CN107609121B (en) News text classification method based on LDA and word2vec algorithm
CN106528768A (en) Consultation hotspot analysis method and device
CN108052583B (en) E-commerce ontology construction method
CN108304468B (en) Text classification method and text classification device
CN104778209B (en) Opinion mining method for million-scale news analysis
CN108073568A (en) Keyword extraction method and device
Wen et al. Research on keyword extraction based on word2vec weighted textrank
CN104199965B (en) Semantic information retrieval method
CN107844559A (en) File classification method, device and electronic equipment
CN107609052A (en) Method and device for generating a domain knowledge graph based on semantic triangles
CN107436875A (en) File classification method and device
CN110134792B (en) Text recognition method and device, electronic equipment and storage medium
CN109376352B (en) Patent text modeling method based on word2vec and semantic similarity
CN105389341B (en) Text clustering and analysis method for repeated incoming-call work orders of service calls
CN110297988A (en) Hot topic detection method based on weighted LDA and improved Single-Pass clustering algorithm
CN108170692A (en) Focus incident information processing method and device
CN110675269B (en) Text auditing method and device
Naik et al. Extractive text summarization by feature-based sentence extraction using rule-based concept
CN109947934A (en) Data mining method and system for short text
CN112100396A (en) Data processing method and device
CN110019713A (en) Data retrieval method and device, equipment and storage medium based on intent understanding
CN113254643A (en) Text classification method and device, electronic equipment and
CN112668323A (en) Text element extraction method based on natural language processing and text examination system thereof
Nguyen et al. An ensemble of shallow and deep learning algorithms for Vietnamese sentiment analysis
CN116108181A (en) Client information processing method and device and electronic equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 2017-03-22