WO2021128342A1 - Method and apparatus for document processing - Google Patents

Method and apparatus for document processing

Info

Publication number
WO2021128342A1
WO2021128342A1 (PCT/CN2019/129423)
Authority
WO
WIPO (PCT)
Prior art keywords
document
sentence
sentences
qth
frequency
Prior art date
Application number
PCT/CN2019/129423
Other languages
English (en)
French (fr)
Inventor
惠浩添
车效音
生若谷
李聪超
刘晓南
施尼盖斯丹尼尔
Original Assignee
Siemens Ltd., China (西门子(中国)有限公司)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Siemens Ltd., China
Priority to CN201980102556.2A (published as CN114746855A)
Priority to PCT/CN2019/129423
Publication of WO2021128342A1

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation

Definitions

  • This application relates to the field of natural language processing, and more specifically, to a method and device for document processing.
  • Natural language processing focuses on representing natural language as data that can be processed by a computer, for example, representing a document described in natural language as a form of vector. In this way, the computer can perform subsequent processing on the information in the document, thereby providing intelligent feedback based on the processing result.
  • Natural language processing can be applied to a variety of application scenarios, such as text classification, search engines, recommendation systems, diagnosis systems, or fault handling systems.
  • For example, a fault handling system can, based on historical fault description documents, provide solutions to new faults and prevent new faults from occurring.
  • To do so, the text in the historical fault description documents is processed.
  • In such applications, document representation is the key link: it determines the effectiveness of subsequent processing.
  • In view of this, the present application provides a method and device for document processing, which can improve the efficiency of document characterization.
  • In a first aspect, a document processing method is provided, including: determining, according to the similarity between the sentences of the i-th document and the N sentences of M documents, the frequency x_{i,q} corresponding to the q-th sentence of the N sentences in the i-th document, where M and N are both integers greater than 1, q is a positive integer not greater than N, and i is a positive integer; and determining the document representation of the i-th document according to the frequency x_{i,q}.
  • The technical solution of the embodiments of the present application determines the frequency corresponding to a sentence in a document from the similarity between sentences, and determines the document representation of the document from those frequencies. This reduces the dimensionality of the data and the processing resources required, thereby improving the efficiency of document characterization.
  • In some possible implementations, when i is not greater than M, the i-th document is one of the M documents.
  • In some possible implementations, when i is greater than M, the i-th document is a document outside the M documents.
  • In some possible implementations, determining the document representation of the i-th document includes: determining the inverse document frequency idf_q corresponding to the q-th sentence according to the frequencies corresponding to the q-th sentence in the M documents; and determining the document representation of the i-th document according to the frequency x_{i,q} corresponding to the q-th sentence in the i-th document and the inverse document frequency idf_q corresponding to the q-th sentence.
  • In some possible implementations, the inverse document frequency idf_q corresponding to the q-th sentence is negatively correlated with the number of documents among the M documents that contain the q-th sentence.
  • In some possible implementations, if the frequency corresponding to the q-th sentence in the p-th document of the M documents is greater than zero, the p-th document contains the q-th sentence, p being a positive integer not greater than M.
  • In some possible implementations, the frequency x_{i,q} corresponding to the q-th sentence in the i-th document is the sum of the similarities between the q-th sentence and each sentence of the i-th document.
  • In some possible implementations, determining the frequency x_{i,q} corresponding to the q-th of the N sentences in the i-th document includes: determining x_{i,q} according to the K most similar sentences, among the N sentences, of each sentence of the i-th document, where K is a positive integer less than N. Using only the K most similar sentences reduces the amount of computation and preserves the efficiency of document characterization.
  • K and N are positively correlated.
  • In some possible implementations, the frequency x_{i,q} corresponding to the q-th sentence in the i-th document is the sum of the K-nearest-neighbor similarities between the q-th sentence and each sentence of the i-th document, where: if the q-th sentence belongs to the K most similar sentences of the j-th sentence of the i-th document, the K-nearest-neighbor similarity between the q-th sentence and the j-th sentence is the similarity between the two sentences; if it does not, the K-nearest-neighbor similarity is zero; j is a positive integer not greater than n_i, and n_i is the number of sentences of the i-th document.
  • In some possible implementations, the frequency x_{i,q} corresponding to the q-th sentence in the i-th document is the q-th element of the N-dimensional vector x_i, where x_i = Σ_{j=1}^{n_i} δ^{(i,j)}; for the N-dimensional vector δ^{(i,j)}, if the l-th of the N sentences belongs to the K most similar sentences of the j-th sentence of the i-th document, the l-th element of δ^{(i,j)} is the similarity between the l-th sentence and that j-th sentence, and otherwise the l-th element is zero; n_i is the number of sentences of the i-th document, j is a positive integer not greater than n_i, and l is a positive integer not greater than N.
  • The n_i sentences of the i-th document are summed over, which combines the similarity information of the n_i sentences and thus reflects the structured information of the document.
  • In some possible implementations, determining the inverse document frequency idf_q corresponding to the q-th sentence includes determining it according to the following formula, where |·| denotes the cardinality of a set:

    idf_q = log( M / |{ i : x_{i,q} > 0 }| )
  • In some possible implementations, determining the document representation of the i-th document includes determining the document representation z_i according to the following formula, where ||·|| denotes the 2-norm and the q-th element y_{i,q} of the N-dimensional vector y_i is y_{i,q} = x_{i,q} · idf_q:

    z_i = y_i / ||y_i||
  • The product of the similarity-based frequency of a sentence in a document and the inverse document frequency corresponding to the sentence is used as the weight of the sentence in the document.
  • The document representation obtained on this basis reflects the structured information within a document and the structured information between documents, and can therefore characterize the document effectively.
  • In some possible implementations, the method further includes: obtaining the sentence representation of each sentence according to a first model, and determining the similarity between sentences according to the sentence representations; the first model is obtained based on multiple word embedding models, and it concatenates multiple vectors, namely the vectors of a sentence's words obtained according to the multiple word embedding models, into the sentence representation of the sentence.
  • In some possible implementations, the document is a Chinese document, and the method further includes: performing word segmentation on sentences to obtain the words of each sentence.
  • In some possible implementations, the word segmentation includes: obtaining the initial word sequence of a sentence; and performing reverse maximum matching on the initial word sequence according to a general dictionary and a professional dictionary of the field to which the document belongs, to obtain the general words and the field-specific words in the sentence.
  • In some possible implementations, the method further includes: performing document processing according to the document characterization.
  • In some possible implementations, the i-th document is a fault description document corresponding to an unresolved fault, and the document processing according to the document characterization includes: determining, according to the document characterizations, the reference document with the highest similarity to the i-th document among the M documents, where the reference document is a fault description document corresponding to a resolved fault, and the solution of the fault corresponding to the reference document is used to handle the fault corresponding to the i-th document.
  • the similarity is cosine similarity.
  • In a second aspect, a document processing apparatus is provided, including modules that perform the method of the first aspect or any possible implementation thereof.
  • In a third aspect, a document processing device is provided, including: a memory for storing a program; and a processor for executing the program stored in the memory, the processor performing the above document processing method when the stored program is executed.
  • In a fourth aspect, the present application further provides a computer-readable storage medium storing program code for execution by a device, the program code including instructions for performing the steps of the above document processing method.
  • In a fifth aspect, the present application further provides a computer program product comprising a computer program stored on a computer-readable storage medium, the computer program including program instructions that, when executed by a computer, cause the computer to perform the above document processing method.
  • Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present application.
  • Fig. 2 is a schematic diagram of a document processing process in an embodiment of the present application.
  • Fig. 3 is a schematic flowchart of a document processing method according to an embodiment of the present application.
  • Fig. 4 is a schematic block diagram of a document processing apparatus according to an embodiment of the present application.
  • Fig. 5 is a schematic block diagram of a document processing apparatus according to another embodiment of the present application.
  • Fig. 6 is a schematic structural diagram of a document processing apparatus according to an embodiment of the present application.
  • It should be understood that the sequence numbers of the processes do not imply an order of execution; the execution order of the processes is determined by their functions and internal logic, and does not constitute any limitation on the implementation of the embodiments of the present application.
  • Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present application.
  • the data generating device 120 is a device that generates data to be processed, for example, a device that generates a document to be processed. There may be multiple data generating devices 120.
  • the document generated by the data generating device 120 may be directly transmitted to the processing device 110, or may be stored in the database 140 first, and then obtained from the database 140 by the processing device 110.
  • Taking fault description documents as an example, a fault description document newly generated by the data generating device 120 can be transmitted to the processing device 110 in real time, while the processing device 110 obtains historical documents from the database 140.
  • the processing device 110 is in communication connection with the data generating device 120.
  • the processing device 110 may include a communication interface 112 to implement a communication connection with other devices.
  • the communication connection can be wired or wireless.
  • the processing device 110 may be an electronic device or system with data processing capabilities, such as a computer.
  • the processing device 110 may include a processing module 111 for implementing data processing, for example, using the technical solutions of the embodiments of the present application to perform document characterization.
  • the processing module 111 may specifically be one or more processors.
  • the processor may be any type of processor, which is not limited in the embodiment of the present application.
  • the processing device 110 may also include a storage system 113.
  • the storage system 113 may be used to store data and instructions, for example, computer-executable instructions that implement the technical solutions of the embodiments of the present application.
  • the processing device 110 can call data, instructions, etc. in the storage system 113, and can also store data, instructions, etc. in the storage system 113.
  • the storage system 113 may specifically be one or more memories.
  • the memory can be any type of memory, which is not limited in the embodiment of the present application.
  • the storage system 113 may be installed in the processing device 110 or outside the processing device 110. In the case where the storage system 113 is provided outside the processing device 110, the processing device 110 may implement access to the storage system 113 through a data interface.
  • the processing device 110 may also include other general equipment, such as an output device, for outputting processed data to the user.
  • the processing module 111 may include a preprocessing module 114 for preprocessing the acquired data. For example, perform word segmentation processing on documents.
  • a trained model 115 may be configured in the processing device 110.
  • the processing module 111 can use the model 115 to perform corresponding processing.
  • the model 115 may be a sentence embedding model for sentence characterization.
  • the training device 130 may train based on the training data in the corpus 150 to obtain a sentence embedding model.
  • the processing module 111 can use the sentence embedding model to obtain the characterization of the sentence.
  • For an input document, word segmentation can first be performed by the preprocessing module 114 to obtain the words of each sentence; these are then input into the model 115 to obtain the sentence representations; document characterization is then performed using the technical solution of the embodiments of the present application described below.
  • the processing device 110 may be a data generating device at the same time. In this case, the processing device 110 may generate a document to be processed and process the document.
  • the processing device 110 may be a training device at the same time. In this case, the processing device 110 may train the model 115 first, and then use the model 115, and may also train the model 115 at the same time during use.
  • FIG. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • The model 115 trained by the training device 130 may be a model based on deep learning, for example a model built on a neural network, where the neural network may be a convolutional neural network (CNN), a recurrent neural network (RNN), a deep convolutional neural network (DCNN), and so on.
  • Figure 2 shows a schematic diagram of a document processing process in an embodiment of the present application.
  • For a document 201 to be processed, word segmentation can first be performed with the word segmentation tool 202 to obtain the initial word sequence 203; the processed words 207 are then obtained by reverse maximum matching 206.
  • The above constitutes the preprocessing stage. Since documents in the industrial field contain many specialized terms, a general dictionary obtained from an external data source and a professional dictionary, configured on the client side, of the field to which the document belongs can be used for the reverse maximum matching.
  • The preprocessed words 207 are passed through the sentence embedding model 208 to obtain sentence vectors 209 (sentence representations).
  • the sentence embedding model 208 may be a pre-trained model. The sentence embedding model will be described in detail below.
  • Based on the sentence vectors 209, document characterization can be performed to obtain the document representation 210; the detailed scheme of document characterization is described below.
  • Finally, based on the document representation 210, subsequent processing 211 may be performed, for example, document analysis, comparison, classification, or classifier training.
  • A commonly used document representation method is term frequency-inverse document frequency (TFIDF).
  • The TFIDF method is used to evaluate the importance of a word to one document within a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency in the corpus. If a word has a high frequency in one document and rarely appears in other documents, the word is considered to have good class-discriminating ability and to be suitable for classification.
  • TFIDF is simply TF*IDF, where TF is the term frequency and IDF is the inverse document frequency.
  • TF represents the corresponding frequency of the word in the document.
  • the main idea of IDF is: if there are fewer documents containing the word, the larger the IDF, indicating that the word has a good ability to distinguish categories.
  • Specifically, the IDF of a word can be obtained by dividing the total number of documents by the number of documents containing the word, and then taking the base-10 logarithm of the quotient.
  • The TFIDF method uses the TFIDF values of words to form the representation (vector) of a document, so the dimension of the vector is determined by the number of distinct words. Since the number of words in documents is relatively large, this method is prone to the curse of dimensionality, which hurts the efficiency of document representation.
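  • For contrast, the classical word-level scheme can be sketched as follows (a minimal sketch, assuming scikit-learn is available; the example documents are invented):

```python
# Classical word-level TFIDF: the representation has one dimension per
# distinct word, so the vector width grows with the vocabulary.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "pump overheats under sustained load",           # invented toy documents
    "pump bearing noise under heavy load",
    "controller firmware update fails on restart",
]
X = TfidfVectorizer().fit_transform(docs)
print(X.shape)  # (3, vocabulary size): every new word adds a dimension
```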
  • the embodiments of the present application provide an improved technical solution to obtain the "TFIDF" of the sentence based on the similarity of the sentence.
  • the technical solution of the embodiment of the present application may be referred to as a pseudo TFIDF document characterization method.
  • the technical solutions of the embodiments of the present application will be described in detail below.
  • Fig. 3 shows a schematic flowchart of a document processing method according to an embodiment of the present application. This method may be executed by the processing device 110 in FIG. 1.
  • When i is not greater than M, the i-th document is one of the M documents.
  • When i is greater than M, the i-th document is a document outside the M documents.
  • When characterizing documents, multiple documents can be characterized together the first time; when new documents subsequently appear, they can be characterized against the already-characterized documents (for example, when new documents are few), or re-characterized together with them (for example, when new documents are many).
  • Accordingly, the technical solutions of the embodiments of the present application cover two cases. One case is characterizing the M documents together, that is, when i is not greater than M, characterizing the i-th document among the M documents. The other case is characterizing a new document based on the M already-characterized documents, that is, when i is greater than M, characterizing a document outside the M documents.
  • the documents mentioned above may be a type of documents in a specific field.
  • For example, for a fault handling system in the industrial field, they may be the fault description documents of the same type of equipment.
  • the fault description document may include the event description of the fault, and for the processed fault, it may also include the cause, solution, or classification of the fault.
  • For example, the i-th fault description document can be expressed as D_i or as (D_i, o_i), where D_i denotes the event-description part of the fault, and o_i denotes the other parts, such as the corresponding cause, solution, or classification of the fault.
  • For an unresolved fault, the document is just D_i; for a resolved fault, the document includes both parts.
  • Document characterization can therefore focus only on D_i; in subsequent applications, the corresponding o_i can be retrieved again through D_i.
  • the technical solutions of the embodiments of the present application can be applied to various documents and are not limited to the foregoing examples.
  • the embodiment of the present application does not limit the language type of the document, for example, it may be Chinese or non-Chinese.
  • corresponding technical solutions are further provided for Chinese documents.
  • In the embodiments of the present application, document characterization is performed based on the frequency corresponding to a sentence in a document, where that frequency is obtained through the similarity between sentences. That is, the frequency is not the number of times the sentence actually appears in the document, but a weight value obtained from the similarities between sentences.
  • Optionally, the frequency x_{i,q} corresponding to the q-th sentence in the i-th document is the sum of the similarities between the q-th sentence and each sentence of the i-th document. That is, the frequency corresponding to a sentence in a document is the sum of the similarities between that sentence and all sentences of the document.
  • the similarity is a cosine similarity, but the embodiment of the present application does not limit this.
  • Cosine similarity evaluates the similarity of two vectors by computing the cosine of the angle between them.
  • each sentence can be represented as a multi-dimensional vector, and the cosine similarity of two sentences is the cosine value of the angle between the two vectors representing the two sentences.
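  • As a minimal sketch of these two definitions (using NumPy; the function names are illustrative, not part of the application):

```python
import numpy as np

def cosine_similarity(u, v):
    # cosine of the angle between two sentence vectors
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def frequency_x(q_vec, doc_sentence_vecs):
    # x_{i,q}: sum of the similarities between the q-th sentence and
    # every sentence of the i-th document (the non-KNN variant)
    return sum(cosine_similarity(q_vec, s) for s in doc_sentence_vecs)
```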
  • Optionally, in the embodiments of the present application, the K-nearest-neighbor (KNN) algorithm can be adopted to reduce the amount of computation.
  • the frequency of the sentence in the document can be determined according to the K most similar sentences in the sentence space (the above N sentences), where K is a positive integer less than N.
  • the K most similar sentences of a sentence are the K nearest neighbors of the sentence in the sentence space, that is, the K sentences have the highest similarity to the sentence.
  • the sentence space is composed of N sentences of M documents.
  • When the M documents are characterized together, the sentences of the i-th document are among the N sentences, and their K most similar sentences among the N sentences are used to determine the frequency x_{i,q}.
  • When a new document outside the M documents is characterized, the sentences of the i-th document are not among the N sentences, but their K most similar sentences among the N sentences are still used to determine the frequency x_{i,q}.
  • the value of K can be predetermined based on the total number of sentences, and can also be adjusted continuously.
  • K can be positively correlated with N, that is, when N is larger, K can take a larger value.
  • For another example, if the K selected sentences already include sentences with relatively low similarity, the value of K can be reduced; otherwise, the value of K can be increased.
  • the value range of K may be 5-30, but the embodiment of the present application does not limit this.
  • the similarity between a sentence and each sentence in a document can be set using the KNN algorithm. Specifically, if the sentence belongs to the K most similar sentences of a certain sentence of the document, the similarity can be the similarity of the two sentences, otherwise it can be set to zero.
  • For ease of description, the similarity after applying the KNN algorithm is referred to as the K-nearest-neighbor similarity.
  • Correspondingly, if the q-th sentence belongs to the K most similar sentences of the j-th sentence of the i-th document, the K-nearest-neighbor similarity between the q-th sentence and the j-th sentence of the i-th document is the similarity between the two sentences; if the q-th sentence does not belong to the K most similar sentences of the j-th sentence, the K-nearest-neighbor similarity between them is zero. Here j is a positive integer not greater than n_i, and n_i is the number of sentences of the i-th document.
  • In this case, the frequency x_{i,q} corresponding to the q-th sentence in the i-th document is the sum of the K-nearest-neighbor similarities between the q-th sentence and each sentence of the i-th document.
  • Optionally, the frequency corresponding to a sentence in a document can be determined as follows.
  • For the j-th sentence of the i-th document, find its K most similar sentences among the N sentences, and introduce an N-dimensional vector δ^{(i,j)}: if the l-th of the N sentences belongs to the K most similar sentences of the j-th sentence, the l-th element of δ^{(i,j)} is the similarity between the l-th sentence and the j-th sentence (alternatively, the l-th element may simply be set to 1); otherwise, the l-th element is zero.
  • For the i-th document,

    x_i = Σ_{j=1}^{n_i} δ^{(i,j)}        (1)

  • The q-th element x_{i,q} of the N-dimensional vector x_i is the frequency corresponding to the q-th of the N sentences in the i-th document; n_i is the number of sentences of the i-th document, j is a positive integer not greater than n_i, and l is a positive integer not greater than N.
  • In formula (1), the n_i sentences of the i-th document are summed over, which combines the similarity information of the n_i sentences and thus reflects the structured information of the document.
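  • A sketch of formula (1) in NumPy follows; `sentence_vecs` holds all N sentence vectors and `doc_sentences` the indices of the i-th document's sentences (both names are illustrative assumptions):

```python
import numpy as np

def pseudo_tf(sentence_vecs, doc_sentences, K):
    # Returns x_i: its q-th entry is the sum, over the document's sentences,
    # of the K-nearest-neighbor similarity to sentence q (formula (1)).
    V = np.asarray(sentence_vecs, dtype=float)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)  # unit norm: dot = cosine
    N = len(V)
    x_i = np.zeros(N)
    for j in doc_sentences:
        sims = V @ V[j]                    # cosine similarities to sentence j
        neighbors = np.argsort(sims)[-K:]  # the K most similar sentences of j
        delta = np.zeros(N)                # the vector written δ^{(i,j)} above
        delta[neighbors] = sims[neighbors]
        x_i += delta
    return x_i
```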
  • Through the above procedure, the frequency corresponding to the q-th sentence in each of the M documents can be obtained.
  • The inverse document frequency idf_q corresponding to the q-th sentence can then be determined according to the frequencies corresponding to the q-th sentence in the M documents.
  • That is to say, after the similarity-based frequency of a sentence in each document is obtained, the inverse document frequency corresponding to the sentence can be obtained next.
  • The inverse document frequency idf_q corresponding to the q-th sentence may be negatively correlated with the number of documents among the M documents that contain the q-th sentence. That is, the more of the M documents contain the q-th sentence, the smaller idf_q; the fewer of the M documents contain the q-th sentence, the larger idf_q. In this way, the inverse document frequency idf_q reflects the distinguishing power of sentences.
  • In one embodiment, if the frequency corresponding to the q-th sentence in the p-th document of the M documents is greater than zero, the p-th document contains the q-th sentence, p being a positive integer not greater than M. In other words, whether a document contains a sentence can be judged by whether the frequency corresponding to that sentence in the document is greater than zero.
  • Optionally, on the basis of formula (1), the inverse document frequency idf_q corresponding to the q-th sentence may be determined according to the following formula, where |·| denotes the cardinality of a set:

    idf_q = log( M / |{ i : x_{i,q} > 0 }| )        (2)
  • In formula (2), x_{i,q} > 0 indicates that the q-th sentence has appeared in the i-th document. Accordingly, formula (2) shows that the more documents the q-th sentence appears in, the smaller the inverse document frequency idf_q; the fewer documents it appears in, the larger idf_q. The inverse document frequency idf_q therefore reflects the distinguishing power of sentences.
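  • A sketch of formula (2), where X is the M×N matrix whose rows are the vectors x_i; the logarithm base is an assumption here, following the base-10 convention of the classical IDF described above:

```python
import numpy as np

def inverse_document_frequency(X):
    # idf_q = log( M / |{ i : x_{i,q} > 0 }| )    (formula (2))
    M = X.shape[0]
    docs_containing = np.count_nonzero(X > 0, axis=0)
    return np.log10(M / docs_containing)
```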
  • After the frequency x_{i,q} corresponding to the q-th sentence in the i-th document and the inverse document frequency idf_q corresponding to the q-th sentence are obtained, the document representation of the i-th document can be determined according to x_{i,q} and idf_q.
  • As noted above, the larger the inverse document frequency of a sentence, the more distinguishing the sentence. Therefore, the inverse document frequency corresponding to a sentence and the frequency of the sentence in the document can be combined to express the weight of the sentence in the document, from which the characterization of the document is obtained.
  • the product of the frequency of the sentence in the document and the inverse document frequency of the sentence may be used as the weight of the sentence in the document.
  • For example, on the basis of formulas (1) and (2), the weight of the q-th sentence in the i-th document can be

    y_{i,q} = x_{i,q} · idf_q        (3)

  • y_{i,q} is taken as the q-th element of the N-dimensional vector y_i corresponding to the N sentences. Normalizing y_i yields the document representation z_i of the i-th document:

    z_i = y_i / ||y_i||        (4)

    where ||·|| denotes the 2-norm.
  • In the above manner, an N-dimensional vector representation of the i-th document is obtained. That is, the dimension of the document representation obtained by the technical solution of the embodiments equals the number of sentences. Compared with representations whose dimension is the number of words, the data dimension of this technical solution is reduced, which reduces the amount of data computation and improves the efficiency of document characterization.
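  • Formulas (3) and (4) then combine into a few lines (continuing the NumPy sketch above; each row of the result is one document representation z_i):

```python
import numpy as np

def pseudo_tfidf(X, idf):
    # y_{i,q} = x_{i,q} * idf_q  (3);  z_i = y_i / ||y_i||  (4)
    Y = X * idf
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)
```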
  • As an example, when the M documents are characterized together (that is, i is not greater than M), document characterization can use the following algorithm.

    Algorithm: Pseudo-TFIDF Document Representation
    Input: the sentence vectors of the documents D_i (or (D_i, o_i) where o_i exists); an integer K for KNN.
    Output: the representation feature matrix z for all documents.
    Notation: D_i, the i-th document; s_{i,j}, the j-th sentence of D_i; M, the number of documents; n_i, the number of sentences in the i-th document; N, the total number of sentences.
    Step 1: Sequentially assign every sentence an index from 1 to N.
    Step 2: For every sentence s_{i,j}, find its K nearest neighbors in the sentence space via a distance metric, such as cosine similarity, and form the N-dimensional vector δ^{(i,j)} whose l-th entry is the similarity to the l-th sentence if that sentence is among the K nearest neighbors, and zero otherwise.
    Step 3: For every D_i, compute x_i = Σ_{j=1}^{n_i} δ^{(i,j)}.
    Step 4: Let x_{i,q} denote the q-th component of x_i, and define idf_q = log( M / |{ i : x_{i,q} > 0 }| ).
    Step 5: For every D_i, compute y_{i,q} = x_{i,q} · idf_q and z_i = y_i / ||y_i||.
    Step 6: Based on z, perform the clustering task, or train a classifier for the classification task using z.
  • For example, suppose the fault description document D includes three sentences s1, s2 and s3, whose sentence vectors are v1, v2 and v3.
  • For each of these sentences, the K most similar sentences (K set to 3 for ease of understanding) are found among the N sentences, for instance {s1, s4, s6} for s1.
  • The pseudo-TF representation of document D can then be obtained according to formula (1), for example: [1, 1, 1, 0.8, 0, 2.47, 0.95, 0, 0, 0.88, ..., 0, 0, 0, 0, 0], an N-dimensional vector.
  • Combined with the inverse document frequencies, the pseudo-TFIDF document characterization is then obtained.
  • The technical solution of the embodiments performs document characterization based on the similarity-based frequencies of sentences in documents and the inverse document frequencies of sentences. The obtained document representation reflects the structured information within a document and between documents, and it is sparse, so it characterizes documents effectively.
  • Compared with word-level TFIDF, the dimensionality of the pseudo-TFIDF characterization scheme of the embodiments is greatly reduced. Faster document processing can therefore be obtained without excessive computing and storage resources, improving the efficiency of document processing.
  • In addition, the document characterization scheme of the embodiments does not require a large amount of training data, which avoids the shortage of training data caused by the confidentiality of industrial documents.
  • In the above process, the characterization of each sentence (that is, the vector of each sentence) can be obtained in various ways; the characterization of the document is then obtained in the manner described above.
  • the sentence representation of the sentence can be obtained according to the first model (sentence embedding model).
  • the sentence embedding model may be a pre-trained model.
  • the sentence embedding model can be pre-trained through deep learning.
  • the sentence embedding model may not be trained, but a sentence embedding model based on multiple word embedding models may be directly used.
  • the first model may be obtained based on multiple word embedding models.
  • the simplest method for finding sentence embedding is to average the word embedding of all the words in the sentence, but the effect is poor.
  • the operation of averaging word embedding can be generalized to a type of p-mean operation, and then different p-values can be used to generate different features.
  • Specifically, for word vectors u_1, ..., u_n, the p-mean can be expressed as

    ( (u_1^p + u_2^p + ⋯ + u_n^p) / n )^{1/p},    p ∈ ℝ ∪ {±∞}

    applied element-wise, where p = 1 gives the average, p = +∞ the element-wise maximum, and p = −∞ the element-wise minimum.
  • The above three operations (p = 1, +∞ and −∞) can be used together to boost the effect: different embedding models contribute complementary information, and different p values introduce richer information.
  • Multiple word embedding models can be used, such as word2vec, fasttext and glove.
  • Specifically, the p-mean operations are applied to the embeddings (that is, the vectors) of each model, and the results are concatenated.
  • Let u, v and w denote the word vectors obtained from the different word embedding models, s_p denote the sentence vector for p ∈ {1, +∞, −∞}, and n denote the number of words in the sentence; the different s_p are concatenated, denoted by the symbol ⊕, into a combined sentence vector, which is the sentence representation (sentence embedding).
  • For example, if the dimension of each word embedding model is 100, the dimensions of s_1, s_{+∞} and s_{−∞} are each 300, and vector concatenation yields a 900-dimensional sentence vector s.
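  • A minimal sketch of this concatenated p-mean scheme (each argument is the matrix of one model's word vectors for the sentence, e.g. from word2vec, fasttext and glove; the function name is illustrative):

```python
import numpy as np

def p_mean_sentence_vector(*model_word_vecs):
    # For each p in {1, +inf, -inf}: pool every model's word vectors
    # element-wise and concatenate across models to get s_p; then
    # concatenate the s_p into the combined sentence vector.
    s_parts = []
    for pool in (np.mean, np.max, np.min):     # p = 1, +inf, -inf
        s_p = np.concatenate([pool(np.asarray(W, dtype=float), axis=0)
                              for W in model_word_vecs])
        s_parts.append(s_p)
    return np.concatenate(s_parts)

# Three 100-dimensional models give s_1, s_+inf and s_-inf of 300 dimensions
# each, and a combined 900-dimensional sentence vector.
```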
  • In this way, a sentence embedding model can be obtained: according to the multiple word embedding models, multiple vectors of the sentence's words are obtained, and the sentence embedding model concatenates these vectors into the sentence representation (vector) of the sentence.
  • In the above manner, the sentence representation of each sentence can be obtained according to the sentence embedding model; based on the sentence representations, the similarity between sentences is determined, and the document representation is then obtained by the aforementioned method.
  • the sentence can be segmented in the following manner:
  • First, the initial word sequence of the sentence is obtained, for example, with an existing word segmentation tool such as jieba.
  • Then, reverse maximum matching is performed on the initial word sequence according to an automatically created dictionary, where the automatically created dictionary includes the general dictionary obtained from external data sources and the professional dictionary, configured on the client side, of the field to which the document belongs. After the reverse maximum matching, a segmentation consistent with both general and professional-field understanding is obtained, which improves the quality of word segmentation.
  • the sentence words obtained in the above manner can be used for the aforementioned sentence characterization and document characterization, so that the efficiency of Chinese document characterization can be improved.
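  • A minimal sketch of reverse (backward) maximum matching; for simplicity it matches directly over the character string rather than over the initial word sequence, and the dictionary and maximum word length are assumptions:

```python
def reverse_maximum_matching(text, dictionary, max_word_len=8):
    # Scan from the end of the string; at each position take the longest
    # dictionary entry ending there, falling back to a single character.
    words, end = [], len(text)
    while end > 0:
        for size in range(min(max_word_len, end), 0, -1):
            candidate = text[end - size:end]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                end -= size
                break
    return list(reversed(words))

# A combined general + professional dictionary keeps domain terms intact, e.g.
# reverse_maximum_matching("伺服电机过热", {"伺服电机", "过热"})
# returns ["伺服电机", "过热"].
```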
  • Subsequent document processing can then be performed according to the document characterization, for example, document analysis, comparison, classification, or classifier training.
  • Optionally, document processing related to fault handling may be performed according to the document characterization.
  • Specifically, for example, the reference document with the highest similarity to the i-th document among the M documents may be determined according to the document characterizations, where the reference document is the fault description document corresponding to a resolved fault, and the solution of the fault corresponding to the reference document is used to handle the fault corresponding to the i-th document. That is, for a newly occurring fault, the most similar historical document can be found based on its fault description document, and the new fault handled according to the solution of the fault corresponding to that historical document. Handling faults this way automatically derives solutions from historical documents without manual processing, saving time and labor costs and improving processing efficiency.
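  • The retrieval step reduces to a dot product because each z_i is 2-norm normalized by formula (4); a sketch (names are illustrative):

```python
import numpy as np

def most_similar_reference(z_new, Z_history):
    # Z_history: rows are representations of resolved-fault documents.
    sims = Z_history @ z_new        # cosine similarities (rows are unit norm)
    best = int(np.argmax(sims))
    return best, float(sims[best])  # reference document index and its score
```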
  • FIG. 4 shows a schematic block diagram of a document processing apparatus 400 according to an embodiment of the present application.
  • the device 400 can execute the above-mentioned document processing method of the embodiment of the present application.
  • the device 400 can be the aforementioned processing device 110.
  • the apparatus 400 may include:
  • the obtaining unit 410 is used to obtain M documents
  • The document characterization unit 420 is configured to determine, according to the similarity between the sentences of the i-th document and the N sentences of the M documents, the frequency x_{i,q} corresponding to the q-th of the N sentences in the i-th document, where M and N are both integers greater than 1 and q is a positive integer not greater than N; and to determine the document representation of the i-th document according to the frequency x_{i,q}.
  • Optionally, when i is not greater than M, the i-th document is one of the M documents; when i is greater than M, the i-th document is a document outside the M documents.
  • Optionally, the document characterization unit 420 is specifically configured to: determine the inverse document frequency idf_q corresponding to the q-th sentence according to the frequencies corresponding to the q-th sentence in the M documents, and determine the document representation of the i-th document according to the frequency x_{i,q} and the inverse document frequency idf_q.
  • Optionally, the inverse document frequency idf_q corresponding to the q-th sentence is negatively correlated with the number of documents among the M documents that contain the q-th sentence.
  • Optionally, if the frequency corresponding to the q-th sentence in the p-th document of the M documents is greater than zero, the p-th document contains the q-th sentence, p being a positive integer not greater than M.
  • Optionally, the frequency x_{i,q} corresponding to the q-th sentence in the i-th document is the sum of the similarities between the q-th sentence and each sentence of the i-th document.
  • Optionally, the document characterization unit 420 is specifically configured to determine the frequency x_{i,q} according to the K most similar sentences, among the N sentences, of the sentences of the i-th document, where K is a positive integer less than N.
  • K and N are positively correlated.
  • Optionally, the frequency x_{i,q} corresponding to the q-th sentence in the i-th document is the sum of the K-nearest-neighbor similarities between the q-th sentence and each sentence of the i-th document, where: if the q-th sentence belongs to the K most similar sentences of the j-th sentence of the i-th document, the K-nearest-neighbor similarity between the q-th sentence and the j-th sentence is the similarity between the two sentences; if the q-th sentence does not belong to the K most similar sentences of the j-th sentence, the K-nearest-neighbor similarity is zero; j is a positive integer not greater than n_i, and n_i is the number of sentences of the i-th document.
  • Optionally, the frequency x_{i,q} corresponding to the q-th sentence in the i-th document is the q-th element of the N-dimensional vector x_i of formula (1), where the l-th element of δ^{(i,j)} is the similarity between the l-th sentence and the j-th sentence of the i-th document if the l-th of the N sentences belongs to that sentence's K most similar sentences, and zero otherwise; n_i is the number of sentences of the i-th document, j is a positive integer not greater than n_i, and l is a positive integer not greater than N.
  • Optionally, the document characterization unit 420 is specifically configured to determine the inverse document frequency idf_q corresponding to the q-th sentence according to formula (2), and to determine the document representation z_i of the i-th document according to formula (4), where ||·|| denotes the 2-norm and the q-th element of the N-dimensional vector y_i is y_{i,q} = x_{i,q} · idf_q.
  • the apparatus 400 may further include:
  • The sentence representation unit 430 is used to obtain the sentence representation of each sentence according to the first model;
  • the document characterization unit is also used to determine the similarity between sentences according to the sentence representations;
  • wherein the first model is obtained based on multiple word embedding models, and it concatenates multiple vectors, namely the vectors of a sentence's words obtained according to the multiple word embedding models, into the sentence representation of the sentence.
  • the document is a Chinese document, as shown in FIG. 5, the apparatus 400 may further include:
  • The word segmentation unit 440 is used to perform word segmentation on sentences to obtain the general words and the field-specific words in each sentence.
  • Optionally, the word segmentation unit 440 is specifically configured to: obtain the initial word sequence of a sentence, and perform reverse maximum matching on the initial word sequence according to the general dictionary and the professional dictionary of the field to which the document belongs, to obtain the words of the sentence.
  • the apparatus 400 further includes:
  • the processing unit 450 is configured to perform document processing according to the document characterization.
  • Optionally, the i-th document is a fault description document corresponding to an unresolved fault, and the processing unit 450 is specifically configured to: determine, according to the document characterizations, the reference document with the highest similarity to the i-th document among the M documents, where the reference document is the fault description document corresponding to a resolved fault, and the solution of the fault corresponding to the reference document is used to handle the fault corresponding to the i-th document.
  • the similarity is cosine similarity.
  • FIG. 6 is a schematic diagram of the hardware structure of the document processing apparatus according to an embodiment of the present application.
  • the document processing apparatus 600 shown in FIG. 6 includes a memory 601, a processor 602, a communication interface 603, and a bus 604. Among them, the memory 601, the processor 602, and the communication interface 603 implement communication connections between each other through the bus 604.
  • The memory 601 may be a read-only memory (ROM), a static storage device, or a random access memory (RAM).
  • the memory 601 may store a program. When the program stored in the memory 601 is executed by the processor 602, the processor 602 and the communication interface 603 are used to execute each step of the document processing method of the embodiment of the present application.
  • The processor 602 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, and is used to execute related programs to realize the functions required by the units of the document processing apparatus of the embodiments or to execute the document processing method of the embodiments.
  • the processor 602 may also be an integrated circuit chip with signal processing capability.
  • each step of the document processing method of the embodiment of the present application can be completed by an integrated logic circuit of hardware in the processor 602 or instructions in the form of software.
  • The aforementioned processor 602 may also be a general-purpose processor, a digital signal processor (DSP), an ASIC, a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
  • the methods, steps, and logical block diagrams disclosed in the embodiments of the present application can be implemented or executed.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the steps of the method disclosed in combination with the embodiments of the present application may be directly embodied as being executed and completed by a hardware processor, or executed and completed by a combination of hardware and software modules in the processor.
  • The software module can be located in a storage medium mature in the field, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • The storage medium is located in the memory 601; the processor 602 reads the information in the memory 601 and, in combination with its hardware, completes the functions required by the units included in the document processing apparatus of the embodiments, or performs the document processing method of the embodiments.
  • The communication interface 603 uses a transceiving device, such as but not limited to a transceiver, to implement communication between the apparatus 600 and other devices or communication networks.
  • the document to be characterized can be obtained through the communication interface 603.
  • the bus 604 may include a path for transferring information between various components of the device 600 (for example, the memory 601, the processor 602, and the communication interface 603).
  • the device 600 may also include other devices necessary for normal operation.
  • the apparatus 600 may also include hardware devices that implement other additional functions.
  • It should be understood that the apparatus 600 may also include only the components necessary to implement the embodiments of the present application, and need not include all the components shown in FIG. 6.
  • the embodiment of the present application also provides a computer-readable storage medium that stores program code for device execution, and the program code includes instructions for executing steps in the above-mentioned document processing method.
  • the embodiments of the present application also provide a computer program product.
  • the computer program product includes a computer program stored on a computer-readable storage medium.
  • The computer program includes program instructions that, when executed by a computer, cause the computer to execute the above-mentioned document processing method.
  • the aforementioned computer-readable storage medium may be a transitory computer-readable storage medium or a non-transitory computer-readable storage medium.
  • the disclosed device and method may be implemented in other ways.
  • The device embodiments described above are merely illustrative; for example, the division into units is only a logical functional division, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the various aspects, implementations, implementations or features in the described embodiments can be used alone or in any combination. All aspects in the described embodiments can be implemented by software, hardware, or a combination of software and hardware.
  • the described embodiments may also be embodied by a computer readable medium storing computer readable code, the computer readable code including instructions executable by at least one computing device.
  • the computer-readable medium can be associated with any data storage device capable of storing data, which can be read by a computer system.
  • The computer-readable medium may include, for example, read-only memory, random access memory, compact disc read-only memory (CD-ROM), hard disk drive (HDD), digital video disc (DVD), magnetic tape, optical data storage devices, etc.
  • the computer-readable medium may also be distributed in computer systems connected through a network, so that the computer-readable code can be stored and executed in a distributed manner.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method and apparatus for document processing. The method includes: determining, according to the similarity between the sentences of the i-th document and the N sentences of M documents, the frequency x_{i,q} corresponding to the q-th of the N sentences in the i-th document, where M and N are both integers greater than 1, q is a positive integer not greater than N, and i is a positive integer; and determining the document representation of the i-th document according to the frequency x_{i,q}. This can improve the efficiency of document characterization.

Description

文档处理的方法和装置 技术领域
本申请涉及自然语言处理领域,并且更为具体地,涉及一种文档处理的方法和装置。
背景技术
自然语言处理关注的是将自然语言表征为计算机能够处理的数据,例如,将采用自然语言描述的文档表征为向量的形式。这样,计算机能够对该文档中的信息进行后续的处理,从而可以基于处理结果提供智能的反馈。
自然语言处理可以应用于多种应用场景中,例如,文本分类、搜索引擎、推荐系统、诊断系统或故障处理系统等。例如,对于故障处理系统,可以基于历史故障描述文档,给新发生的故障提供解决方案以及预防新的故障发生等。比如,将历史故障描述文档中的文本进行处理等。在上述应用中,文档表征(Document Representation)是其中的关键环节,其决定了后续的后处理的有效性。
然而,文档表征往往需要处理大量的数据,影响了文档表征的效率。因此,如何提升文档表征的效率,成为一个亟待解决的技术问题。
发明内容
本申请提供了一种文档处理的方法和装置,能够提升文档表征的效率。
第一方面,提供了一种文档处理的方法,该方法包括:根据第i个文档的句子与M个文档的N个句子间的相似度,确定所述N个句子中第q个句子在所述第i个文档中对应的频率x i,q,其中,M和N均为大于1的整数,q为不大于N的正整数,i为正整数;根据所述第q个句子在所述第i个文档中对应的频率x i,q,确定所述第i个文档的文档表征。
本申请实施例的技术方案,根据句子间的相似度确定句子在文档中对应的频率,并根据句子在文档中对应的频率,确定文档的文档表征,可以降低 数据的维度,降低处理所需的资源需求,从而能够提高文档表征的效率。
在一些可能的实现方式中,当i不大于M时,所述第i个文档为所述M个文档中的文档。
在一些可能的实现方式中,当i大于M时,所述第i个文档为所述M个文档外的文档。
在一些可能的实现方式中,所述确定所述第i个文档的文档表征,包括:根据所述第q个句子在所述M个文档中对应的频率,确定所述第q个句子对应的逆文档频率idf q;根据所述第q个句子在所述第i个文档中对应的频率x i,q和所述第q个句子对应的逆文档频率idf q,确定所述第i个文档的文档表征。
在一些可能的实现方式中,所述第q个句子对应的逆文档频率idf q与所述M个文档中包含所述第q个句子的文档的数量负相关。
在一些可能的实现方式中,若所述第q个句子在所述M个文档中的第p个文档中对应的频率大于零,则所述第p个文档包含所述第q个句子,p不大于M的正整数。
在一些可能的实现方式中,所述第q个句子在所述第i个文档中对应的频率x i,q为所述第q个句子与所述第i个文档的每个句子的相似度的和。
在一些可能的实现方式中,所述确定所述N个句子中第q个句子在所述第i个文档中对应的频率x i,q,包括:根据所述第i个文档的句子在所述N个句子中的最相似的K个句子,确定所述第q个句子在所述第i个文档中对应的频率x i,q,其中,K为小于N的正整数。
采用最相似的K个句子,可以减少计算量,保证文档表征的效率。
在一些可能的实现方式中,K与N正相关。
在一些可能的实现方式中,所述第q个句子在所述第i个文档中对应的频率x i,q为所述第q个句子与所述第i个文档的每个句子的K最近邻相似度的和;其中,若所述第q个句子属于所述第i个文档的第j个句子的最相似的K个句子,则所述第q个句子与所述第i个文档的第j句子的K最近邻相似度为所述第q个句子与所述第i个文档的第j句子的相似度;若所述第q个句子不属于所述第i个文档的第j个句子的最相似的K个句子,则所述第q个句子与所述第i个文档的第j句子的K最近邻相似度为零,j为不大于 n i的正整数,n i为所述第i个文档的句子的数量。
在一些可能的实现方式中,所述第q个句子在所述第i个文档中对应的频率x i,q为N维向量x i的第q个元素,其中,
Figure PCTCN2019129423-appb-000001
其中,对于向量N维
Figure PCTCN2019129423-appb-000002
,若所述N个句子中的第l个句子属于第
Figure PCTCN2019129423-appb-000003
个句子的最相似的K个句子,则
Figure PCTCN2019129423-appb-000004
的第l个元素为所述第l个句子与所述第
Figure PCTCN2019129423-appb-000005
个句子的相似度,否则,
Figure PCTCN2019129423-appb-000006
的第l个元素为零,其中,所述第
Figure PCTCN2019129423-appb-000007
个句子为所述第i个文档的第j个句子,n i为所述第i个文档的句子的数量,j为不大于n i的正整数,l为不大于N的正整数。
对第i个文档的n i个句子进行了求和,这样可以将n i个句子的相似度信息联合起来,从而能够体现文档的结构化信息。
在一些可能的实现方式中,所述确定所述第q个句子对应的逆文档频率idf q,包括:根据以下公式确定所述第q个句子对应的逆文档频率idf q
Figure PCTCN2019129423-appb-000008
其中,|*|表示集合的基数。
采用上述逆文档频率,通过句子在文档中对应的频率是否大于零表示该句子是否在该文档中出现过,而且,句子在越多文档中出现过,则逆文档频率越小,句子在越少文档中出现过,则逆文档频率越大,这样可以体现句子的区分性,从而能够提高文档表征的效果。
在一些可能的实现方式中,所述确定所述第i个文档的文档表征,包括:根据以下公式确定所述第i个文档的文档表征z i
Figure PCTCN2019129423-appb-000009
其中,||*||表示2范数,N维向量y i的第q个元素y i,q为,y i,q=x i,q*idf q
将基于相似度得到的句子在文档中对应的频率和句子对应的逆文档频率的乘积作为句子在文档中的权重,基于此获得的文档表征能够反映文档的结构化信息和文档间的结构化信息,从而能够有效的表征文档。
在一些可能的实现方式中,所述方法还包括:根据第一模型,获取句子的句子表征;根据句子的句子表征,确定句子间的相似度;其中,所述第一模型为基于多个词嵌入模型得到的,所述第一模型对多个向量进行连接,得到句子的句子表征,所述多个向量为根据所述多个词嵌入模型获取的句子的词的向量。
不同的嵌入模型可以带来补偿信息,而且不同的P值也可以引入丰富的信息,因此,上述模型能够提高表征的效果。
在一些可能的实现方式中,所述文档为中文文档,所述方法还包括:对句子进行分词处理,得到句子的词。
在一些可能的实现方式中,所述对句子进行分词处理,包括:获取句子的初始词序列;根据通用词典和所述文档所属领域的专业词典,对所述初始词序列进行逆向最大匹配,得到句子中的通用词和所述领域专用的词。
经过上述逆向最大匹配后,可以得到符合通用领域和专业领域理解的分词,从而能够提高分词的质量。
在一些可能的实现方式中,所述方法还包括:根据所述文档表征,进行文档处理。
在一些可能的实现方式中,所述第i个文档为未解决故障对应的故障描述文档;所述根据所述文档表征,进行文档处理,包括:根据所述文档表征,确定所述M个文档中与所述第i个文档相似度最高的参考文档,其中,所述参考文档为已解决故障对应的故障描述文档,所述参考文档对应的故障的解决方案用于处理所述第i个文档对应的故障。
采用本申请实施例的技术方案进行故障处理,可以有效地基于历史文档自动获得解决方案,无需人工进行处理,从而能够节省时间和人工成本,提高处理效率。
在一些可能的实现方式中,所述相似度为余弦相似度。
第二方面,提供了一种文档处理的装置,包括执行上述第一方面或其任意可能的实现方式中的方法的模块。
第三方面,提供了一种文档处理的装置,包括:存储器,用于存储程序;处理器,用于执行所述存储器存储的程序,当所述存储器存储的程序被执行时,所述处理器用于执行上述文档处理的方法。
第四方面,本申请还提供了一种计算机可读存储介质,存储用于设备执行的程序代码,所述程序代码包括用于执行上述文档处理的方法中的步骤的指令。
第五方面,本申请还提供了一种计算机程序产品,所述计算机程序产品 包括存储在计算机可读存储介质上的计算机程序,所述计算机程序包括程序指令,当所述程序指令被计算机执行时,使所述计算机执行上述的文档处理的方法。
附图说明
图1是本申请实施例的一种系统架构的示意图。
图2是本申请实施例的文档处理的过程的示意图。
图3是本申请实施例的文档处理的方法的示意性流程图。
图4是本申请一个实施例的文档处理的装置的示意性框图。
图5是本申请另一个实施例的文档处理的装置的示意性框图。
图6是本申请实施例的文档处理的装置的结构示意图。
附图标记列表:
110,处理装置;
111,处理模块;
112,通信接口;
113,存储系统;
114,预处理模块;
115,模型;
120,数据生成设备;
130,训练设备;
140,数据库;
150,语料库;
201,文档;
202,分词工具;
203,通用词典;
204,初始词序列;
205,专业词典;
206,逆向最大匹配;
207,词;
208,句子嵌入模型;
209,句子向量;
210,文档表征;
211,基于文档表征的处理;
310,根据第i个文档的句子与M个文档的N个句子间的相似度,确定所述N个句子中第q个句子在所述第i个文档中对应的频率x i,q,其中,M和N均为大于1的整数,q为不大于N的正整数,i为正整数;
320,根据所述第q个句子在所述第i个文档中对应的频率x i,q,确定所述第i个文档的文档表征;
400,文档处理的装置;
410,获取单元;
420,文档表征单元;
430,句子表征单元;
440,分词单元;
450,处理单元;
600,文档处理的装置;
601,存储器;
602,处理器;
603,通信接口;
604,总线。
具体实施方式
下面结合附图,对本申请实施例中的技术方案进行描述。应理解,本说明书中的具体的例子只是为了帮助本领域技术人员更好地理解本申请实施例,而非限制本申请实施例的范围。
应理解,在本申请的各种实施例中,各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
还应理解,本说明书中描述的各种实施方式,既可以单独实施,也可以组合实施,本申请实施例对此不作限定。
除非另有说明,本申请实施例所使用的所有技术和科学术语与本申请的技术领域的技术人员通常理解的含义相同。本申请中所使用的术语只是为了描述具体的实施例的目的,不是旨在限制本申请的范围。
图1是本申请实施例的一种系统架构的示意图。
在图1所示的系统架构中,数据生成设备120为产生待处理数据的设备,例如,产生待处理文档的设备。数据生成设备120可以有多个。数据生成设备120产生的文档可以直接传输给处理装置110,也可以先存储到数据库140中,再由处理装置110从数据库140中获取。以故障描述文档为例,数据生成设备120新产生的故障描述文档可以实时传输给处理装置110;而处理装置110从数据库140中获取历史文档。
处理装置110与数据生成设备120通信连接。具体地,处理装置110可以包括通信接口112,以实现与其他设备的通信连接。该通信连接可以是有线方式,也可以是无线方式。
处理装置110可以是具有数据处理能力的电子设备或系统,例如计算机。处理装置110可以包括处理模块111,用于实现数据的处理,例如,采用本申请实施例的技术方案进行文档表征。处理模块111具体可以为一个或多个处理器。处理器可以为任意种类的处理器,本申请实施例对此不作限定。
处理装置110还可以包括存储系统113。存储系统113可用于存储数据和指令,例如,实现本申请实施例的技术方案的计算机可执行指令。处理装置110可以调用存储系统113中的数据、指令等,也可以将数据、指令等存入存储系统113中。存储系统113具体可以为一个或多个存储器。该存储器可以为任意种类的存储器,本申请实施例对此也不作限定。
存储系统113可以设置于处理装置110内,也可以设置于处理装置110外。在存储系统113设置于处理装置110外的情况下,处理装置110可通过数据接口实现对存储系统113的访问。
处理装置110还可以包括其他通用的设备,例如,输出设备,用于向用户输出处理后的数据。
在一些可能的实现方式中,处理模块111可包括预处理模块114,用于对获取的数据进行预处理。例如,对文档进行分词处理。
在一些可能的实现方式中,处理装置110中可以配置训练后的模型115。 在这种情况下,处理模块111可以采用模型115进行相应的处理。
例如,模型115可以为用于句子表征的句子嵌入模型。训练设备130可以基于语料库150中的训练数据训练得到句子嵌入模型。这样,处理模块111可以采用该句子嵌入模型得到句子的表征。
在一些可能的实现方式中,对于输入的文档,可先通过预处理模块114进行分词处理,得到句子的词;再将其输入模型115,得到句子的表征;然后再采用下述的本申请实施例的技术方案进行文档表征。
在一些可能的实现方式中,处理装置110可以同时为数据生成设备。在这种情况下,处理装置110可以产生待处理文档,并处理该文档。
在一些可能的实现方式中,处理装置110可以同时为训练设备。在这种情况下,处理装置110可先训练模型115,再使用模型115,还可以在使用的过程中同时训练模型115。
应理解,图1仅是本申请实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制。
在一些可能的实现方式中,训练设备130训练得到模型115,可以是基于深度学习得到的模型,例如,可以是基于神经网络搭建的模型,这里的神经网络可以是卷积神经网络(convolutional neural networks,CNN)、循环神经网络(recurrent neural network,RNN)、深度卷积神经网络(deep convolutional neural networks,DCNN)等等。
下面结合图2,对本申请实施例的文档处理的方法的主要过程进行简单的介绍。
图2示出了本申请实施例的文档处理的过程的示意图。
在图2中,对于待处理的文档201,可以先通过分词工具202进行分词,得到初始词序列203;然后再通过逆向最大匹配206(inverse maximum matching)得到处理后的词207。上述过程可以成为预处理过程。由于工业领域的文档,会有较多的专用词语,因此可采用从外部数据源得到的通用词典和客户侧配置的文档所属领域的专业词典进行逆向最大匹配。预处理后的词207可以通过句子嵌入模型208得到句子向量209(句子表征)。可选地,句子嵌入模型208可以为预先训练的模型。关于句子嵌入模型将会在下文详细描述。
基于句子向量209,可以进行文档表征,得到文档表征210,文档表征的详细方案将会在下文描述。最后,可以基于文档表征210,进行后续的处理211,例如,文档分析、比较、分类、训练分类器等。
关于文档表征,一种常用方法是词频-逆文档频率(Term Frequency-Inverse Document Frequency,TFIDF)方法。TFIDF方法用以评估一个词对于一个文档集或一个语料库中的其中一份文档的重要程度。词的重要性随着它在文档中出现的次数成正比增加,但同时会随着它在语料库中对应的频率成反比下降。如果某个词在一篇文档中对应的频率高,并且在其他文档中很少出现,则认为此词具有很好的类别区分能力,适合用来分类。TFIDF实际上是:TF*IDF,其中,TF为词频(Term Frequency),IDF为逆向文档频率(Inverse Document Frequency)。TF表示词在文档中对应的频率。IDF的主要思想是:如果包含该词的文档越少,则IDF越大,说明该词具有很好的类别区分能力。具体地,某一词的IDF,可以由总文档数目除以包含该词的文档的数目,再将得到的商取以10为底的对数得到。
TFIDF方法采用词的TFIDF值形成文档的表征(向量)。因此,TFIDF方法中向量的维度由词的数量决定。然而,由于文档中词的数量比较大,该方法容易遭遇维度灾难的问题,影响了文档表征的效率。
有鉴于此,本申请实施例提供了一种改进的技术方案,基于句子的相似度,得到句子的″TFIDF″。区别于上述词的TFIDF,本申请实施例的技术方案可以称为伪TFIDF文档表征方法。下面对本申请实施例的技术方案进行具体描述。
图3示出了本申请实施例的文档处理的方法的示意性流程图。该方法可以由图1中的处理装置110执行。
310,根据第i个文档的句子与M个文档的N个句子间的相似度,确定所述N个句子中第q个句子在所述第i个文档中对应的频率x i,q,其中,M和N均为大于1的整数,q为不大于N的正整数,i为正整数。
在一个实施例中,当i不大于M时,所述第i个文档为所述M个文档中的文档。
在另一个实施例中,当i大于M时,所述第i个文档为所述M个文档外的文档。
在进行文档表征时,可在首次表征时对多个文档一同进行表征;后续出现新的文档时,可只需基于已表征过的文档对新的文档进行表征(例如,新的文档较少的情况),也可以将新的文档和已表征过的文档重新一同进行表征(例如,新的文档较多的情况)。
相应地,本申请实施例的技术方案包括两种情况下的技术方案,一种情况为对M个文档一同进行表征,即i不大于M时,对M个文档中的第i个文档进行表征;另一种情况为基于已表征过的M个文档对新的文档进行表征,即i大于M时,对M个文档外的文档进行表征。
上述所涉及的文档可以为特定领域的一类文档。例如,对于工业领域的故障处理系统,其可以为同一类设备的故障描述文档。该故障描述文档中可以包括故障的事件描述,对于已处理的故障,还可以包括故障的原因、解决方案或者分类等。在文档表征时,可只关注各文档都有的部分,如故障的事件描述。
例如,第i个故障描述文档可以表示为D i,或者,(D i,o i),其中D i表示故障的事件描述部分,o i表示故障对应的原因、解决方案或者分类等其他部分。对于未解决的故障,文档为前者;对于已解决的故障,文档包括上述两部分。在文档表征时,可只关注D i。在后续的应用中,可再根据D i获取对应的o i
应理解,本申请实施例的技术方案可以应用于各种文档,并不限定于上述举例。另外,本申请实施例对文档的语言种类也不限定,例如,可以是中文,也可以是非中文。在一些可选地实施例中,针对中文文档进一步给出了相应的技术方案。
在本申请实施例中,基于句子在文档中对应的频率进行文档表征,其中,句子在文档中对应的频率,通过句子间的相似度获取。也就是说,该频率并不是由句子在文档中真正出现的次数得到的,而是利用句子间的相似度得到的权重值。
可选地,所述第q个句子在所述第i个文档中对应的频率x i,q为所述第q个句子与所述第i个文档的每个句子的相似度的和。也就是说,一个句子在一个文档中对应的频率为该句子与该文档中所有句子的相似度的和。
可选地,所述相似度为余弦相似度,但本申请实施例对此并不限定。
余弦相似度,又称为余弦相似性,是通过计算两个向量的夹角余弦值来评估它们的相似度。对于文档来说,每个句子可以表征为一个多维向量,两个句子的余弦相似度即为表征这两个句子的两个向量的夹角余弦值。
Optionally, in the embodiments of the present application, to reduce the amount of computation, the K-nearest neighbor (KNN) algorithm may be used. That is, the frequency of a sentence in a document may be determined from the K sentences most similar to that sentence in the sentence space (the above N sentences), where K is a positive integer smaller than N.
The K sentences most similar to a sentence are the K nearest neighbors of that sentence in the sentence space; that is, these K sentences have the highest similarity to that sentence.
In the embodiments of the present application, the sentence space consists of the N sentences of the M documents. When the M documents are represented together, the sentences of the i-th document are among the N sentences, and their K most similar sentences among the N sentences are used to determine the frequency x_{i,q}. When a new document outside the M documents is represented, the sentences of the i-th document are not among the N sentences, but their K most similar sentences among the N sentences are still used to determine the frequency x_{i,q}.
Optionally, the value of K may be predetermined based on the total number of sentences, and may also be adjusted continuously. For example, K may be positively correlated with N; that is, when N is large, K may take a larger value. For another example, if the K sentences already include sentences with rather low similarity, the value of K may be decreased; otherwise, it may be increased. As an example, K may range from 5 to 30, but the embodiments of the present application are not limited thereto.
The similarity between a sentence and each sentence of a document may be set using the KNN algorithm. Specifically, if the sentence belongs to the K sentences most similar to some sentence of the document, the similarity of these two sentences is used; otherwise, the similarity may be set to zero.
For ease of description, the similarity after applying the KNN algorithm is denoted the K-nearest-neighbor similarity. Accordingly, if the q-th sentence belongs to the K sentences most similar to the j-th sentence of the i-th document, the K-nearest-neighbor similarity between the q-th sentence and the j-th sentence of the i-th document is the similarity between the q-th sentence and the j-th sentence of the i-th document; if the q-th sentence does not belong to the K sentences most similar to the j-th sentence of the i-th document, the K-nearest-neighbor similarity between the q-th sentence and the j-th sentence of the i-th document is zero, where j is a positive integer not greater than n_i and n_i is the number of sentences of the i-th document.
In this case, the frequency x_{i,q} of the q-th sentence in the i-th document is the sum of the K-nearest-neighbor similarities between the q-th sentence and each sentence of the i-th document.
Optionally, the frequency of a sentence in a document may be determined in the following way.
For the j-th sentence of the i-th document, denoted $s_j^i$, obtain the K sentences most similar to $s_j^i$.
Introduce an N-dimensional vector $o_j^i$, where, if the l-th sentence among the N sentences belongs to the K sentences most similar to $s_j^i$, the l-th element of $o_j^i$ is the similarity between the l-th sentence and $s_j^i$ (or the l-th element is 1); otherwise, the l-th element of $o_j^i$ is zero.
For the i-th document,

$$x_i = \sum_{j=1}^{n_i} o_j^i \qquad (1)$$

The q-th element $x_{i,q}$ of the N-dimensional vector $x_i$ is the frequency of the q-th sentence among the N sentences in the i-th document. Here n_i is the number of sentences of the i-th document, j is a positive integer not greater than n_i, and l is a positive integer not greater than N.
In formula (1), the n_i sentences of the i-th document are summed over, which combines the similarity information of the n_i sentences and can thus reflect the structural information of the document.
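A rough Python sketch of formula (1), assuming sentence vectors are already available; `pseudo_tf`, `sent_vecs`, and `doc_of_sent` are illustrative names, not part of the original disclosure:

```python
import numpy as np

def pseudo_tf(sent_vecs, doc_of_sent, M, K):
    """Sketch of formula (1): x_i = sum_j o_j^i.

    sent_vecs   : (N, d) array, one row per sentence in the corpus.
    doc_of_sent : length-N sequence; doc_of_sent[l] is the index i of the
                  document that sentence l belongs to (illustrative layout).
    Returns x   : (M, N) matrix with x[i, q] = pseudo term frequency of
                  sentence q in document i.
    """
    N = sent_vecs.shape[0]
    unit = sent_vecs / np.linalg.norm(sent_vecs, axis=1, keepdims=True)
    sim = unit @ unit.T                    # pairwise cosine similarities
    x = np.zeros((M, N))
    for j in range(N):                     # j plays the role of s_j^i
        nbrs = np.argsort(-sim[j])[:K]     # K most similar sentences (incl. itself)
        o = np.zeros(N)                    # the N-dimensional vector o_j^i
        o[nbrs] = sim[j, nbrs]             # l-th entry = similarity if l is a neighbor
        x[doc_of_sent[j]] += o             # accumulate into x_i, formula (1)
    return x

# Toy usage: 4 sentences in 2 documents, K = 2.
vecs = np.array([[1.0, 0.0, 0.0],
                 [0.9, 0.1, 0.0],
                 [0.0, 1.0, 0.0],
                 [0.0, 0.9, 0.1]])
x = pseudo_tf(vecs, doc_of_sent=[0, 0, 1, 1], M=2, K=2)
```

For a new document outside the M documents (i greater than M), the same loop would run over the new document's sentences, with neighbors still searched among the N sentences of the corpus.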
320: According to the frequency x_{i,q} of the q-th sentence in the i-th document, determine the document representation of the i-th document.
Through the procedure in 310, the frequency of the q-th sentence in each of the M documents can be obtained. The inverse document frequency idf_q of the q-th sentence can then be further determined from the frequencies of the q-th sentence in the M documents.
That is, after the frequency of a sentence in a document is obtained based on similarity, the inverse document frequency of the sentence can be obtained next.
The inverse document frequency idf_q of the q-th sentence may be negatively correlated with the number of documents among the M documents that contain the q-th sentence. That is, the more documents among the M documents contain the q-th sentence, the smaller the inverse document frequency idf_q; the fewer documents among the M documents contain the q-th sentence, the larger the inverse document frequency idf_q. In this way, the inverse document frequency idf_q can reflect the discriminative power of a sentence.
In one embodiment, if the frequency of the q-th sentence in the p-th document among the M documents is greater than zero, the p-th document contains the q-th sentence, where p is a positive integer not greater than M. That is, whether a document contains a sentence can be judged by whether the frequency of the sentence in that document is greater than zero.
Optionally, on the basis of formula (1), the inverse document frequency idf_q of the q-th sentence may be determined according to the following formula:

$$idf_q = \log_{10} \frac{M}{\left| \{\, i \in \{1, \ldots, M\} : x_{i,q} > 0 \,\} \right|} \qquad (2)$$

where |*| denotes the cardinality of a set.
In formula (2), x_{i,q} > 0 indicates that the q-th sentence appears in the i-th document. Accordingly, formula (2) shows that the more documents the q-th sentence appears in, the smaller the inverse document frequency idf_q; the fewer documents it appears in, the larger the inverse document frequency idf_q. Therefore, the inverse document frequency idf_q can reflect the discriminative power of a sentence.
After the frequency x_{i,q} of the q-th sentence in the i-th document and the inverse document frequency idf_q of the q-th sentence are obtained, the document representation of the i-th document can next be determined from the frequency x_{i,q} of the q-th sentence in the i-th document and the inverse document frequency idf_q of the q-th sentence.
As can be seen from the above, the larger the inverse document frequency of a sentence, the more discriminative the sentence. Therefore, the inverse document frequency of a sentence can be combined with the frequency of the sentence in the document to express the weight of the sentence in the document, from which the representation of the document is then obtained.
Optionally, the product of the frequency of a sentence in the document and the inverse document frequency of the sentence may be used as the weight of the sentence in the document.
For example, on the basis of formulas (1) and (2), the weight of the q-th sentence in the i-th document may be

$$y_{i,q} = x_{i,q} \cdot idf_q \qquad (3)$$

where y_{i,q} is the q-th element of the N-dimensional vector y_i corresponding to the N sentences. Normalizing y_i yields the document representation z_i of the i-th document:

$$z_i = \frac{y_i}{\| y_i \|} \qquad (4)$$

where ||*|| denotes the 2-norm.
In the above way, an N-dimensional vector representation of the i-th document can be obtained. That is, the dimension of the document representation obtained by the technical solution of the embodiments of the present application equals the number of sentences. Compared with representations whose dimension is the number of words, the data dimension of the technical solution of the embodiments of the present application is lower, which reduces the amount of computation and improves the efficiency of document representation.
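Continuing the sketch above, formulas (2) to (4) could be computed from the pseudo-TF matrix x as follows (again, the names are illustrative):

```python
import numpy as np

def pseudo_tfidf(x):
    """Sketch of formulas (2)-(4); x is the (M, N) pseudo-TF matrix."""
    M = x.shape[0]
    # (2): a document "contains" sentence q whenever x[i, q] > 0.
    # Every corpus sentence is its own nearest neighbor, so each such
    # column has at least one positive entry and the count is nonzero.
    doc_count = np.count_nonzero(x > 0, axis=0)
    idf = np.log10(M / doc_count)
    y = x * idf                                        # (3): y_iq = x_iq * idf_q
    z = y / np.linalg.norm(y, axis=1, keepdims=True)   # (4): 2-norm normalization
    return z

# Toy usage with a hand-made 2x4 pseudo-TF matrix.
x = np.array([[1.9, 1.9, 0.0, 0.0],
              [0.0, 0.0, 1.9, 1.9]])
z = pseudo_tfidf(x)
```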
As an example, when the M documents are represented together (that is, i is not greater than M), document representation may use the following algorithm:

Algorithm: Pseudo TFIDF Document Representation
Input:
    sentence vectors of the documents $D_i$, or of $(D_i, o_i)$ if $o_i$ exists;
    an integer K for KNN;
Output:
    representation feature matrix z for all documents;
Comment:
    $D_i$: the i-th document;
    $s_j^i$: the j-th sentence of $D_i$;
    M: number of documents;
    $n_i$: number of sentences in the i-th document;
    N: total number of sentences, $N = \sum_{i=1}^{M} n_i$;
Process:
    step 1: Sequentially assign an index number index($s_j^i$) from 1 to N to every sentence;
    step 2: for every $s_j^i$ do:
        find the K nearest neighbors in the sentence space via a distance metric, such as cosine similarity;
        introduce a vector $o_j^i$ to represent this sentence, where the l-th entry of $o_j^i$ is the cosine similarity if index(s) == l and s is among the K nearest neighbors of $s_j^i$;
    step 3: for every $D_i$ do:
        the sparse representation of $D_i$ is $x_i = \sum_{j=1}^{n_i} o_j^i$;
    step 4: let $x_{i,q}$ denote the q-th component of $x_i$; then define
        $idf_q = \log_{10} \frac{M}{\left| \{\, i : x_{i,q} > 0 \,\} \right|}$;
    step 5: for every $D_i$ do:
        update $y_i$ by $y_{i,q} = x_{i,q} \cdot idf_q$;
        normalize $z_i = y_i / \| y_i \|$;
    step 6: based on $\{z_i\}$, perform the clustering task; or train a classifier for the classification task using $\{(z_i, o_i)\}$.
For example, a fault description document D includes three sentences s1, s2, s3, with corresponding sentence vectors v1, v2, v3. Through KNN, the K (set to 3 for ease of understanding) most similar sentences of s1, s2, s3 are {s1, s4, s6}, {s2, s6, s10}, and {s3, s7, s6}, respectively. Suppose the corresponding similarities are {1, 0.8, 0.75}, {1, 0.90, 0.88}, {1, 0.95, 0.82}; then according to formula (1), the pseudo-TF representation of document D is [1, 1, 1, 0.8, 0, 2.47, 0.95, 0, 0, 0.88, ..., 0, 0, 0, 0, 0]_N. According to the above algorithm, the pseudo-TFIDF document representation can then be obtained.
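The numbers in this example can be reproduced with a few lines; note that the neighbor index sets below are reconstructed from the given similarity values and the resulting vector, so they are an assumption:

```python
import numpy as np

N = 15                                    # illustrative sentence-space size
x = np.zeros(N)                           # pseudo-TF vector of document D
neighbors = {                             # sentence -> (K=3 neighbor indices, similarities)
    1: ([1, 4, 6], [1.0, 0.80, 0.75]),
    2: ([2, 6, 10], [1.0, 0.90, 0.88]),
    3: ([3, 7, 6], [1.0, 0.95, 0.82]),
}
for idx, sims in neighbors.values():
    for l, s in zip(idx, sims):
        x[l - 1] += s                     # 1-based sentence indices
print(x[:10])  # approx. [1, 1, 1, 0.8, 0, 2.47, 0.95, 0, 0, 0.88]
```

The entry 2.47 at position 6 illustrates the summation in formula (1): sentence s6 is a neighbor of all three sentences, contributing 0.75 + 0.90 + 0.82.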
With the technical solution of the embodiments of the present application, document representation is performed using the similarity-based frequencies of sentences in documents and the inverse document frequencies of sentences. The obtained document representation can reflect the structural information within a document and between documents, and the representation is sparse, so documents can be represented effectively.
In addition, compared with the current TFIDF representation scheme, the dimensionality of the pseudo-TFIDF representation scheme of the embodiments of the present application is greatly reduced. Therefore, fast document processing can be achieved without much computing and storage resource, which improves the efficiency of document processing.
Furthermore, compared with document representation schemes based on deep learning, the document representation scheme of the embodiments of the present application does not require a large amount of training data, thereby avoiding the lack of training data caused by the confidentiality of industrial documents.
Optionally, for each document, the representation of each sentence, that is, the vector of each sentence, may be obtained in various ways, and the representation of the document is then obtained in the above manner.
Optionally, as an embodiment of the present application, sentence representations may be obtained according to a first model (a sentence embedding model).
The sentence embedding model may be a pre-trained model. For example, the sentence embedding model may be pre-trained by deep learning methods.
Optionally, instead of training a sentence embedding model, a sentence embedding model based on multiple word embedding models may be used directly.
Optionally, as an embodiment of the present application, the first model may be obtained based on multiple word embedding models.
Specifically, the simplest way to obtain a sentence embedding is to average the word embeddings of all words in the sentence, but the effect is rather poor. To improve the effect, the averaging operation over word embeddings can be generalized to a class of operations called the power mean (p-mean), so that different values of p produce different features.
Applying the p-mean to word vectors u_i can be expressed as:

$$\left( \frac{u_1^p + u_2^p + \cdots + u_n^p}{n} \right)^{1/p} \qquad (5)$$

When p = 1, the p-mean is the averaging operation; when p = +∞, it is the maximum (max) operation; and when p = -∞, it is the minimum (min) operation. The above three operations (average, max, and min) can be used together to improve the effect.
In addition, to further improve the effect, multiple word embedding models may be used, for example, word2vec, fasttext, glove, and so on: the p-mean operation is performed on the embeddings (vectors) of each model, and the results are then concatenated.
For example, the following formulas may be used for sentence representation (sentence embedding):

$$s_p = \left( \frac{1}{n}\sum_{i=1}^{n} u_i^p \right)^{1/p} \oplus \left( \frac{1}{n}\sum_{i=1}^{n} v_i^p \right)^{1/p} \oplus \left( \frac{1}{n}\sum_{i=1}^{n} w_i^p \right)^{1/p} \qquad (6)$$

$$p \in \{1, \pm\infty\}, \quad (u, v, w) \in (\text{word2vec}, \text{fasttext}, \text{glove})$$

$$s = \mathop{\oplus}_{p} s_p \qquad (7)$$

In the above formulas, u, v, w denote the word vectors obtained from the different word embedding models, s_p denotes the sentence vector for p ∈ {1, ±∞}, n denotes the number of words in the sentence, and the symbol ⊕ concatenates the different s_p into the joint sentence vector.
For example, suppose the dimension of each word embedding model is 100. According to formula (6), the dimensions of s_1, s_{+∞}, s_{-∞} are all 300; according to formula (7), vector concatenation yields a 900-dimensional vector s.
Based on the multiple word embedding models above, a sentence embedding model can be obtained. Multiple vectors of the words of a sentence are obtained according to the multiple word embedding models, and the sentence embedding model concatenates these vectors to obtain the sentence representation (vector) of the sentence. In this way, when representing a document, the sentence representation of each sentence can be obtained according to the sentence embedding model. Then, the similarity between sentences is determined from the sentence representations, and the representation of the document is obtained in the aforementioned manner.
Different embedding models can bring complementary information, and different values of p can also introduce rich information. Therefore, the above concatenated p-mean model can improve the effect of the representation.
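A minimal sketch of the concatenated p-mean sentence embedding of formulas (5) to (7), assuming the word-embedding lookups are given as plain dictionaries standing in for word2vec/fasttext/glove tables (all names are illustrative):

```python
import numpy as np

def p_mean(word_vecs, p):
    """Power mean of word vectors along the word axis, formula (5).

    Only p in {1, +inf, -inf} is exercised here; arbitrary real p would be
    ill-defined for negative embedding entries.
    """
    if p == float("inf"):
        return word_vecs.max(axis=0)
    if p == float("-inf"):
        return word_vecs.min(axis=0)
    return (np.mean(word_vecs ** p, axis=0)) ** (1.0 / p)

def sentence_embedding(words, models, ps=(1, float("inf"), float("-inf"))):
    """Concatenated p-mean over several word-embedding models, formulas (6)-(7)."""
    parts = []
    for p in ps:                              # outer concatenation of formula (7)
        for lookup in models:                 # inner concatenation of formula (6)
            vecs = np.stack([lookup[w] for w in words])
            parts.append(p_mean(vecs, p))
    return np.concatenate(parts)              # e.g. 3 models x 3 p-values x 100 dims = 900

models = [{"pump": np.array([0.1, 0.3]), "leaks": np.array([0.4, -0.2])}]
print(sentence_embedding(["pump", "leaks"], models).shape)  # (6,): 1 model x 3 p x 2 dims
```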
For some languages, for example Chinese, sentences need to be segmented to obtain the words of the sentence. The quality of word segmentation affects the subsequent processing. Documents in the industrial field contain many specialized terms that common segmentation methods have difficulty recognizing, which degrades the quality of segmentation.
In view of this, in an embodiment of the present application, sentences may be segmented in the following way:
obtain the initial word sequence of the sentence; according to a general-purpose dictionary and a domain-specific dictionary of the field the document belongs to, perform inverse maximum matching on the initial word sequence, obtaining the general-purpose words and the domain-specific words in the sentence.
Specifically, the initial word sequence of a sentence may be obtained by an existing segmentation tool, for example jieba; inverse maximum matching is then performed on the initial word sequence according to an automatically built dictionary, where the automatically built dictionary includes a general-purpose dictionary obtained from external data sources and a domain-specific dictionary, configured on the customer side, for the field the documents belong to. After the above inverse maximum matching, a segmentation consistent with both general and professional domain understanding can be obtained, which improves the quality of word segmentation. A simplified sketch of inverse maximum matching is given after this paragraph.
The words of sentences obtained in the above way can be used for the aforementioned sentence representation and document representation, thereby improving the efficiency of Chinese document representation.
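A simplified, character-level sketch of inverse maximum matching; the original applies it to the initial word sequence from jieba, and the merged dictionary here is a hypothetical stand-in for the combined general-purpose and domain dictionaries:

```python
def reverse_maximum_match(sentence, dictionary, max_len=6):
    """Scan from the end of the sentence, always taking the longest
    dictionary word that ends at the current position."""
    words, end = [], len(sentence)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            piece = sentence[end - size:end]
            if size == 1 or piece in dictionary:  # fall back to single characters
                words.append(piece)
                end -= size
                break
    return list(reversed(words))

vocab = {"汽轮机", "轴承", "温度", "过高"}
print(reverse_maximum_match("汽轮机轴承温度过高", vocab))
# ['汽轮机', '轴承', '温度', '过高']
```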
Optionally, after the document representation of a document is obtained by the foregoing method, document processing may subsequently be performed according to the document representation, for example, document analysis, comparison, classification, or training a classifier.
Optionally, in an embodiment of the present application, for fault description documents, document processing related to fault handling may be performed according to the document representations.
Specifically, when the i-th document is a fault description document corresponding to an unresolved fault, the reference document with the highest similarity to the i-th document among the M documents may be determined according to the document representations, where the reference document is a fault description document corresponding to a resolved fault, and the solution of the fault corresponding to the reference document is used to handle the fault corresponding to the i-th document. That is, for a newly occurring fault, the most similar historical document can be found based on its document, and the newly occurring fault is handled according to the solution of the fault corresponding to that historical document.
Handling faults with the technical solution of the embodiments of the present application can effectively obtain solutions automatically based on historical documents, without manual processing, thereby saving time and labor costs and improving processing efficiency.
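A minimal retrieval sketch under these assumptions: the representations are already computed and 2-norm normalized, and `solutions[i]` stands in for the solution part o_i of the i-th resolved document (all names illustrative):

```python
import numpy as np

def most_similar_document(z_new, z_hist, solutions):
    """Return the index of the historical document most similar to a new
    fault description, together with its attached solution.

    Rows are assumed 2-norm normalized, so a dot product equals cosine similarity.
    """
    scores = z_hist @ z_new          # cosine similarity to every historical document
    best = int(np.argmax(scores))
    return best, solutions[best]

z_hist = np.array([[1.0, 0.0], [0.0, 1.0]])
print(most_similar_document(np.array([0.8, 0.6]), z_hist,
                            ["replace seal", "recalibrate sensor"]))
# (0, 'replace seal')
```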
The method embodiments of the embodiments of the present application are described in detail above; the apparatus embodiments of the embodiments of the present application are described below. The apparatus embodiments correspond to the method embodiments, so for the parts not described in detail, reference may be made to the preceding method embodiments; the apparatus can implement any possible implementation of the above method.
Figure 4 shows a schematic block diagram of an apparatus 400 for document processing according to an embodiment of the present application. The apparatus 400 can execute the above document processing method of the embodiments of the present application; for example, the apparatus 400 may be the aforementioned processing apparatus 110.
As shown in Figure 4, the apparatus 400 may include:
an acquisition unit 410, configured to acquire M documents;
a document representation unit 420, configured to determine, according to the similarity between the sentences of the i-th document and the N sentences of the M documents, the frequency x_{i,q} of the q-th sentence among the N sentences in the i-th document, where M and N are both integers greater than 1, q is a positive integer not greater than N, and i is a positive integer; and to determine the document representation of the i-th document according to the frequency x_{i,q} of the q-th sentence in the i-th document.
Optionally, in an embodiment of the present application, when i is not greater than M, the i-th document is a document among the M documents.
Optionally, in an embodiment of the present application, when i is greater than M, the i-th document is a document outside the M documents.
Optionally, in an embodiment of the present application, the document representation unit 420 is specifically configured to:
determine the inverse document frequency idf_q of the q-th sentence according to the frequencies of the q-th sentence in the M documents;
determine the document representation of the i-th document according to the frequency x_{i,q} of the q-th sentence in the i-th document and the inverse document frequency idf_q of the q-th sentence.
Optionally, in an embodiment of the present application, the inverse document frequency idf_q of the q-th sentence is negatively correlated with the number of documents among the M documents that contain the q-th sentence.
Optionally, in an embodiment of the present application, if the frequency of the q-th sentence in the p-th document among the M documents is greater than zero, the p-th document contains the q-th sentence, where p is a positive integer not greater than M.
Optionally, in an embodiment of the present application, the frequency x_{i,q} of the q-th sentence in the i-th document is the sum of the similarities between the q-th sentence and each sentence of the i-th document.
Optionally, in an embodiment of the present application, the document representation unit 420 is specifically configured to:
determine the frequency x_{i,q} of the q-th sentence in the i-th document according to the K sentences among the N sentences that are most similar to the sentences of the i-th document, where K is a positive integer smaller than N.
Optionally, in an embodiment of the present application, K is positively correlated with N.
Optionally, in an embodiment of the present application, the frequency x_{i,q} of the q-th sentence in the i-th document is the sum of the K-nearest-neighbor similarities between the q-th sentence and each sentence of the i-th document;
where, if the q-th sentence belongs to the K sentences most similar to the j-th sentence of the i-th document, the K-nearest-neighbor similarity between the q-th sentence and the j-th sentence of the i-th document is the similarity between the q-th sentence and the j-th sentence of the i-th document;
if the q-th sentence does not belong to the K sentences most similar to the j-th sentence of the i-th document, the K-nearest-neighbor similarity between the q-th sentence and the j-th sentence of the i-th document is zero, where j is a positive integer not greater than n_i and n_i is the number of sentences of the i-th document.
Optionally, in an embodiment of the present application, the frequency x_{i,q} of the q-th sentence in the i-th document is the q-th element of the N-dimensional vector x_i,
where

$$x_i = \sum_{j=1}^{n_i} o_j^i$$

and, for the N-dimensional vector $o_j^i$: if the l-th sentence among the N sentences belongs to the K sentences most similar to the sentence $s_j^i$, the l-th element of $o_j^i$ is the similarity between the l-th sentence and $s_j^i$; otherwise, the l-th element of $o_j^i$ is zero, where $s_j^i$ is the j-th sentence of the i-th document, n_i is the number of sentences of the i-th document, j is a positive integer not greater than n_i, and l is a positive integer not greater than N.
Optionally, in an embodiment of the present application, the document representation unit 420 is specifically configured to:
determine the inverse document frequency idf_q of the q-th sentence according to the following formula:

$$idf_q = \log_{10} \frac{M}{\left| \{\, i \in \{1, \ldots, M\} : x_{i,q} > 0 \,\} \right|}$$

where |*| denotes the cardinality of a set.
Optionally, in an embodiment of the present application, the document representation unit 420 is specifically configured to:
determine the document representation z_i of the i-th document according to the following formula:

$$z_i = \frac{y_i}{\| y_i \|}$$

where ||*|| denotes the 2-norm, and the q-th element y_{i,q} of the N-dimensional vector y_i is $y_{i,q} = x_{i,q} \cdot idf_q$.
Optionally, as shown in Figure 5, the apparatus 400 may further include:
a sentence representation unit 430, configured to obtain sentence representations of sentences according to a first model;
where the document representation unit is further configured to determine the similarity between sentences according to the sentence representations of the sentences;
the first model is obtained based on multiple word embedding models, and the first model concatenates multiple vectors to obtain the sentence representation of a sentence, the multiple vectors being vectors of the words of the sentence obtained according to the multiple word embedding models.
Optionally, in an embodiment of the present application, the documents are Chinese documents; as shown in Figure 5, the apparatus 400 may further include:
a word segmentation unit 440, configured to perform word segmentation on sentences, obtaining the general-purpose words and the domain-specific words in the sentences.
Optionally, in an embodiment of the present application, the word segmentation unit 440 is specifically configured to:
obtain the initial word sequence of a sentence;
according to a general-purpose dictionary and a domain-specific dictionary of the field the documents belong to, perform inverse maximum matching on the initial word sequence, obtaining the words of the sentence.
Optionally, in an embodiment of the present application, as shown in Figure 5, the apparatus 400 further includes:
a processing unit 450, configured to perform document processing according to the document representations.
Optionally, in an embodiment of the present application, the i-th document is a fault description document corresponding to an unresolved fault;
the processing unit 450 is specifically configured to:
determine, according to the document representations, the reference document with the highest similarity to the i-th document among the M documents, where the reference document is a fault description document corresponding to a resolved fault, and the solution of the fault corresponding to the reference document is used to handle the fault corresponding to the i-th document.
Optionally, in an embodiment of the present application, the similarity is cosine similarity.
Figure 6 is a schematic diagram of the hardware structure of an apparatus for document processing according to an embodiment of the present application. The apparatus 600 for document processing shown in Figure 6 includes a memory 601, a processor 602, a communication interface 603, and a bus 604, where the memory 601, the processor 602, and the communication interface 603 are communicatively connected to each other through the bus 604.
The memory 601 may be a read-only memory (ROM), a static storage device, or a random access memory (RAM). The memory 601 may store a program; when the program stored in the memory 601 is executed by the processor 602, the processor 602 and the communication interface 603 are used to execute the steps of the document processing method of the embodiments of the present application.
The processor 602 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, used to execute related programs to implement the functions required of the units in the apparatus for document processing of the embodiments of the present application, or to execute the document processing method of the embodiments of the present application.
The processor 602 may also be an integrated circuit chip with signal processing capability. During implementation, the steps of the document processing method of the embodiments of the present application may be completed by integrated logic circuits of hardware in the processor 602 or by instructions in the form of software.
The processor 602 may also be a general-purpose processor, a digital signal processor (DSP), an ASIC, a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly embodied as being executed and completed by a hardware processor, or executed and completed by a combination of hardware and software modules in a processor. The software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register. The storage medium is located in the memory 601; the processor 602 reads the information in the memory 601 and, in combination with its hardware, completes the functions required of the units included in the apparatus for document processing of the embodiments of the present application, or executes the document processing method of the embodiments of the present application.
The communication interface 603 uses a transceiver apparatus such as, but not limited to, a transceiver to implement communication between the apparatus 600 and other devices or communication networks. For example, the documents to be represented may be acquired through the communication interface 603.
The bus 604 may include a path for transferring information between the components of the apparatus 600 (for example, the memory 601, the processor 602, and the communication interface 603).
It should be noted that although the apparatus 600 above shows only a memory, a processor, and a communication interface, in a specific implementation process, those skilled in the art should understand that the apparatus 600 may also include other devices necessary for normal operation. Meanwhile, according to specific needs, those skilled in the art should understand that the apparatus 600 may also include hardware devices implementing other additional functions. In addition, those skilled in the art should understand that the apparatus 600 may also include only the devices necessary to implement the embodiments of the present application, rather than all the devices shown in Figure 6.
An embodiment of the present application further provides a computer-readable storage medium storing program code for execution by a device, the program code including instructions for executing the steps in the above document processing method.
An embodiment of the present application further provides a computer program product, the computer program product including a computer program stored on a computer-readable storage medium, the computer program including program instructions that, when executed by a computer, cause the computer to execute the above document processing method.
The above computer-readable storage medium may be a transitory computer-readable storage medium or a non-transitory computer-readable storage medium.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working process of the apparatus described above, reference may be made to the corresponding process in the foregoing method embodiments, which will not be repeated here.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for example, the division of the units is merely a logical function division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The terms used in the present application are used only to describe the embodiments and are not intended to limit the claims. As used in the description of the embodiments and the claims, the singular forms "a" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Similarly, the term "and/or" as used in the present application refers to and encompasses any and all possible combinations of one or more of the associated listed items. In addition, when used in the present application, the term "comprise" refers to the presence of the stated features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups of these.
The aspects, implementations, realizations, or features in the described embodiments can be used alone or in any combination. The aspects in the described embodiments may be implemented by software, hardware, or a combination of software and hardware. The described embodiments may also be embodied by a computer-readable medium storing computer-readable code, the computer-readable code including instructions executable by at least one computing apparatus. The computer-readable medium may be associated with any data storage apparatus capable of storing data that can be read by a computer system. Examples of computer-readable media may include read-only memory, random access memory, compact disc read-only memory (CD-ROM), hard disk drive (HDD), digital video disc (DVD), magnetic tape, and optical data storage apparatuses. The computer-readable medium may also be distributed over computer systems connected through a network, so that the computer-readable code is stored and executed in a distributed manner.
The above technical description may refer to the accompanying drawings, which form a part of the present application and in which implementations in accordance with the described embodiments are shown by way of description. Although these embodiments are described in sufficient detail to enable those skilled in the art to implement them, these embodiments are non-limiting; other embodiments may be used, and changes may be made without departing from the scope of the described embodiments. For example, the order of operations described in a flowchart is non-limiting, so the order of two or more operations illustrated in and described according to a flowchart may be changed according to several embodiments. As another example, in several embodiments, one or more operations illustrated in and described according to a flowchart are optional or may be deleted. In addition, certain steps or functions may be added to the disclosed embodiments, or the order of two or more steps may be permuted. All these changes are considered to be encompassed by the disclosed embodiments and the claims.
In addition, terms are used in the above technical description to provide a thorough understanding of the described embodiments. However, excessive detail is not required to implement the described embodiments. Therefore, the above description of the embodiments is presented for explanation and description. The embodiments presented in the above description, and the examples disclosed according to these embodiments, are provided separately to add context and facilitate understanding of the described embodiments. The above description is not intended to be exhaustive or to limit the described embodiments to the precise form of the present application. Several modifications, adaptations, and variations are possible in light of the above teachings. In some cases, well-known processing steps have not been described in detail so as not to unnecessarily obscure the described embodiments.
The above are merely specific implementations of the embodiments of the present application, but the protection scope of the embodiments of the present application is not limited thereto. Any person skilled in the art can easily think of changes or substitutions within the technical scope disclosed by the embodiments of the present application, which shall all be covered within the protection scope of the embodiments of the present application. Therefore, the protection scope of the embodiments of the present application shall be subject to the protection scope of the claims.

Claims (40)

  1. A document processing method, characterized by comprising:
    determining, according to the similarity between the sentences of an i-th document and N sentences of M documents, the frequency x_{i,q} of the q-th sentence among the N sentences in the i-th document, wherein M and N are both integers greater than 1, q is a positive integer not greater than N, and i is a positive integer;
    determining the document representation of the i-th document according to the frequency x_{i,q} of the q-th sentence in the i-th document.
  2. The method according to claim 1, characterized in that, when i is not greater than M, the i-th document is a document among the M documents.
  3. The method according to claim 1, characterized in that, when i is greater than M, the i-th document is a document outside the M documents.
  4. The method according to any one of claims 1 to 3, characterized in that the determining the document representation of the i-th document comprises:
    determining the inverse document frequency idf_q of the q-th sentence according to the frequencies of the q-th sentence in the M documents;
    determining the document representation of the i-th document according to the frequency x_{i,q} of the q-th sentence in the i-th document and the inverse document frequency idf_q of the q-th sentence.
  5. The method according to claim 4, characterized in that the inverse document frequency idf_q of the q-th sentence is negatively correlated with the number of documents among the M documents that contain the q-th sentence.
  6. The method according to claim 5, characterized in that, if the frequency of the q-th sentence in the p-th document among the M documents is greater than zero, the p-th document contains the q-th sentence, wherein p is a positive integer not greater than M.
  7. The method according to any one of claims 1 to 6, characterized in that the frequency x_{i,q} of the q-th sentence in the i-th document is the sum of the similarities between the q-th sentence and each sentence of the i-th document.
  8. The method according to any one of claims 1 to 7, characterized in that the determining the frequency x_{i,q} of the q-th sentence among the N sentences in the i-th document comprises:
    determining the frequency x_{i,q} of the q-th sentence in the i-th document according to the K sentences among the N sentences that are most similar to the sentences of the i-th document, wherein K is a positive integer smaller than N.
  9. The method according to claim 8, characterized in that K is positively correlated with N.
  10. The method according to claim 8 or 9, characterized in that the frequency x_{i,q} of the q-th sentence in the i-th document is the sum of the K-nearest-neighbor similarities between the q-th sentence and each sentence of the i-th document;
    wherein, if the q-th sentence belongs to the K sentences most similar to the j-th sentence of the i-th document, the K-nearest-neighbor similarity between the q-th sentence and the j-th sentence of the i-th document is the similarity between the q-th sentence and the j-th sentence of the i-th document;
    if the q-th sentence does not belong to the K sentences most similar to the j-th sentence of the i-th document, the K-nearest-neighbor similarity between the q-th sentence and the j-th sentence of the i-th document is zero, wherein j is a positive integer not greater than n_i, and n_i is the number of sentences of the i-th document.
  11. The method according to any one of claims 8 to 10, characterized in that the frequency x_{i,q} of the q-th sentence in the i-th document is the q-th element of the N-dimensional vector x_i,
    wherein

    $$x_i = \sum_{j=1}^{n_i} o_j^i$$

    and, for the N-dimensional vector $o_j^i$: if the l-th sentence among the N sentences belongs to the K sentences most similar to the sentence $s_j^i$, the l-th element of $o_j^i$ is the similarity between the l-th sentence and $s_j^i$; otherwise, the l-th element of $o_j^i$ is zero, wherein $s_j^i$ is the j-th sentence of the i-th document, n_i is the number of sentences of the i-th document, j is a positive integer not greater than n_i, and l is a positive integer not greater than N.
  12. The method according to claim 11, characterized in that the determining the inverse document frequency idf_q of the q-th sentence comprises:
    determining the inverse document frequency idf_q of the q-th sentence according to the following formula:

    $$idf_q = \log_{10} \frac{M}{\left| \{\, i \in \{1, \ldots, M\} : x_{i,q} > 0 \,\} \right|}$$

    wherein |*| denotes the cardinality of a set.
  13. The method according to claim 12, characterized in that the determining the document representation of the i-th document comprises:
    determining the document representation z_i of the i-th document according to the following formula:

    $$z_i = \frac{y_i}{\| y_i \|}$$

    wherein ||*|| denotes the 2-norm, and the q-th element y_{i,q} of the N-dimensional vector y_i is
    $$y_{i,q} = x_{i,q} \cdot idf_q$$
  14. The method according to any one of claims 1 to 13, characterized in that the method further comprises:
    obtaining sentence representations of sentences according to a first model;
    determining the similarity between sentences according to the sentence representations of the sentences;
    wherein the first model is obtained based on multiple word embedding models, and the first model concatenates multiple vectors to obtain the sentence representation of a sentence, the multiple vectors being vectors of the words of the sentence obtained according to the multiple word embedding models.
  15. The method according to claim 14, characterized in that the documents are Chinese documents, and the method further comprises:
    performing word segmentation on sentences to obtain the words of the sentences.
  16. The method according to claim 15, characterized in that the performing word segmentation on sentences comprises:
    obtaining the initial word sequence of a sentence;
    according to a general-purpose dictionary and a domain-specific dictionary of the field the documents belong to, performing inverse maximum matching on the initial word sequence, obtaining the general-purpose words and the domain-specific words in the sentence.
  17. The method according to any one of claims 1 to 16, characterized in that the method further comprises:
    performing document processing according to the document representations.
  18. The method according to claim 17, characterized in that the i-th document is a fault description document corresponding to an unresolved fault;
    the performing document processing according to the document representations comprises:
    determining, according to the document representations, the reference document with the highest similarity to the i-th document among the M documents, wherein the reference document is a fault description document corresponding to a resolved fault, and the solution of the fault corresponding to the reference document is used to handle the fault corresponding to the i-th document.
  19. The method according to any one of claims 1 to 18, characterized in that the similarity is cosine similarity.
  20. An apparatus for document processing, characterized by comprising:
    an acquisition unit (410), configured to acquire M documents;
    a document representation unit (420), configured to determine, according to the similarity between the sentences of an i-th document and the N sentences of the M documents, the frequency x_{i,q} of the q-th sentence among the N sentences in the i-th document, wherein M and N are both integers greater than 1, q is a positive integer not greater than N, and i is a positive integer; and to determine the document representation of the i-th document according to the frequency x_{i,q} of the q-th sentence in the i-th document.
  21. The apparatus according to claim 20, characterized in that, when i is not greater than M, the i-th document is a document among the M documents.
  22. The apparatus according to claim 20, characterized in that, when i is greater than M, the i-th document is a document outside the M documents.
  23. The apparatus according to any one of claims 20 to 22, characterized in that the document representation unit (420) is specifically configured to:
    determine the inverse document frequency idf_q of the q-th sentence according to the frequencies of the q-th sentence in the M documents;
    determine the document representation of the i-th document according to the frequency x_{i,q} of the q-th sentence in the i-th document and the inverse document frequency idf_q of the q-th sentence.
  24. The apparatus according to claim 23, characterized in that the inverse document frequency idf_q of the q-th sentence is negatively correlated with the number of documents among the M documents that contain the q-th sentence.
  25. The apparatus according to claim 24, characterized in that, if the frequency of the q-th sentence in the p-th document among the M documents is greater than zero, the p-th document contains the q-th sentence, wherein p is a positive integer not greater than M.
  26. The apparatus according to any one of claims 20 to 25, characterized in that the frequency x_{i,q} of the q-th sentence in the i-th document is the sum of the similarities between the q-th sentence and each sentence of the i-th document.
  27. The apparatus according to any one of claims 20 to 26, characterized in that the document representation unit (420) is specifically configured to:
    determine the frequency x_{i,q} of the q-th sentence in the i-th document according to the K sentences among the N sentences that are most similar to the sentences of the i-th document, wherein K is a positive integer smaller than N.
  28. The apparatus according to claim 27, characterized in that K is positively correlated with N.
  29. The apparatus according to claim 27 or 28, characterized in that the frequency x_{i,q} of the q-th sentence in the i-th document is the sum of the K-nearest-neighbor similarities between the q-th sentence and each sentence of the i-th document;
    wherein, if the q-th sentence belongs to the K sentences most similar to the j-th sentence of the i-th document, the K-nearest-neighbor similarity between the q-th sentence and the j-th sentence of the i-th document is the similarity between the q-th sentence and the j-th sentence of the i-th document;
    if the q-th sentence does not belong to the K sentences most similar to the j-th sentence of the i-th document, the K-nearest-neighbor similarity between the q-th sentence and the j-th sentence of the i-th document is zero, wherein j is a positive integer not greater than n_i, and n_i is the number of sentences of the i-th document.
  30. The apparatus according to any one of claims 27 to 29, characterized in that the frequency x_{i,q} of the q-th sentence in the i-th document is the q-th element of the N-dimensional vector x_i,
    wherein

    $$x_i = \sum_{j=1}^{n_i} o_j^i$$

    and, for the N-dimensional vector $o_j^i$: if the l-th sentence among the N sentences belongs to the K sentences most similar to the sentence $s_j^i$, the l-th element of $o_j^i$ is the similarity between the l-th sentence and $s_j^i$; otherwise, the l-th element of $o_j^i$ is zero, wherein $s_j^i$ is the j-th sentence of the i-th document, n_i is the number of sentences of the i-th document, j is a positive integer not greater than n_i, and l is a positive integer not greater than N.
  31. The apparatus according to claim 30, characterized in that the document representation unit (420) is specifically configured to:
    determine the inverse document frequency idf_q of the q-th sentence according to the following formula:

    $$idf_q = \log_{10} \frac{M}{\left| \{\, i \in \{1, \ldots, M\} : x_{i,q} > 0 \,\} \right|}$$

    wherein |*| denotes the cardinality of a set.
  32. The apparatus according to claim 31, characterized in that the document representation unit (420) is specifically configured to:
    determine the document representation z_i of the i-th document according to the following formula:

    $$z_i = \frac{y_i}{\| y_i \|}$$

    wherein ||*|| denotes the 2-norm, and the q-th element y_{i,q} of the N-dimensional vector y_i is
    $$y_{i,q} = x_{i,q} \cdot idf_q$$
  33. The apparatus according to any one of claims 20 to 32, characterized in that the apparatus further comprises:
    a sentence representation unit (430), configured to obtain sentence representations of sentences according to a first model;
    wherein the document representation unit (420) is further configured to determine the similarity between sentences according to the sentence representations of the sentences;
    the first model is obtained based on multiple word embedding models, and the first model concatenates multiple vectors to obtain the sentence representation of a sentence, the multiple vectors being vectors of the words of the sentence obtained according to the multiple word embedding models.
  34. The apparatus according to claim 33, characterized in that the documents are Chinese documents, and the apparatus further comprises:
    a word segmentation unit (440), configured to perform word segmentation on sentences to obtain the words of the sentences.
  35. The apparatus according to claim 34, characterized in that the word segmentation unit (440) is specifically configured to:
    obtain the initial word sequence of a sentence;
    according to a general-purpose dictionary and a domain-specific dictionary of the field the documents belong to, perform inverse maximum matching on the initial word sequence, obtaining the general-purpose words and the domain-specific words in the sentence.
  36. The apparatus according to any one of claims 20 to 34, characterized in that the apparatus further comprises:
    a processing unit (450), configured to perform document processing according to the document representations.
  37. The apparatus according to claim 36, characterized in that the i-th document is a fault description document corresponding to an unresolved fault;
    the processing unit (450) is specifically configured to:
    determine, according to the document representations, the reference document with the highest similarity to the i-th document among the M documents, wherein the reference document is a fault description document corresponding to a resolved fault, and the solution of the fault corresponding to the reference document is used to handle the fault corresponding to the i-th document.
  38. The apparatus according to any one of claims 20 to 37, characterized in that the similarity is cosine similarity.
  39. An apparatus for document processing, characterized by comprising:
    a memory (601), configured to store a program;
    a processor (602), configured to execute the program stored in the memory (601); when the program stored in the memory (601) is executed, the processor (602) is configured to execute the document processing method according to any one of claims 1 to 19.
  40. A computer-readable storage medium, characterized in that the computer-readable medium stores program code for execution by a device, the program code comprising instructions for executing the steps in the document processing method according to any one of claims 1 to 19.