CN109446318A - Method and related device for determining the theme of an auto repair document

Publication number
CN109446318A
Authority
CN
China
Prior art keywords
theme
document
probability
vocabulary
sorted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811075837.7A
Other languages
Chinese (zh)
Inventor
刘均
刘新
邓思超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Launch Technology Co Ltd
Original Assignee
Shenzhen Launch Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Launch Technology Co Ltd
Priority to CN201811075837.7A
Publication of CN109446318A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

This application discloses a method and related device for determining the theme of an auto repair document. The method comprises: obtaining a document to be sorted and a maintenance theme; obtaining the feature word set of the document to be sorted; calculating a theme probability according to the feature word set and a vocabulary probability; and determining, according to the theme probability, whether the maintenance theme is taken as the theme of the document to be sorted. With this application, the theme of an auto repair document can be identified accurately, which improves the efficiency of distinguishing auto repair documents and saves maintenance technicians' time.

Description

Method and related device for determining the theme of an auto repair document
Technical field
This application relates to the field of computer technology, and in particular to a method and related device for determining the theme of an auto repair document.
Background
A large number of service documents are generated during auto repair. These documents contain much information related to auto repair, and using this information effectively can raise the maintenance level of an automobile service factory and increase customer satisfaction. Identifying the theme described in an auto repair document makes it possible to provide suggestions and solutions for that theme, which benefits the maintenance process. Types of theme include vehicle type, faulty module, vehicle brand and so on. Because automobile parts are numerous and automobile systems are complex, a single service document usually mentions several vehicle types, components or systems, and a maintenance technician has to read the document through to judge its theme accurately. This costs the maintenance technician a great deal of time and energy.
The prior art uses keyword matching to search an auto repair document for keywords that match a given theme, and then judges the theme of the document from the matches. This method is rather simple, however, and cannot judge the theme of an auto repair document accurately.
Summary of the invention
This application proposes a method and related device for determining the theme of an auto repair document, which can be used to determine the theme of an auto repair document and improve the efficiency of distinguishing auto repair documents.
In a first aspect, this application proposes a method for determining the theme of an auto repair document, comprising:
obtaining a document to be sorted and a maintenance theme, where the document to be sorted contains auto repair information and the maintenance theme is a theme related to auto repair;
obtaining the feature word set of the document to be sorted, where the feature word set is the set of feature words of the document to be sorted;
calculating a theme probability according to the feature word set and a vocabulary probability, where the vocabulary probability is the probability that each feature word in the feature word set semantically expresses the maintenance theme, and the theme probability is the probability that the document to be sorted corresponds to the maintenance theme;
determining, according to the theme probability, whether the maintenance theme is taken as the theme of the document to be sorted.
With reference to the first aspect, in one possible implementation, after obtaining the document to be sorted and the maintenance theme and before calculating the theme probability according to the feature word set and the vocabulary probability, the method further comprises:
obtaining a training document set, where the training document set is a set of training documents and each training document contains auto repair information;
obtaining the training word set of the training document set, where the training word set is the set of feature words of the training document set;
inputting the training word set and the maintenance theme into a latent Dirichlet allocation (LDA) model, and outputting the vocabulary probability.
With reference to the first aspect, in one possible implementation, calculating the theme probability according to the feature word set and the vocabulary probability comprises:
inputting the feature word set and the vocabulary probability into a latent Dirichlet allocation (LDA) model, and outputting the theme probability.
With reference to the first aspect, in one possible implementation, after determining, according to the theme probability, whether the maintenance theme is taken as the theme of the document to be sorted, the method further comprises:
if the maintenance theme is determined to be the theme of the document to be sorted, adding the maintenance theme to the title of the document to be sorted; or storing the document to be sorted in the storage area corresponding to the maintenance theme.
In a second aspect, an embodiment of this application provides a device for determining the theme of an auto repair document, comprising:
a first acquisition unit, configured to obtain a document to be sorted and a maintenance theme, where the document to be sorted contains auto repair information and the maintenance theme is a theme related to auto repair;
a second acquisition unit, configured to obtain the feature word set of the document to be sorted, where the feature word set is the set of feature words of the document to be sorted;
a theme probability calculation unit, configured to calculate a theme probability according to the feature word set and a vocabulary probability, where the vocabulary probability is the probability that each feature word in the feature word set semantically expresses the maintenance theme, and the theme probability is the probability that the document to be sorted corresponds to the maintenance theme;
a determination unit, configured to determine, according to the theme probability, whether the maintenance theme is taken as the theme of the document to be sorted.
With reference to the second aspect, in one possible implementation, the device further comprises:
a third acquisition unit, configured to obtain a training document set, where the training document set is a set of training documents and each training document contains auto repair information;
a fourth acquisition unit, configured to obtain the training word set of the training document set, where the training word set is the set of feature words of the training document set;
a vocabulary probability calculation unit, configured to input the training word set and the maintenance theme into a latent Dirichlet allocation (LDA) model and output the vocabulary probability.
In a third aspect, an embodiment of this application discloses another device for determining the theme of an auto repair document, comprising a processor, a memory, a database unit, a network interface, a communication bus and a user interface. The processor, memory, database unit, network interface and user interface are connected to one another through the communication bus. The memory is used to store a computer program that includes program instructions, and the processor is configured to call the program instructions to execute the method of the first aspect.
In a fourth aspect, an embodiment of this application discloses a computer-readable storage medium. The computer-readable storage medium stores a computer program that includes program instructions, and the program instructions, when executed by a processor, cause the processor to execute the method of the first aspect.
Implementing the embodiments of this application has the following beneficial effects:
In the embodiments of this application, a document to be sorted and a maintenance theme are obtained, the feature word set of the document to be sorted is extracted, a theme probability is calculated according to the feature word set and the vocabulary probability, and whether the maintenance theme is the theme of the auto repair document is determined according to the theme probability. By implementing the embodiments, the theme of an auto repair document can be identified accurately, which improves the efficiency of distinguishing auto repair documents and saves maintenance technicians' time.
Brief description of the drawings
To describe the technical solutions in the embodiments of this application or in the background more clearly, the drawings used in the embodiments or the background are described below.
Fig. 1 is a flow diagram of a method for determining the theme of an auto repair document provided by an embodiment of this application;
Fig. 2 is a flow diagram of a method for calculating the vocabulary probability provided by an embodiment of this application;
Fig. 3 is a structural diagram of a device for determining the theme of an auto repair document provided by an embodiment of this application;
Fig. 4 is a structural diagram of another device for determining the theme of an auto repair document provided by an embodiment of this application;
Fig. 5 is a structural diagram of yet another device for determining the theme of an auto repair document provided by an embodiment of this application.
Detailed description of the embodiments
The terms "first", "second" and so on in the description, claims and drawings of this application are used to distinguish different objects rather than to describe a particular order. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that contains a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product or device.
This application proposes a method and related device for determining the theme of an auto repair document, which can be used to determine the theme of an auto repair document, improve the efficiency with which maintenance technicians distinguish auto repair documents, and save maintenance technicians' time.
The embodiments of this application are described below with reference to the drawings.
Fig. 1 is a flow diagram of a method for determining the theme of an auto repair document provided by an embodiment of this application. The method comprises the following steps:
S101: Obtain the document to be sorted and the maintenance theme.
A large number of service documents are generated during auto repair. These documents contain much information related to auto repair. Determining the themes of these documents to be sorted and using the documents effectively can raise the maintenance level of an automobile service factory and improve maintenance efficiency.
In the embodiment of this application, on the one hand, maintenance data from users or maintenance technicians can be received through a maintenance data receiving channel. For example, a maintenance data upload channel is set up in a mobile phone application, and a car owner can upload maintenance data through it. The upload channel can be placed in application software on mobile terminals such as mobile phones, tablet computers and wearable devices, or in application software on personal computers such as laptops and desktop computers. After the maintenance data is obtained, it can be saved as a service document; formats of the service document include plain text, PDF, DOCX and so on. On the other hand, saved service documents can be obtained from storage devices such as magnetic disks, optical discs and storage servers.
It should be noted that maintenance themes come in many types and can be obtained according to the actual needs of the maintenance process. For example, maintenance themes can be obtained according to automobile brand, in which case they can be brands such as Mercedes-Benz, BMW or Tesla; or according to automobile component, in which case they can be components such as the transmission, clutch or engine; or according to vehicle type, in which case they can be types such as compact car, mid-size car, three-box car or premium car. In the embodiment of this application, maintenance themes of the same type can be saved in a maintenance theme set and numbered, so that the topic model can be trained and used for prediction. For example, a maintenance theme set of length Z can be obtained according to automobile component and numbered 1~Z. After the maintenance theme set is obtained, it can be read and operated on in matrix form, or in the form of a collection such as a stack, which speeds up data processing and computation and makes it easier to match maintenance themes with service documents. When determining the maintenance theme of a document to be sorted, the probability that each maintenance theme in the maintenance theme set corresponds to the service document can be calculated to form a probability set, and the theme of the document to be sorted is determined by comparing the probabilities in that set.
S102: Obtain the feature word set of the document to be sorted.
In specific application scenarios, a topic model or classification model determines the theme of a document to be sorted by analyzing the feature word set of that document. In the embodiment of this application, a text segmentation algorithm can be used to segment the document to be sorted into words, and the stop words in the resulting word list are deleted to obtain the feature word set of the document to be sorted. After the feature word set is obtained, the feature words can be numbered 1~N, where N is the total number of feature words.
It should be noted that the feature word set may contain identical feature words. To count the occurrences of the same feature word under each theme, a dictionary of the feature word set can be constructed. Each feature word appears only once in the dictionary. The feature words in the dictionary can be numbered 1~V, where V is the number of feature words in the dictionary, and N ≥ V.
In the embodiment of this application, the document to be sorted is segmented by a text segmentation algorithm. The text segmentation algorithm can be a hidden Markov model (HMM), the Viterbi algorithm, a conditional random field (CRF) model or a maximum entropy model. Many programming toolkits based on these algorithms exist, and in practice such a toolkit can be loaded to segment text quickly.
A method that uses a text segmentation algorithm to obtain the feature word set and constructs the dictionary of the feature word set is set out below. The implementation steps are:
1) Convert the document format. To make text processing easier, the documents can be uniformly converted into plain text documents.
2) Call a segmentation tool to segment the text. For example, the jieba segmentation tool, written in the Python programming language, can be called. The tool is based on the Viterbi algorithm and a hidden Markov model and supports four segmentation modes in total: full mode, accurate mode, search engine mode, and a keyword extraction mode based on the term frequency-inverse document frequency (TF-IDF) algorithm. Each segmentation mode can be tried in turn, its segmentation efficiency calculated, and the best mode selected according to segmentation efficiency.
3) Handle stop words. After the document to be sorted has been segmented, a word list is obtained. The word list contains stop words such as punctuation marks, conjunctions and modal particles, which have no substantive significance for text classification. A stop-word table can be loaded and the stop words deleted by keyword matching to obtain the feature word set.
4) Construct the dictionary of the feature word set. The feature words in the feature word set are added to the dictionary, and each feature word appears only once in the dictionary. In this embodiment, an open-source tool can be called to add the words to the dictionary; for example, the gensim open-source toolkit can be used to add the feature words to the dictionary and obtain the dictionary of the feature word set.
S103: Calculate the theme probability according to the feature word set and the vocabulary probability.
In the embodiment of this application, the theme probability can be calculated with the LDA model according to the feature word set and the vocabulary probability. The vocabulary probability is the probability that each feature word in the feature word set semantically expresses the maintenance theme; the theme probability is the probability that the document to be sorted corresponds to the maintenance theme.
The embodiment of this application mainly calculates the vocabulary probability with the LDA model. The meaning of the LDA model is described below.
The traditional way to judge whether two documents are related is to check the number of words that occur in both, as in the TF-IDF method. This approach ignores the semantic association behind the text: two documents may share few or no words and still be semantically related. Suppose, for example, that there are two sentences, each representing a document. The first sentence is "Jobs has left us" and the second is "Will the price of Apple phones drop?". The two sentences have no word in common, yet they are semantically related, because both belong to the theme "Apple Inc.". Topic models, which obtain the theme of a document by mining its semantics, arose to address this. The latent Dirichlet allocation (LDA) model is a three-layer Bayesian probabilistic model with a three-layer structure of word, theme and document, and is a commonly used topic model that obtains the theme of a document to be sorted by mining the semantics of the document.
The LDA model defines a document generation process with the following steps: a. for each document, draw a theme from the theme distribution; b. draw a word from the word distribution corresponding to the theme that was drawn; c. repeat this process until every word in the document has been traversed. The LDA model thus generates a document through the process doc → theme → word, where doc is the document, theme is the theme and word is the feature word. In the LDA model, doc is taken to contain M documents, theme to contain K maintenance themes, and word to contain N feature words, with the dictionary corresponding to word containing V words; both the doc → theme process and the theme → word process follow a Dirichlet-multinomial distribution. The parameters of the LDA model include the document-theme distribution parameter $\theta$ and the theme-vocabulary distribution parameter $\varphi$. The document-theme distribution parameter $\theta$ is the set of probabilities that each theme in the theme set is the theme of each document in the document set, and is a matrix with M rows and K columns. The theme-vocabulary distribution parameter $\varphi$ is the set of probabilities that each feature word in the feature word set semantically expresses each theme in the theme set, and is a matrix with K rows and V columns.
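The doc → theme → word generation process can be illustrated with a short Python sketch. It assumes that the document's theme distribution and each theme's word distribution are already given as fixed lists (in the full LDA model they are themselves drawn from Dirichlet priors); the function name `generate_document` and the toy vocabulary are hypothetical.

```python
import random

def generate_document(theta_m, phi, vocab, n_words, rng):
    """Generate one document: for each word position, draw a theme from the
    document's theme distribution theta_m, then draw a word from the word
    distribution phi[k] of the theme that was drawn."""
    words = []
    for _ in range(n_words):
        k = rng.choices(range(len(theta_m)), weights=theta_m)[0]  # doc -> theme
        t = rng.choices(range(len(vocab)), weights=phi[k])[0]     # theme -> word
        words.append(vocab[t])
    return words

vocab = ["clutch", "gearbox", "engine", "piston"]  # V = 4, illustrative only
phi = [[0.5, 0.5, 0.0, 0.0],    # theme 0: transmission-related words
       [0.0, 0.0, 0.5, 0.5]]    # theme 1: engine-related words
theta_m = [1.0, 0.0]            # this document consists entirely of theme 0
doc = generate_document(theta_m, phi, vocab, n_words=6, rng=random.Random(0))
```

Because the document puts all of its weight on theme 0, every generated word comes from that theme's word distribution.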
Training the theme-vocabulary distribution parameter $\varphi$ of the LDA model is the process of solving for the vocabulary probability. Refer to Fig. 2, which shows the flow of the method for calculating the vocabulary probability. The flow comprises the following steps:
S105: Obtain the training document set.
The training document set is a set of training documents, and each training document contains auto repair information. In the embodiment of this application, saved training documents can be obtained from storage devices such as magnetic disks, optical discs and storage servers. To improve the accuracy of the LDA model, the number of training documents can be 5000 or more. After the training documents are obtained, they can be numbered 1~M.
S106: Obtain the training word set of the training document set.
The training word set is a set of training words, and the training words are the feature words of the training documents. The feature word set of each training document in the training document set is obtained, and the collection of these feature word sets is the training word set. The training words in the training word set are numbered 1~N. For the method of obtaining the feature word set of each training document, see step S102.
S107: Input the training word set and the maintenance theme into the LDA model, and output the vocabulary probability.
Through steps S101, S105 and S106, the maintenance theme set and the training word set are obtained. For a training document, the training words it contains are known, but neither the maintenance theme to which the training document corresponds nor the feature words to which each maintenance theme corresponds are known. A sampling algorithm can therefore be used to determine, from the correspondence between training documents and training words, the correspondence between training documents and maintenance themes and the correspondence between maintenance themes and training words. The sampling algorithm may be Metropolis-Hastings sampling, Markov chain Monte Carlo (MCMC) sampling, importance sampling, Gibbs sampling and so on. Because Gibbs sampling offers good performance and accuracy, the embodiment of this application takes the Gibbs sampling algorithm as an example. The Gibbs sampling formula is:
$$p(z_i = k \mid \vec{z}_{\neg i}, \vec{w}) \propto \theta_{m,k}\,\varphi_{k,t} = \frac{n_{m,\neg i}^{(k)} + \alpha_k}{\sum_{k=1}^{K}\left(n_{m,\neg i}^{(k)} + \alpha_k\right)} \cdot \frac{n_{k,\neg i}^{(t)} + \beta_t}{\sum_{t=1}^{V}\left(n_{k,\neg i}^{(t)} + \beta_t\right)}$$
The process of Gibbs sampling is as follows. For each sampling round t = 0, 1, 2, ..., n, the d components of x are sampled in turn:
1) $x_1^{(t+1)} \sim p(x_1 \mid x_2^{(t)}, x_3^{(t)}, \ldots, x_d^{(t)})$
2) $x_2^{(t+1)} \sim p(x_2 \mid x_1^{(t+1)}, x_3^{(t)}, \ldots, x_d^{(t)})$
3) …
4) $x_j^{(t+1)} \sim p(x_j \mid x_1^{(t+1)}, \ldots, x_{j-1}^{(t+1)}, x_{j+1}^{(t)}, \ldots, x_d^{(t)})$
5) …
6) $x_d^{(t+1)} \sim p(x_d \mid x_1^{(t+1)}, x_2^{(t+1)}, \ldots, x_{d-1}^{(t+1)})$
In the Gibbs sampling formula, $p(z_i = k \mid \vec{z}_{\neg i}, \vec{w})$ is the probability that, during sampling, the theme assigned to the currently selected word w is the theme numbered k, where the selected word has dictionary number t and belongs to the document numbered m; $n_m^{(k)}$ is the count of theme k in the document numbered m; $n_k^{(t)}$ is the count of word t under the theme numbered k; the subscript $\neg i$ means that the word currently being sampled is excluded from the count; $\theta_{m,k}$ is the probability that the k-th theme is the theme of document m; $\varphi_{k,t}$ is the probability that the t-th word semantically expresses the k-th theme; $\alpha_k$ is the Dirichlet distribution parameter of $\theta$; $\beta_t$ is the Dirichlet distribution parameter of $\varphi$; K is the number of themes in the theme set; V is the number of feature words in the dictionary. In the Gibbs sampling process, x denotes the object being sampled, its subscript indexes the dimensions of the data, its superscript t indexes the sampling round, and p is the sampling probability. If the difference between the values of p in two adjacent sampling rounds is within a preset range, Gibbs sampling can be considered to have converged.
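As an illustration of how the Gibbs sampling formula is evaluated for a single word, the following Python sketch computes the normalised probability of each theme from count tables. It assumes the standard collapsed Gibbs form for LDA; the function name `topic_conditional`, the toy counts and the prior values are hypothetical, and indices are 0-based here rather than starting at 1.

```python
def topic_conditional(n_mk, n_kt, alpha, beta, m, t):
    """Return normalised probabilities p(z = k | rest) for a word with
    dictionary index t in document m. n_mk[m][k] and n_kt[k][t] are counts
    that already exclude the word being resampled."""
    K, V = len(n_kt), len(n_kt[0])
    weights = []
    for k in range(K):
        # Document-theme factor: how much document m already uses theme k.
        doc_part = (n_mk[m][k] + alpha[k]) / sum(n_mk[m][j] + alpha[j] for j in range(K))
        # Theme-vocabulary factor: how much theme k already uses word t.
        word_part = (n_kt[k][t] + beta[t]) / sum(n_kt[k][v] + beta[v] for v in range(V))
        weights.append(doc_part * word_part)
    total = sum(weights)
    return [w / total for w in weights]

n_mk = [[3, 1]]           # document 0: 3 words on theme 0, 1 word on theme 1
n_kt = [[3, 0], [0, 1]]   # theme 0 holds word 0, theme 1 holds word 1
p = topic_conditional(n_mk, n_kt, alpha=[0.5, 0.5], beta=[0.1, 0.1], m=0, t=0)
```

In this toy case both factors favour theme 0 for word 0, so the first entry of `p` is the larger one.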
Inputting the training word set and the maintenance theme into the LDA model and outputting the vocabulary probability may include the following steps:
1) Random initialization: for each word numbered w in the training word set, randomly assign a maintenance theme numbered z;
2) Scan the training words in the training word set; for each training word w, sample the maintenance theme corresponding to w according to the Gibbs sampling formula, and update the maintenance theme z assigned to w according to the sampling result;
3) Repeat step 2 until the sampling result converges; convergence means that the difference between the values of the Gibbs sampling formula in two successive rounds fluctuates within a preset range;
4) Count the themes assigned to the words in the training word set, and calculate the value of the theme-vocabulary distribution parameter $\varphi$ according to the formula $\varphi_{k,t} = \frac{n_k^{(t)} + \beta_t}{\sum_{t=1}^{V}\left(n_k^{(t)} + \beta_t\right)}$, where $n_k^{(t)}$ is the count of the feature word numbered t under the maintenance theme numbered k; $\varphi_{k,t}$ is the probability that the feature word numbered t semantically expresses the theme numbered k; $\beta_t$ is the Dirichlet distribution parameter of $\varphi$; V is the length of the dictionary of the training word set;
5) Take the value of the theme-vocabulary distribution parameter $\varphi$ as the value of the vocabulary probability.
Steps 1~5 above determine the theme-vocabulary distribution parameter $\varphi$ of the LDA model, which is the vocabulary probability of the training document set. In the embodiment of this application, the vocabulary probability of the document to be sorted can be approximated by the vocabulary probability of the training document set.
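Steps 1~5 can be sketched as a small collapsed Gibbs sampler in plain Python. This is a simplified illustration under several assumptions: the priors are symmetric (one value of alpha and one of beta), a fixed number of sweeps stands in for the convergence check, word indices are 0-based, and the resulting theme indices are unlabeled, so the sketch does not show how they would be tied to named maintenance themes. The function name `train_lda` is hypothetical.

```python
import random

def train_lda(docs, K, V, alpha=0.5, beta=0.1, sweeps=200, seed=0):
    """Collapsed Gibbs sampling for LDA with symmetric priors.
    docs: list of documents, each a list of word indices in 0..V-1.
    Returns phi, a K x V list of theme-vocabulary probabilities."""
    rng = random.Random(seed)
    n_mk = [[0] * K for _ in docs]       # theme counts per document
    n_kt = [[0] * V for _ in range(K)]   # word counts per theme
    n_k = [0] * K                        # total word count per theme
    z = []                               # theme assigned to every word

    def bump(m, k, t, d):
        n_mk[m][k] += d
        n_kt[k][t] += d
        n_k[k] += d

    # Step 1: random initialization of the theme of every word.
    for m, doc in enumerate(docs):
        z.append([rng.randrange(K) for _ in doc])
        for k, t in zip(z[m], doc):
            bump(m, k, t, +1)

    # Steps 2-3: resample every word, sweep after sweep.
    for _ in range(sweeps):
        for m, doc in enumerate(docs):
            for i, t in enumerate(doc):
                bump(m, z[m][i], t, -1)  # exclude the current word from the counts
                # The document-side denominator is the same for all k, so it is dropped.
                weights = [(n_mk[m][k] + alpha) * (n_kt[k][t] + beta) / (n_k[k] + V * beta)
                           for k in range(K)]
                z[m][i] = rng.choices(range(K), weights=weights)[0]
                bump(m, z[m][i], t, +1)

    # Steps 4-5: phi[k][t] = (n_kt + beta) / sum over t of (n_kt + beta).
    return [[(n_kt[k][t] + beta) / (n_k[k] + V * beta) for t in range(V)]
            for k in range(K)]

docs = [[0, 1, 0, 1], [2, 3, 2, 3]]   # two toy documents over a 4-word dictionary
phi = train_lda(docs, K=2, V=4, sweeps=50)
```

Each row of `phi` is a probability distribution over the dictionary, which matches the K-row, V-column shape described for the theme-vocabulary distribution parameter.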
Step S105~S107 give calculate vocabulary probability process, explained later it is above-mentioned according to features described above word set with And the method for vocabulary probability calculation theme probability.
Above-mentioned theme probability is the probability that above-mentioned document to be sorted corresponds to above-mentioned maintenance theme;Above-mentioned foundation features described above Word set and vocabulary probability calculation theme probability, comprising: features described above word set and above-mentioned vocabulary probability are inputted in LDA model, Export above-mentioned theme probability.It is above-mentioned to may comprise steps of according to the above-mentioned theme probability of above-mentioned LDA model calculating:
1) Random initialization: for each word numbered w in the feature word set, randomly assign a maintenance theme numbered z;
2) Substitute the vocabulary probability into the parameters of the Gibbs sampling formula;
3) Scan the feature vocabulary in the feature word set; for each feature vocabulary w, sample the theme corresponding to feature vocabulary w according to the Gibbs sampling formula, and update the maintenance theme z corresponding to feature vocabulary w according to the sampling result;
4) Repeat step 3 until the sampling result converges; convergence means that the difference between the results of two successive passes of Gibbs sampling fluctuates within a preset range;
5) Count the themes corresponding to the feature vocabulary and, according to the formula $\theta_{m,k} = \dfrac{n_m^{(k)} + \alpha_k}{\sum_{k'=1}^{K} \left( n_m^{(k')} + \alpha_{k'} \right)}$, calculate the document-theme distribution parameter $\theta$ of the LDA model, where $\theta_{m,k}$ is the probability that the k-th theme is the theme of document m (if there is only one document to be sorted, m is always 1); $n_m^{(k)}$ is the count of words in the document numbered m that are assigned to the theme numbered k; $\alpha_k$ is the Dirichlet distribution parameter corresponding to theme k; and K is the number of themes in the theme set;
6) Determine the theme probability according to the document-theme distribution parameter $\theta$.
Steps 1 to 6 above determine the document-theme distribution parameter $\theta$ of the LDA model. The document-theme distribution parameter $\theta$ is a vector of length K. This vector is the set of probabilities that the maintenance themes in the theme set correspond to the document to be sorted, so it contains the theme probability of the maintenance theme.
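The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the function name `infer_theme_probability`, the symmetric prior `alpha`, the fixed iteration count (used here in place of a convergence check), and the toy `phi` matrix are all hypothetical. The sampling weight for theme k is taken proportional to `(n_k + alpha) * phi[k][w]`, a common form of the Gibbs update when the theme-vocabulary distribution is held fixed.

```python
import random

def infer_theme_probability(word_ids, phi, alpha=0.5, iterations=200, seed=0):
    """Estimate the document-theme vector theta for a single document.

    word_ids: feature vocabulary numbers of the document to be sorted.
    phi: K x V nested list; phi[k][w] is the vocabulary probability of
         word w under theme k (held fixed during sampling).
    """
    rng = random.Random(seed)
    K = len(phi)
    # Step 1: randomly assign a theme z to every word.
    z = [rng.randrange(K) for _ in word_ids]
    counts = [0] * K  # counts[k]: words currently assigned to theme k
    for k in z:
        counts[k] += 1
    # Steps 3-4: repeated scans (a fixed number stands in for a convergence test).
    for _ in range(iterations):
        for i, w in enumerate(word_ids):
            counts[z[i]] -= 1  # exclude the current word from the counts
            # Step 2: the vocabulary probability phi enters the sampling weights.
            weights = [(counts[k] + alpha) * phi[k][w] for k in range(K)]
            z[i] = rng.choices(range(K), weights=weights)[0]
            counts[z[i]] += 1
    # Step 5: theta_k = (n_k + alpha) / sum over k of (n_k + alpha).
    total = sum(counts) + K * alpha
    return [(counts[k] + alpha) / total for k in range(K)]

# Toy example: 2 themes over a 4-word dictionary (hypothetical values).
phi = [[0.45, 0.45, 0.05, 0.05],
       [0.05, 0.05, 0.45, 0.45]]
theta = infer_theme_probability([0, 1, 0, 1, 1, 2], phi)
```

The returned list has one entry per theme and sums to 1, matching the description of the document-theme vector of length K.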
S104: According to the theme probability, determine whether the maintenance theme is the theme of the document to be sorted.
Step S103 has calculated the theme probability of the document to be sorted. In the embodiments of the present application, it may be specified that if the theme probability is greater than a preset value, the maintenance theme corresponding to the theme probability is a maintenance theme of the document to be sorted; alternatively, if the theme probability is the maximum value in the vector $\theta$, the maintenance theme corresponding to that theme probability is the theme of the document to be sorted.
By implementing the embodiments of the present application, the document to be sorted and the maintenance theme can be obtained, the feature word set of the document to be sorted can be extracted, the theme probability can be calculated according to the feature word set and the vocabulary probability, and whether the maintenance theme is the theme of the auto repair document can be determined according to the theme probability. It can be seen that implementing the embodiments of the present application makes it possible to accurately identify the theme of an auto repair document, improves the efficiency of distinguishing auto repair documents, and saves maintenance technicians' time.
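The two decision rules described here (a preset threshold, or the maximum entry of the document-theme vector) can be illustrated with a short Python sketch. The function name `is_document_theme` and the sample probabilities are hypothetical and are used only for illustration.

```python
def is_document_theme(theta, theme_index, threshold=None):
    """Return True if the theme at theme_index is accepted as the document theme.

    If threshold is given, apply the "greater than a preset value" rule;
    otherwise apply the "maximum value in theta" rule.
    """
    p = theta[theme_index]
    if threshold is not None:
        return p > threshold
    return p == max(theta)

theta = [0.1, 0.7, 0.2]  # hypothetical theme probabilities
print(is_document_theme(theta, 1))                  # True
print(is_document_theme(theta, 2, threshold=0.15))  # True
print(is_document_theme(theta, 0, threshold=0.15))  # False
```

With the threshold rule a document may be assigned several maintenance themes, whereas the maximum rule selects only the most probable theme (or several, if they tie).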
Fig. 3 is a schematic structural diagram of a device for determining the theme of an auto repair document provided by an embodiment of the present application. The device may include:
First acquisition unit 301, for obtaining a document to be sorted and a maintenance theme, wherein the document to be sorted contains auto repair information and the maintenance theme is a theme related to auto repair;
Second acquisition unit 302, for obtaining the feature word set of the document to be sorted;
Theme probability calculation unit 303, for calculating the theme probability according to the feature word set and the vocabulary probability;
Determination unit 304, for determining, according to the theme probability, whether the maintenance theme is the theme of the document to be sorted.
In the embodiments of the present application, the first acquisition unit 301 is specifically used for receiving maintenance data to be sorted and converting the maintenance data into the document to be sorted. Specifically, the first acquisition unit 301 can receive the maintenance data from a user or a maintenance technician through a maintenance data receiving channel. For example, a maintenance data upload channel is provided in a mobile phone application, and a car owner can upload maintenance data through this channel. The maintenance data upload channel can be placed in the application software of mobile terminals such as mobile phones, tablet computers and wearable devices, or in the application software of PCs such as laptops and desktop computers. After obtaining the maintenance data, the first acquisition unit 301 can also save the maintenance data as a service document; the formats of the service document include plain text, PDF, DOCX and the like.
In one possible implementation, the first acquisition unit 301 can also obtain saved service documents from storage devices such as magnetic disks, optical discs and storage servers. The first acquisition unit 301 can obtain the maintenance theme; optionally, the first acquisition unit 301 can also merge maintenance themes of the same type into a maintenance theme set and number the maintenance themes in the set from 1 to Z, where Z is the length of the maintenance theme set.
In the embodiments of the present application, the second acquisition unit 302 is specifically used for segmenting the document to be sorted by a text segmentation method to obtain the vocabulary set of the document to be sorted, and for deleting the stop words in the vocabulary set to obtain the feature word set.
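As a rough illustration of segmentation followed by stop-word removal, consider the Python sketch below. It is a sketch under assumptions: the regular-expression tokenizer is only a stand-in for whatever text segmentation method is actually used (Chinese text would need a dedicated word segmenter), and the stop-word list and the function name `extract_feature_words` are hypothetical.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "a", "is", "and", "of", "to", "after"}

def extract_feature_words(document, stop_words=STOP_WORDS):
    """Segment a document into words, then delete the stop words."""
    # Segmentation stand-in: lowercase runs of letters and digits.
    vocabulary = re.findall(r"[a-z0-9]+", document.lower())
    return [word for word in vocabulary if word not in stop_words]

features = extract_feature_words("The engine stalls after the oil pump is replaced")
print(features)  # ['engine', 'stalls', 'oil', 'pump', 'replaced']
```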
In one possible implementation, the second acquisition unit 302 is also used for numbering the feature vocabulary in the feature word set; the second acquisition unit 302 is also used for adding the feature vocabulary to a dictionary and numbering the feature vocabulary in the dictionary.
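One simple way to number feature vocabulary through a dictionary is sketched below in Python. The function names `build_dictionary` and `number_feature_words` are hypothetical, and numbering from 0 in order of first appearance is an assumption made for the example; the text does not specify the numbering scheme.

```python
def build_dictionary(feature_words):
    """Assign each distinct feature word a number in order of first appearance."""
    dictionary = {}
    for word in feature_words:
        if word not in dictionary:
            dictionary[word] = len(dictionary)
    return dictionary

def number_feature_words(feature_words, dictionary):
    """Replace each feature word with its number in the dictionary."""
    return [dictionary[word] for word in feature_words]

words = ["oil", "pump", "leak", "oil"]
dictionary = build_dictionary(words)
print(dictionary)                               # {'oil': 0, 'pump': 1, 'leak': 2}
print(number_feature_words(words, dictionary))  # [0, 1, 2, 0]
```

The length of the dictionary then gives the number of distinct feature words, which is the quantity a topic model needs as its vocabulary size.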
As shown in Fig. 4, the above device further includes:
Third acquiring unit 305, for obtaining a training document set, wherein the training document set is a set of training documents and each training document contains auto repair information;
Fourth acquiring unit 306, for obtaining the training word set of the training document set, wherein the training word set is a set of training vocabulary and the training vocabulary is the feature vocabulary of the training documents;
Vocabulary probability calculation unit 307, for inputting the training word set and the maintenance theme into the LDA model and outputting the vocabulary probability.
In the embodiments of the present application, the third acquiring unit 305 can also be used for numbering the training vocabulary of the training document set. The fourth acquiring unit 306 can also be used for numbering the feature vocabulary in the training word set, and for building a dictionary of the training word set and numbering the feature vocabulary in the dictionary.
In the embodiments of the present application, the vocabulary probability calculation unit 307 is specifically used for executing the method in step S107 to determine the vocabulary probability of the feature word set.
In the embodiments of the present application, the theme probability calculation unit 303 is specifically used for inputting the feature word set and the vocabulary probability into the LDA model and outputting the theme probability.
It can be seen that the above device for determining the theme of an auto repair document can obtain the document to be sorted and the maintenance theme, extract the feature word set of the document to be sorted, calculate the theme probability according to the feature word set and the vocabulary probability, and determine according to the theme probability whether the maintenance theme is the theme of the auto repair document. The device can accurately identify the theme of an auto repair document, improves the efficiency of distinguishing auto repair documents, and saves maintenance technicians' time.
Referring to Fig. 5, Fig. 5 is a schematic structural diagram of another device for determining the theme of an auto repair document provided by an embodiment of the present application. The device includes: at least one processor 501, such as a central processing unit (CPU), at least one network interface 502, a user interface 503, a memory 504, a database unit 505 and at least one communication bus 506. The communication bus 506 can be a group of parallel data lines that carry address, data and control signals and realize the connection and communication between these components. The user interface 503 may include a display, a keyboard and the like. The memory 504 can be a high-speed random access memory (RAM) or a non-volatile memory, for example at least one read-only memory (ROM). Optionally, the memory 504 can also be at least one storage device located remotely from the processor 501. As shown in Fig. 5, the memory 504, as a computer storage medium, may include an operating system, a network communication module, a user interface module and a data processing program.
The network interface 502 is mainly used for connecting to a client for data communication; the processor 501 can be used for calling the data processing program stored in the memory 504 and executing the following operations:
1) Receive, through the network interface 502, the maintenance data sent by the client, and convert it into the document to be sorted.
2) Store the maintenance data received by the network interface 502 in the database unit 505, and save the maintenance data in the database unit 505 as the document to be sorted.
4) Obtain the feature word set of the document to be sorted, specifically: segment the document to be sorted by a text segmentation method to obtain the vocabulary set of the document to be sorted; delete the stop words in the vocabulary set to obtain the feature word set.
5) Calculate the theme probability according to the feature word set and the vocabulary probability, specifically: input the feature word set and the vocabulary probability into the LDA model, and output the theme probability.
6) Obtain the training document set from the database unit 505, wherein the training document set is a set of training documents and each training document contains auto repair information; obtain the training word set of the training document set, wherein the training word set is the set of feature vocabulary of the training document set; input the training word set and the maintenance theme into the LDA model, and output the vocabulary probability.
In the embodiments of the present application, the data processing program stored in the memory 504 includes a program related to text segmentation, and the processor 501 calls this program to segment the document to be sorted and the training documents. In one possible implementation, the processor 501 can also call the data processing program in the memory 504 to number the feature word set and the training word set; the processor 501 can also call the data processing program in the memory 504 to build the dictionary of the training word set and the dictionary of the feature word set.
In the embodiments of the present application, the user interface 503 includes a display and a keyboard and is used for interacting with the user; the network communication module in the memory is used for network communication with a client or a server.
In the embodiments of the present application, the processor 501 can call the data processing program stored in the memory 504 to obtain the maintenance theme; optionally, the processor 501 can also call the data processing program stored in the memory 504 to merge maintenance themes of the same type into a maintenance theme set and number the maintenance themes in the set from 1 to Z, where Z is the length of the maintenance theme set.
In the embodiments of the present application, the processor 501 can call the data processing program stored in the memory 504 to input the feature word set and the vocabulary probability into the LDA model and output the theme probability, which specifically includes the following operations:
1) Random initialization: for each word numbered w in the feature word set, randomly assign a maintenance theme numbered z;
2) Substitute the vocabulary probability into the parameters of the Gibbs sampling formula;
3) Scan the feature vocabulary in the feature word set; for each feature vocabulary w, sample the theme corresponding to feature vocabulary w according to the Gibbs sampling formula, and update the maintenance theme z corresponding to feature vocabulary w according to the sampling result;
4) Repeat step 3 until the sampling result converges; convergence means that the difference between the results of two successive passes of Gibbs sampling fluctuates within a preset range;
5) Count the themes corresponding to the feature vocabulary and, according to the formula $\theta_{m,k} = \dfrac{n_m^{(k)} + \alpha_k}{\sum_{k'=1}^{K} \left( n_m^{(k')} + \alpha_{k'} \right)}$, calculate the document-theme distribution parameter $\theta$ of the LDA model, where $\theta_{m,k}$ is the probability that the k-th theme is the theme of document m (if there is only one document to be sorted, m is always 1); $n_m^{(k)}$ is the count of words in the document numbered m that are assigned to the theme numbered k; $\alpha_k$ is the Dirichlet distribution parameter corresponding to theme k; and K is the number of themes in the theme set;
6) Determine the theme probability according to the document-theme distribution parameter $\theta$.
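As a worked illustration of the document-theme estimate in step 5, assume (hypothetically) a single document ($m = 1$), $K = 3$ themes, a symmetric prior $\alpha_k = 0.5$, and theme counts $n_1^{(1)} = 3$, $n_1^{(2)} = 1$, $n_1^{(3)} = 0$ after sampling. Writing the estimate as the smoothed count of theme k divided by the sum of smoothed counts over all themes gives:

$$
\theta_{1,k} = \frac{n_1^{(k)} + \alpha_k}{\sum_{k'=1}^{3} \left( n_1^{(k')} + \alpha_{k'} \right)}, \qquad
\sum_{k'=1}^{3} \left( n_1^{(k')} + \alpha_{k'} \right) = 3.5 + 1.5 + 0.5 = 5.5
$$

$$
\theta_{1,1} = \frac{3.5}{5.5} \approx 0.636, \qquad
\theta_{1,2} = \frac{1.5}{5.5} \approx 0.273, \qquad
\theta_{1,3} = \frac{0.5}{5.5} \approx 0.091
$$

The three values sum to 1, and the prior $\alpha_k$ keeps the third theme's probability above zero even though no word was assigned to it.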
The processor 501 can also call the data processing program stored in the memory 504 to input the training word set and the maintenance theme into the LDA model and output the vocabulary probability, which specifically includes the following operations:
1) Random initialization: for each word numbered w in the training word set, randomly assign a maintenance theme numbered z;
2) Scan the training vocabulary in the training word set; for each training vocabulary w, sample the maintenance theme corresponding to training vocabulary w according to the Gibbs sampling formula, and update the maintenance theme z corresponding to training vocabulary w according to the sampling result;
3) Repeat step 2 until the sampling result converges; convergence means that the difference between the results of two successive passes of Gibbs sampling fluctuates within a preset range;
4) Count the themes matched to the vocabulary in the training word set and, according to the formula $\varphi_{k,t} = \dfrac{n_k^{(t)} + \beta_t}{\sum_{t'=1}^{V} \left( n_k^{(t')} + \beta_{t'} \right)}$, calculate the value of the theme-vocabulary distribution parameter $\varphi$, where $n_k^{(t)}$ is the count of the feature vocabulary numbered t assigned to the maintenance theme numbered k; $\varphi_{k,t}$ is the probability that the feature vocabulary numbered t semantically expresses the theme numbered k; $\beta_t$ is the Dirichlet distribution parameter corresponding to feature vocabulary t; and V is the length of the dictionary of the training word set;
5) Determine the value of the theme-vocabulary distribution parameter $\varphi$ as the value of the vocabulary probability.
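The training steps above can be sketched as a collapsed Gibbs sampler for LDA in Python. This is an illustrative sketch under assumptions rather than the applicant's implementation: the function name `train_vocabulary_probability`, the symmetric priors `alpha` and `beta`, and the fixed iteration count (used instead of a convergence check) are hypothetical. Themes are anonymous indices 0 to K-1 here; how they are tied to named maintenance themes is not shown.

```python
import random

def train_vocabulary_probability(docs, K, V, alpha=0.5, beta=0.1,
                                 iterations=100, seed=0):
    """Collapsed Gibbs sampling for LDA; returns phi as a K x V nested list.

    docs: list of documents, each a list of training vocabulary numbers in [0, V).
    """
    rng = random.Random(seed)
    n_mk = [[0] * K for _ in docs]      # theme counts per document
    n_kt = [[0] * V for _ in range(K)]  # word counts per theme
    n_k = [0] * K                       # total words per theme
    z = []
    # Step 1: random initialization of a theme for every word.
    for m, doc in enumerate(docs):
        z_m = []
        for t in doc:
            k = rng.randrange(K)
            z_m.append(k)
            n_mk[m][k] += 1
            n_kt[k][t] += 1
            n_k[k] += 1
        z.append(z_m)
    # Steps 2-3: repeated scans over all words.
    for _ in range(iterations):
        for m, doc in enumerate(docs):
            for i, t in enumerate(doc):
                k = z[m][i]
                # Remove the current assignment from the counts.
                n_mk[m][k] -= 1
                n_kt[k][t] -= 1
                n_k[k] -= 1
                # Conditional distribution over themes for this word.
                weights = [(n_mk[m][j] + alpha) * (n_kt[j][t] + beta)
                           / (n_k[j] + V * beta) for j in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[m][i] = k
                n_mk[m][k] += 1
                n_kt[k][t] += 1
                n_k[k] += 1
    # Steps 4-5: phi[k][t] = (n_kt + beta) / sum over t of (n_kt + beta).
    return [[(n_kt[k][t] + beta) / (n_k[k] + V * beta) for t in range(V)]
            for k in range(K)]

# Toy training set: two short documents over a 4-word dictionary.
phi = train_vocabulary_probability([[0, 1, 0], [2, 3, 3]], K=2, V=4)
```

Each row of `phi` is a probability distribution over the dictionary, so it can be passed as the fixed vocabulary probability when the theme probability of a new document is estimated.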
The processor 501 can also be used for calling the data processing program stored in the memory 504 to determine whether the maintenance theme is the theme of the document to be sorted.
It can be seen that the above device for determining the theme of an auto repair document can obtain the document to be sorted and the maintenance theme, extract the feature word set of the document to be sorted, calculate the theme probability according to the feature word set and the vocabulary probability, and determine according to the theme probability whether the maintenance theme is the theme of the auto repair document. The device can accurately identify the theme of an auto repair document, improves the efficiency of distinguishing auto repair documents, and saves maintenance technicians' time.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium includes read-only memory (ROM), random access memory (RAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium that can be used to carry or store data.
The method and related device for determining the theme of an auto repair document disclosed in the embodiments of the present application have been described in detail above. Specific examples are used herein to illustrate the principle and implementation of the present application, and the description of the above embodiments is only intended to help understand the method of the present application and its core idea. At the same time, for those skilled in the art, there will be changes in the specific implementation and the scope of application according to the idea of the present application. In conclusion, the content of this specification should not be construed as a limitation on the present application.

Claims (10)

1. A method for determining the theme of an auto repair document, characterized by comprising:
Obtaining a document to be sorted and a maintenance theme, wherein the document to be sorted contains auto repair information and the maintenance theme is a theme related to auto repair;
Obtaining the feature word set of the document to be sorted, wherein the feature word set is the set of feature vocabulary of the document to be sorted;
Calculating a theme probability according to the feature word set and a vocabulary probability, wherein the vocabulary probability is the probability that each feature vocabulary in the feature word set semantically expresses the maintenance theme, and the theme probability is the probability that the document to be sorted corresponds to the maintenance theme;
Determining, according to the theme probability, whether the maintenance theme is the theme of the document to be sorted.
2. The method according to claim 1, characterized in that obtaining the document to be sorted and the maintenance theme comprises:
Receiving maintenance data to be sorted; converting the maintenance data into the document to be sorted.
3. The method according to claim 1, characterized in that obtaining the feature word set of the document to be sorted comprises:
Segmenting the document to be sorted by a text segmentation method to obtain the vocabulary set of the document to be sorted;
Deleting the stop words in the vocabulary set to obtain the feature word set.
4. The method according to claim 1, characterized in that, after obtaining the document to be sorted and the maintenance theme and before calculating the theme probability according to the feature word set and the vocabulary probability, the method further comprises:
Obtaining a training document set, wherein the training document set is a set of training documents and each training document contains auto repair information;
Obtaining the training word set of the training document set, wherein the training word set is the set of feature vocabulary of the training document set;
Inputting the training word set and the maintenance theme into a latent Dirichlet allocation (LDA) model, and outputting the vocabulary probability.
5. The method according to claim 1 or 4, characterized in that calculating the theme probability according to the feature word set and the vocabulary probability comprises:
Inputting the feature word set and the vocabulary probability into a latent Dirichlet allocation (LDA) model, and outputting the theme probability.
6. A device for determining the theme of an auto repair document, characterized by comprising:
A first acquisition unit, for obtaining a document to be sorted and a maintenance theme, wherein the document to be sorted contains auto repair information and the maintenance theme is a theme related to auto repair;
A second acquisition unit, for obtaining the feature word set of the document to be sorted, wherein the feature word set is the set of feature vocabulary of the document to be sorted;
A theme probability calculation unit, for calculating a theme probability according to the feature word set and a vocabulary probability, wherein the vocabulary probability is the probability that each feature vocabulary in the feature word set semantically expresses the maintenance theme, and the theme probability is the probability that the document to be sorted corresponds to the maintenance theme;
A determination unit, for determining, according to the theme probability, whether the maintenance theme is the theme of the document to be sorted.
7. The device according to claim 6, characterized by further comprising:
A third acquiring unit, for obtaining a training document set, wherein the training document set is a set of training documents and each training document contains auto repair information;
A fourth acquiring unit, for obtaining the training word set of the training document set, wherein the training word set is the set of feature vocabulary of the training document set;
A vocabulary probability calculation unit, for inputting the training word set and the maintenance theme into a latent Dirichlet allocation (LDA) model and outputting the vocabulary probability.
8. The device according to claim 6 or 7, characterized in that the calculation unit is specifically used for inputting the feature word set and the vocabulary probability into a latent Dirichlet allocation (LDA) model and outputting the theme probability.
9. A device for determining the theme of an auto repair document, characterized by comprising a processor, a memory, a database unit, a network interface, a communication bus and a user interface; the processor, the memory, the database unit, the network interface and the user interface are connected to each other through the communication bus; wherein the memory is used for storing a computer program, the computer program includes program instructions, and the processor is configured to call the program instructions to execute the method for determining the theme of an auto repair document according to any one of claims 1 to 6.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program includes program instructions, and the program instructions, when executed by a processor, cause the processor to execute the method for determining the theme of an auto repair document according to any one of claims 1 to 5.
CN201811075837.7A 2018-09-14 2018-09-14 A kind of method and relevant device of determining auto repair document subject matter Pending CN109446318A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811075837.7A CN109446318A (en) 2018-09-14 2018-09-14 A kind of method and relevant device of determining auto repair document subject matter

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811075837.7A CN109446318A (en) 2018-09-14 2018-09-14 A kind of method and relevant device of determining auto repair document subject matter

Publications (1)

Publication Number Publication Date
CN109446318A true CN109446318A (en) 2019-03-08

Family

ID=65532568

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811075837.7A Pending CN109446318A (en) 2018-09-14 2018-09-14 A kind of method and relevant device of determining auto repair document subject matter

Country Status (1)

Country Link
CN (1) CN109446318A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190308