CN109446318A - Method and related device for determining the topic of an automobile repair document
- Publication number: CN109446318A
- Application number: CN201811075837.7A
- Authority: CN (China)
- Prior art keywords: theme, document, probability, vocabulary, sorted
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
- G06F40/117—Tagging; Marking up; Designating a block; Setting of attributes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
This application discloses a method and a related device for determining the topic of an automobile repair document. The method comprises: obtaining a document to be classified and a repair topic; obtaining the feature word set of the document to be classified; calculating a topic probability from the feature word set and the word probabilities; and determining, according to the topic probability, whether the repair topic is a topic of the document to be classified. With this application, the topic of an automobile repair document can be identified accurately, which makes repair documents faster to tell apart and saves repair technicians' time.
Description
Technical field
This application relates to the field of computer technology, and in particular to a method and a related device for determining the topic of an automobile repair document.
Background
Automobile repair generates a large number of repair documents. These documents contain a great deal of repair-related information, and using that information effectively can raise the service level of a repair shop and increase customer satisfaction. Identifying the topic that a repair document describes makes it possible to offer suggestions and solutions specific to that topic, which benefits the repair process. Topic types include vehicle type, faulty module, vehicle brand and so on. Because automobiles have many parts and complex systems, a single repair document usually mentions several vehicle types, components or systems, and a repair technician has to read the whole document to judge its topic accurately. This costs the technician a great deal of time and effort.
The prior art uses keyword matching to search a repair document for keywords that match a given topic, and judges the topic of the document from the matches. This approach is too simple to judge the topic of an automobile repair document accurately.
Summary of the invention
This application proposes a method and a related device for determining the topic of an automobile repair document, which can be used to determine the topic of such a document and make repair documents faster to tell apart.
In a first aspect, this application proposes a method for determining the topic of an automobile repair document, comprising:
obtaining a document to be classified and a repair topic, where the document to be classified contains automobile repair information and the repair topic is a topic related to automobile repair;
obtaining the feature word set of the document to be classified, where the feature word set is the set of feature words of the document to be classified;
calculating a topic probability from the feature word set and the word probabilities, where a word probability is the probability that a feature word in the feature word set semantically expresses the repair topic, and the topic probability is the probability that the document to be classified corresponds to the repair topic;
determining, according to the topic probability, whether the repair topic is a topic of the document to be classified.
With reference to the first aspect, in one possible implementation, after the document to be classified and the repair topic are obtained and before the topic probability is calculated from the feature word set and the word probabilities, the method further comprises:
obtaining a training document set, where the training document set is a set of training documents and each training document contains automobile repair information;
obtaining the training word set of the training document set, where the training word set is the set of feature words of the training document set;
inputting the training word set and the repair topic into a latent Dirichlet allocation (LDA) model, and outputting the word probabilities.
With reference to the first aspect, in one possible implementation, calculating the topic probability from the feature word set and the word probabilities comprises:
inputting the feature word set and the word probabilities into the latent Dirichlet allocation (LDA) model, and outputting the topic probability.
With reference to the first aspect, in one possible implementation, after it is determined according to the topic probability whether the repair topic is a topic of the document to be classified, the method further comprises:
if the repair topic is determined to be a topic of the document to be classified, adding the repair topic to the title of the document to be classified; or storing the document to be classified in the storage area corresponding to the repair topic.
In a second aspect, the embodiments of this application provide a device for determining the topic of an automobile repair document, comprising:
a first acquisition unit, configured to obtain a document to be classified and a repair topic, where the document to be classified contains automobile repair information and the repair topic is a topic related to automobile repair;
a second acquisition unit, configured to obtain the feature word set of the document to be classified, where the feature word set is the set of feature words of the document to be classified;
a topic probability calculation unit, configured to calculate a topic probability from the feature word set and the word probabilities, where a word probability is the probability that a feature word in the feature word set semantically expresses the repair topic, and the topic probability is the probability that the document to be classified corresponds to the repair topic;
a determination unit, configured to determine, according to the topic probability, whether the repair topic is a topic of the document to be classified.
With reference to the second aspect, in one possible implementation, the device further comprises:
a third acquisition unit, configured to obtain a training document set, where the training document set is a set of training documents and each training document contains automobile repair information;
a fourth acquisition unit, configured to obtain the training word set of the training document set, where the training word set is the set of feature words of the training document set;
a word probability calculation unit, configured to input the training word set and the repair topic into a latent Dirichlet allocation (LDA) model and output the word probabilities.
A third aspect of the embodiments of this application discloses another device for determining the topic of an automobile repair document, comprising a processor, a memory, a database unit, a network interface, a communication bus and a user interface. The processor, the memory, the database unit, the network interface and the user interface are connected to one another through the communication bus. The memory stores a computer program that includes program instructions, and the processor is configured to call the program instructions to execute the method of the first aspect.
A fourth aspect of the embodiments of this application discloses a computer-readable storage medium that stores a computer program. The computer program includes program instructions that, when executed by a processor, cause the processor to execute the method of the first aspect.
Implementing the embodiments of this application has the following beneficial effects:
In the embodiments of this application, a document to be classified and a repair topic are obtained, the feature word set of the document to be classified is extracted, a topic probability is calculated from the feature word set and the word probabilities, and the topic probability is used to determine whether the repair topic is a topic of the repair document. Implementing the embodiments of this application therefore makes it possible to identify the topic of an automobile repair document accurately, makes repair documents faster to tell apart, and saves repair technicians' time.
Brief description of the drawings
To describe the technical solutions in the embodiments of this application or in the background more clearly, the drawings used in the embodiments or the background are described below.
Fig. 1 is a schematic flowchart of a method for determining the topic of an automobile repair document provided by the embodiments of this application;
Fig. 2 is a schematic flowchart of a method for calculating word probabilities provided by the embodiments of this application;
Fig. 3 is a schematic structural diagram of a device for determining the topic of an automobile repair document provided by the embodiments of this application;
Fig. 4 is a schematic structural diagram of another device for determining the topic of an automobile repair document provided by the embodiments of this application;
Fig. 5 is a schematic structural diagram of a further device for determining the topic of an automobile repair document provided by the embodiments of this application.
Detailed description of the embodiments
The terms "first", "second" and so on in the description, the claims and the drawings of this application are used to distinguish different objects, not to describe a particular order. The terms "include" and "have", and any variants of them, are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that contains a series of steps or units is not limited to the listed steps or units; it may optionally include steps or units that are not listed, or other steps or units inherent to the process, method, product or device.
This application proposes a method and a related device for determining the topic of an automobile repair document, which can be used to determine the topic of such a document, make it faster for repair technicians to tell repair documents apart, and save their time.
The embodiments of this application are described below with reference to the drawings.
Fig. 1 is a schematic flowchart of a method for determining the topic of an automobile repair document provided by the embodiments of this application. The method includes the following steps:
S101: obtain a document to be classified and a repair topic.
Automobile repair generates a large number of repair documents. These documents contain a great deal of repair-related information. Determining the topics of these documents to be classified and using the documents effectively can raise the service level of a repair shop and improve repair efficiency.
In the embodiments of this application, repair data from users or repair technicians can, on the one hand, be received through a repair data receiving channel. For example, a repair data upload channel can be provided in a mobile phone application so that car owners can upload repair data through it. The upload channel can be placed in an application on a mobile terminal such as a mobile phone, tablet computer or wearable device, or in an application on a personal computer such as a laptop or desktop computer. After the repair data is obtained, it can be saved as a repair document; repair document formats include plain text, PDF, DOCX and so on. On the other hand, previously saved repair documents can be obtained from storage devices such as magnetic disks, optical discs or storage servers.
It should be noted that repair topics come in many types, and the repair topic can be chosen according to the actual needs of the repair process. For example, repair topics can be obtained by automobile brand, such as Mercedes-Benz, BMW or Tesla; by automobile component, such as transmission, clutch or engine; or by vehicle type, such as compact car, mid-size car, sedan (three-box car) or luxury car. In the embodiments of this application, repair topics of the same type can be saved in a repair topic set and numbered, so that the topic model can be trained and used for prediction. For example, a repair topic set of length Z can be obtained by automobile component and numbered 1 to Z. Once the repair topic set has been obtained, it can be read and operated on in matrix form, or in the form of a stack, which speeds up data processing and computation and makes it easier to match repair topics to repair documents. When determining the repair topic of a document to be classified, the probability that each repair topic in the set corresponds to the repair document can be calculated, giving a set of probabilities; the topic of the document to be classified is then determined by comparing the probabilities in that set.
S102: obtain the feature word set of the document to be classified.
In a concrete application scenario, a topic model or classification model analyzes the feature word set of the repair document to be classified in order to determine its topic. In the embodiments of this application, the document to be classified can be segmented with a text segmentation algorithm, and the stop words can be deleted from the resulting word list to obtain the feature word set of the document. After the feature word set is obtained, its words can be numbered 1 to N, where N is the total number of feature words.
It should be noted that the feature word set may contain repeated feature words. To count how often the same feature word occurs under each topic, a dictionary of the feature word set can be built. Each feature word appears only once in the dictionary, and the words in the dictionary can be numbered 1 to V, where V is the number of feature words in the dictionary. It follows that N ≥ V.
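The relationship between the feature word set (N entries, repeats allowed) and its dictionary (V distinct entries) can be illustrated with a minimal sketch. The word list below is a made-up example, not data from the application, and all variable names are hypothetical.

```python
# Hypothetical feature word set of one document, after segmentation
# and stop-word removal. Repeated words are kept.
feature_words = ["clutch", "slip", "gearbox", "clutch", "noise", "gearbox", "clutch"]

# Number the feature words 1..N in order of appearance.
numbered_words = list(enumerate(feature_words, start=1))
N = len(feature_words)

# The dictionary keeps each distinct word once, numbered 1..V.
dictionary = {}
for word in feature_words:
    if word not in dictionary:
        dictionary[word] = len(dictionary) + 1
V = len(dictionary)

# The document expressed as dictionary ids, the form a topic model consumes.
word_ids = [dictionary[w] for w in feature_words]
```

Because every dictionary entry comes from at least one feature word, V can never exceed N.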
In the embodiments of this application, the document to be classified is segmented with a text segmentation algorithm. The algorithm can be a hidden Markov model (HMM), the Viterbi algorithm, a conditional random field (CRF) model or a maximum entropy model. Many programming toolkits are based on these algorithms, and in practice such a toolkit can be loaded to segment text quickly.
One way to obtain the feature word set with a text segmentation algorithm and to build the dictionary of the feature word set is given below. The steps are:
1) Convert the document format. To make text processing easier, the documents can all be converted to plain text.
2) Call a segmentation tool. For example, the jieba segmentation tool, written in the Python programming language, can be called. This tool is based on the Viterbi algorithm and a hidden Markov model, and supports four segmentation modes: full mode, accurate mode, search engine mode, and a keyword extraction mode based on the term frequency-inverse document frequency (TF-IDF) algorithm. Each mode can be tried in turn, its segmentation efficiency measured, and the best mode selected on that basis.
3) Handle stop words. Segmenting the document to be classified yields a word list that contains stop words such as punctuation marks, conjunctions and modal particles, which carry no real meaning for text classification. A stop word list can be loaded and the stop words deleted by keyword matching, which yields the feature word set.
4) Build the dictionary of the feature word set. The feature words in the feature word set are added to the dictionary, in which each feature word appears only once. In this embodiment, an open-source tool can be called to add the words to the dictionary; for example, the gensim open-source toolkit can be used to add the feature words and obtain the dictionary of the feature word set.
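The four steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the application's implementation: the regular-expression tokenizer is a stand-in for a real segmenter (the text names jieba for that role), the hand-written dictionary builder stands in for a toolkit such as gensim, and the stop word list and sample sentence are invented.

```python
import re

# Hypothetical stop word list; a real one would be loaded from a file.
STOP_WORDS = {"the", "and", "a", "is", "of", "when", "to"}

def tokenize(text):
    """Stand-in segmenter: lowercase, then split on non-alphanumeric runs.
    For Chinese repair documents a real segmentation tool would replace this."""
    return [tok for tok in re.split(r"[^0-9a-z]+", text.lower()) if tok]

def extract_feature_words(text, stop_words=STOP_WORDS):
    """Steps 2 and 3: segment the plain text, then delete stop words by matching."""
    return [tok for tok in tokenize(text) if tok not in stop_words]

def build_dictionary(feature_words):
    """Step 4: map each distinct feature word to an id 1..V, in order of first use."""
    dictionary = {}
    for word in feature_words:
        dictionary.setdefault(word, len(dictionary) + 1)
    return dictionary

# Step 1 (format conversion) is assumed done: the document is already plain text.
document = "The clutch slips when shifting, and the gearbox is noisy. Replace the clutch."
feature_words = extract_feature_words(document)
dictionary = build_dictionary(feature_words)
```

Keeping segmentation, stop word removal and dictionary construction as separate functions makes it straightforward to swap in a different segmenter or stop word list when comparing segmentation modes.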
S103: calculate the topic probability from the feature word set and the word probabilities.
In the embodiments of this application, the topic probability can be calculated with an LDA model from the feature word set and the word probabilities. A word probability is the probability that a feature word in the feature word set semantically expresses the repair topic; the topic probability is the probability that the document to be classified corresponds to the repair topic.
The embodiments of this application mainly use an LDA model to calculate the word probabilities. The idea behind the LDA model is as follows:
The traditional way to judge whether two documents are related is to count the words they share, as in the TF-IDF method. That approach ignores the semantic relationship behind the text: two documents may share few words, or none, and still be semantically related. Suppose, for example, that there are two sentences, each standing for one document. The first is "Steve Jobs has left us" and the second is "Will the price of Apple phones drop?". The two sentences share no words, yet they are semantically related because both belong to the topic "Apple Inc.". Topic models, which obtain the topic of a document by mining its semantics, arose to handle this. The latent Dirichlet allocation (LDA) model is a three-layer Bayesian probabilistic model with a word layer, a topic layer and a document layer, and is a commonly used topic model that obtains the topic of a document to be classified by mining the semantics of the document.
The LDA model defines a document generation process with the following steps: (a) for each document, draw a topic from the document's topic distribution; (b) draw a word from the word distribution of the drawn topic; (c) repeat this process until every word in the document has been generated. The LDA model thus generates a document through the process doc → theme → word, where doc is the document, theme is the topic and word is the feature word. In the LDA model, doc contains M documents, theme contains K repair topics, word contains N feature words, and the dictionary corresponding to word contains V words; both the doc → theme and the theme → word processes follow a Dirichlet-multinomial distribution. The parameters of the LDA model are the document-topic distribution parameter $\theta$ and the topic-word distribution parameter $\varphi$. The document-topic distribution parameter $\theta$ holds, for each document in the document set, the probability that each topic in the topic set is the topic of that document; $\theta$ is a matrix with M rows and K columns. The topic-word distribution parameter $\varphi$ holds the probability that each feature word in the feature word set semantically expresses each topic in the topic set; $\varphi$ is a matrix with K rows and V columns.
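The doc → theme → word generation process can be sketched as follows. This is an illustrative assumption-laden example: the function name, the toy vocabulary and the fixed values of the document-topic row and the topic-word matrix are all hypothetical, and the two distributions are given directly rather than drawn from Dirichlet priors as a full LDA model would do.

```python
import random

def generate_document(theta_m, phi, vocab, length, rng):
    """Generate one document of `length` words.
    theta_m: topic probabilities for this document (length K).
    phi:     K x V matrix, phi[k][t] = probability of word t under topic k.
    For each position: draw a topic from theta_m, then a word from phi[topic]."""
    words, topics = [], []
    for _ in range(length):
        k = rng.choices(range(len(theta_m)), weights=theta_m)[0]  # step a
        t = rng.choices(range(len(vocab)), weights=phi[k])[0]     # step b
        topics.append(k)
        words.append(vocab[t])
    return words, topics                                          # step c: loop done

# Hypothetical toy model with K = 2 topics and V = 4 dictionary words.
vocab = ["clutch", "gearbox", "battery", "alternator"]
phi = [
    [0.45, 0.45, 0.05, 0.05],  # topic 0: drivetrain-like words
    [0.05, 0.05, 0.45, 0.45],  # topic 1: electrical-like words
]
theta_m = [0.8, 0.2]           # this document is mostly about topic 0

words, topics = generate_document(theta_m, phi, vocab, length=10, rng=random.Random(0))
```

Training and inference run this process in reverse: the words are observed, and the topic assignments and the two distributions have to be estimated.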
Training the topic-word distribution parameter $\varphi$ of the LDA model is the process of solving for the word probabilities. Refer to Fig. 2, which shows the flow of the method for calculating the word probabilities. The flow includes the following steps:
S105: obtain a training document set.
The training document set is a set of training documents, and each training document contains automobile repair information. In the embodiments of this application, previously saved training documents can be obtained from storage devices such as magnetic disks, optical discs or storage servers. To improve the accuracy of the LDA model, the number of training documents can be 5000 or more. After the training documents are obtained, they can be numbered 1 to M.
S106: obtain the training word set of the training document set.
The training word set is a set of training words, and the training words are the feature words of the training documents. The feature word set of each training document in the training document set is obtained, and the union of these feature word sets forms the training word set. The training words in the training word set are numbered 1 to N. For the method of obtaining the feature word set of each training document, see step S102.
S107: input the training word set and the repair topic into the LDA model and output the word probabilities.
Steps S101, S105 and S106 yield the repair topic set and the training word set. For a training document, the training words it contains are known, but neither the repair topic of the document nor the feature words that belong to each repair topic are known. A sampling algorithm can therefore be used to derive, from the correspondence between training documents and training words, the correspondence between training documents and repair topics and the correspondence between repair topics and training words. Candidate sampling algorithms include Metropolis-Hastings sampling, Markov chain Monte Carlo (MCMC) sampling, importance sampling and Gibbs sampling. Because Gibbs sampling offers good performance and accuracy, the embodiments of this application use the Gibbs sampling algorithm as the example. The Gibbs sampling formula is:
$$p(z_i = k \mid \vec{z}_{\neg i}, \vec{w}) \propto \theta_{m,k} \cdot \varphi_{k,t} = \frac{n_{m,\neg i}^{(k)} + \alpha_k}{\sum_{k=1}^{K} \left( n_{m,\neg i}^{(k)} + \alpha_k \right)} \cdot \frac{n_{k,\neg i}^{(t)} + \beta_t}{\sum_{t=1}^{V} \left( n_{k,\neg i}^{(t)} + \beta_t \right)}$$
The Gibbs sampling process is as follows. For each round t = 0, 1, 2, …, sample cyclically:
1) $x_1^{(t+1)} \sim p(x_1 \mid x_2^{(t)}, x_3^{(t)}, \ldots, x_n^{(t)})$
2) $x_2^{(t+1)} \sim p(x_2 \mid x_1^{(t+1)}, x_3^{(t)}, \ldots, x_n^{(t)})$
3) …
4) $x_j^{(t+1)} \sim p(x_j \mid x_1^{(t+1)}, \ldots, x_{j-1}^{(t+1)}, x_{j+1}^{(t)}, \ldots, x_n^{(t)})$
5) …
6) $x_n^{(t+1)} \sim p(x_n \mid x_1^{(t+1)}, x_2^{(t+1)}, \ldots, x_{n-1}^{(t+1)})$
In the Gibbs sampling formula, $p(z_i = k \mid \vec{z}_{\neg i}, \vec{w})$ is the probability that, given the observed words $\vec{w}$, the topic drawn for the current word during sampling is topic k, where the subscript $\neg i$ means that the current word is excluded from the counts; $n_m^{(k)}$ is the count of topic k in document m; $n_k^{(t)}$ is the count of word t in topic k; $\theta_{m,k}$ is the probability that topic k is the topic of document m; $\varphi_{k,t}$ is the probability that word t semantically expresses topic k; $\alpha_k$ is the Dirichlet distribution parameter of $\theta$; $\beta_t$ is the Dirichlet distribution parameter of $\varphi$; K is the number of topics in the topic set; V is the number of feature words in the dictionary. In the Gibbs sampling process, x denotes the object being sampled, with components $x_1, \ldots, x_n$; the superscript (t) marks the sampling round; p is the sampling probability. If the difference between the values of p in two consecutive rounds stays within a preset range, Gibbs sampling can be considered to have converged.
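The probability that the Gibbs sampler draws from for a single word can be sketched as below. The sketch assumes the standard collapsed form for LDA, in which the probability of topic k is proportional to a document-side factor built from the document-topic counts and a word-side factor built from the topic-word counts, each smoothed by its Dirichlet parameter. The function name, argument names and the toy counts are hypothetical.

```python
def topic_conditional(n_mk, n_kt, alpha, beta, m, t):
    """Normalized probability of each topic k for one word.
    n_mk[m][k]: words in document m assigned to topic k (current word excluded).
    n_kt[k][t]: times dictionary word t is assigned to topic k (current word excluded).
    alpha[k], beta[t]: Dirichlet parameters for the two distributions.
    m, t: document index and dictionary index of the current word (0-based)."""
    K, V = len(alpha), len(beta)
    doc_norm = sum(n_mk[m][k] + alpha[k] for k in range(K))
    weights = []
    for k in range(K):
        topic_norm = sum(n_kt[k][v] + beta[v] for v in range(V))
        doc_part = (n_mk[m][k] + alpha[k]) / doc_norm      # estimate of theta[m][k]
        word_part = (n_kt[k][t] + beta[t]) / topic_norm    # estimate of phi[k][t]
        weights.append(doc_part * word_part)
    total = sum(weights)
    return [w / total for w in weights]

# Toy counts: one document, K = 2 topics, V = 3 dictionary words.
n_mk = [[3, 1]]
n_kt = [[2, 1, 0], [0, 0, 1]]
probs = topic_conditional(n_mk, n_kt, alpha=[0.5, 0.5], beta=[0.1, 0.1, 0.1], m=0, t=0)
```

In this toy case word 0 has only ever been assigned to topic 0, and the document is dominated by topic 0, so both factors favor topic 0.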
Inputting the training word set and the repair topic into the LDA model and outputting the word probabilities may include the following steps:
1) Random initialization: for each word numbered w in the training word set, randomly assign a repair topic numbered z.
2) Scan the training words in the training word set. For each training word w, sample the repair topic of w according to the Gibbs sampling formula, and update the repair topic z assigned to w according to the sampling result.
3) Repeat step 2 until the sampling result converges. Convergence shows as the difference between two successive values of the Gibbs formula fluctuating within a preset range.
4) Count the topics assigned to the words in the training word set, and calculate the value of the topic-word distribution parameter $\varphi$ according to the formula
$$\varphi_{k,t} = \frac{n_k^{(t)} + \beta_t}{\sum_{t=1}^{V} \left( n_k^{(t)} + \beta_t \right)}$$
where $n_k^{(t)}$ is the count of feature word t in repair topic k; $\varphi_{k,t}$ is the probability that feature word t semantically expresses topic k; $\beta_t$ is the Dirichlet distribution parameter of $\varphi$; V is the length of the dictionary of the training word set.
5) Take the value of the topic-word distribution parameter $\varphi$ as the value of the word probabilities.
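Training steps 1 to 5 can be sketched as a small collapsed Gibbs sampler. This is a simplified illustration, not the application's implementation: it assumes symmetric Dirichlet parameters (one `alpha` and one `beta` for all topics and words), runs a fixed number of sweeps instead of testing convergence, and uses 0-based dictionary ids. All names and the toy corpus are hypothetical.

```python
import random

def train_lda(docs, K, V, alpha=0.5, beta=0.1, sweeps=200, seed=0):
    """docs: list of documents, each a list of dictionary ids in 0..V-1.
    Returns phi, a K x V matrix with phi[k][t] = probability of word t under topic k."""
    rng = random.Random(seed)
    n_mk = [[0] * K for _ in docs]      # topic counts per document
    n_kt = [[0] * V for _ in range(K)]  # word counts per topic
    n_k = [0] * K                       # total words per topic
    z = []                              # topic assignment of every word
    # Step 1: random initialization.
    for m, doc in enumerate(docs):
        z.append([rng.randrange(K) for _ in doc])
        for t, k in zip(doc, z[m]):
            n_mk[m][k] += 1; n_kt[k][t] += 1; n_k[k] += 1
    # Steps 2 and 3: rescan all words repeatedly (fixed sweep count, an assumption).
    for _ in range(sweeps):
        for m, doc in enumerate(docs):
            for i, t in enumerate(doc):
                k = z[m][i]
                # Exclude the current word from the counts.
                n_mk[m][k] -= 1; n_kt[k][t] -= 1; n_k[k] -= 1
                # Unnormalized conditional; the document-side denominator is
                # the same for every topic, so it is dropped.
                weights = [(n_mk[m][j] + alpha) * (n_kt[j][t] + beta) / (n_k[j] + V * beta)
                           for j in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[m][i] = k
                n_mk[m][k] += 1; n_kt[k][t] += 1; n_k[k] += 1
    # Steps 4 and 5: topic-word probabilities from the final counts.
    return [[(n_kt[k][t] + beta) / (n_k[k] + V * beta) for t in range(V)]
            for k in range(K)]

# Toy corpus: words 0 and 1 co-occur, and words 2 and 3 co-occur.
docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3], [0, 0, 1, 1, 0], [3, 2, 2, 3, 2]]
phi = train_lda(docs, K=2, V=4)
```

Each row of `phi` is a probability distribution over the dictionary. Which row ends up representing which group of words depends on the random initialization, so topic indices carry no meaning by themselves.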
Steps 1 to 5 above determine the topic-word distribution parameter $\varphi$ of the LDA model, which is the word probability of the training document set. In the embodiments of this application, the word probabilities of the document to be classified can be approximated by the word probabilities of the training document set.
Steps S105 to S107 give the process for calculating the word probabilities. The method for calculating the topic probability from the feature word set and the word probabilities is explained next.
The topic probability is the probability that the document to be classified corresponds to the repair topic. Calculating the topic probability from the feature word set and the word probabilities comprises: inputting the feature word set and the word probabilities into the LDA model and outputting the topic probability. Calculating the topic probability with the LDA model may include the following steps:
1) Random initialization: for each word numbered w in the feature word set, randomly assign a repair topic numbered z.
2) Substitute the word probabilities for the parameter $\varphi$ in the Gibbs sampling formula.
3) Scan the feature words in the feature word set. For each feature word w, sample the topic of w according to the Gibbs sampling formula, and update the repair topic z assigned to w according to the sampling result.
4) Repeat step 3 until the sampling result converges. Convergence shows as the difference between two successive values of the Gibbs formula fluctuating within a preset range.
5) Count the topics assigned to the feature words, and calculate the document-topic distribution parameter $\theta$ of the LDA model according to the formula
$$\theta_{m,k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{k=1}^{K} \left( n_m^{(k)} + \alpha_k \right)}$$
where $\theta_{m,k}$ is the probability that topic k is the topic of document m (if there is only one document to be classified, m is always 1); $n_m^{(k)}$ is the count of topic k in document m; $\alpha_k$ is the Dirichlet distribution parameter of $\theta$; K is the number of topics in the topic set.
6) Take the document-topic distribution parameter $\theta$ as the topic probability.
Steps 1 to 6 above determine the document-topic distribution parameter $\theta$ of the LDA model. For a single document to be classified, $\theta$ is a vector of length K that holds the probability that each repair topic in the repair topic set corresponds to the document, so it contains the topic probability of the repair topic in question.
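Inference steps 1 to 6 for a single document can be sketched as follows. The sketch holds the topic-word probabilities fixed at previously trained values and resamples only the topic assignments of the new document. It makes simplifying assumptions: a symmetric Dirichlet parameter `alpha`, a fixed number of sweeps instead of a convergence test, and 0-based dictionary ids. The function name and the toy inputs are hypothetical.

```python
import random

def infer_theta(word_ids, phi, alpha=0.5, sweeps=100, seed=0):
    """word_ids: dictionary ids (0-based) of the feature words of one document.
    phi: trained K x V topic-word probabilities, held fixed during inference.
    Returns theta, a length-K list of topic probabilities for the document."""
    rng = random.Random(seed)
    K = len(phi)
    # Step 1: random initialization of the topic of every feature word.
    z = [rng.randrange(K) for _ in word_ids]
    n_k = [0] * K
    for k in z:
        n_k[k] += 1
    # Steps 2 to 4: Gibbs sweeps with phi substituted as a known quantity.
    for _ in range(sweeps):
        for i, t in enumerate(word_ids):
            n_k[z[i]] -= 1  # exclude the current word
            weights = [(n_k[k] + alpha) * phi[k][t] for k in range(K)]
            z[i] = rng.choices(range(K), weights=weights)[0]
            n_k[z[i]] += 1
    # Steps 5 and 6: smoothed topic proportions are the topic probabilities.
    total = len(word_ids) + K * alpha
    return [(n_k[k] + alpha) / total for k in range(K)]

# Toy model: topic 0 favors words 0 and 1, topic 1 favors words 2 and 3.
phi = [[0.45, 0.45, 0.05, 0.05],
       [0.05, 0.05, 0.45, 0.45]]
theta = infer_theta([0, 1, 0, 0, 1, 2], phi)
```

Because the result is built from sampled assignments, repeated runs with different seeds give slightly different values of `theta`; averaging over several sweeps or runs is a common way to smooth this, though the text does not prescribe it.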
S104: According to the above theme probability, determine whether the above maintenance theme is a theme of the above document to be sorted.
Step S103 has calculated the theme probability of the document to be sorted. In the embodiment of the present application, it may be specified that if the theme probability is greater than a preset value, the maintenance theme corresponding to that theme probability is a maintenance theme of the document to be sorted; alternatively, if the theme probability is the maximum value in the vector $\vec{\theta}_m$, the maintenance theme corresponding to that theme probability is the theme of the document to be sorted.
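The two decision rules of step S104 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `select_themes`, the 0-based theme numbering, and the example values are assumptions.

```python
def select_themes(theta, threshold=None):
    """Pick theme indices for a document from its theme probabilities.

    theta: list of theme probabilities, one entry per maintenance theme.
    threshold: hypothetical preset value; if given, every theme whose
        probability exceeds it is selected. If omitted, only the theme
        with the maximum probability is selected.
    """
    if threshold is not None:
        # Rule 1: theme probability greater than the preset value.
        return [k for k, p in enumerate(theta) if p > threshold]
    # Rule 2: theme probability is the maximum of the vector.
    best = max(range(len(theta)), key=lambda k: theta[k])
    return [best]


if __name__ == "__main__":
    theta = [0.5, 0.3, 0.2]  # illustrative values only
    print(select_themes(theta, threshold=0.25))
    print(select_themes(theta))
```

The threshold rule can label one document with several maintenance themes, or with none, whereas the maximum rule always returns exactly one.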
By implementing the embodiment of the present application, a document to be sorted and a maintenance theme can be obtained, the feature word set of the document to be sorted can be extracted, a theme probability can be calculated according to the feature word set and the vocabulary probability, and whether the maintenance theme is the theme of the auto repair document can be determined according to the theme probability. It can be seen that the embodiment of the present application makes it possible to identify the theme of an auto repair document accurately, which improves the efficiency of distinguishing auto repair documents and saves the time of maintenance technicians.
Fig. 3 is a structural schematic diagram of a device for determining an auto repair document theme provided by the embodiments of the present application. The device may include:
First acquisition unit 301, configured to obtain a document to be sorted and a maintenance theme, where the document to be sorted contains auto repair information and the maintenance theme is a theme relevant to auto repair;
Second acquisition unit 302, configured to obtain the feature word set of the document to be sorted;
Theme probability calculation unit 303, configured to calculate a theme probability according to the feature word set and a vocabulary probability;
Determination unit 304, configured to determine, according to the theme probability, whether the maintenance theme is a theme of the document to be sorted.
In the embodiment of the present application, the above first acquisition unit 301 is specifically configured to receive maintenance data to be sorted and to convert the maintenance data into the document to be sorted. Specifically, first acquisition unit 301 may receive maintenance data from a user or a maintenance technician through a maintenance data receiving channel. For example, a maintenance data upload channel is set up in the application software of a mobile phone, and a car owner can upload maintenance data through this channel. The upload channel may be placed in the application software of mobile terminals such as mobile phones, tablet computers and wearable devices, or in the application software of PCs such as laptops and desktop computers. After obtaining the maintenance data, first acquisition unit 301 may also save the maintenance data as a service document; the format of the service document includes plain text format, PDF format, DOCX format, and so on.
In one possible implementation, the above first acquisition unit 301 may also obtain saved service documents from storage devices such as magnetic disks, optical discs and storage servers. First acquisition unit 301 may obtain the maintenance theme; optionally, first acquisition unit 301 may also merge maintenance themes of the same type into a maintenance theme set and number the maintenance themes in the set from 1 to Z, where Z is the length of the maintenance theme set.
In the embodiment of the present application, the above second acquisition unit 302 is specifically configured to segment the document to be sorted into words by a text segmentation method, obtaining the word set of the document to be sorted, and to delete the stop words in the word set, obtaining the feature word set.
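The segmentation and stop-word removal performed by second acquisition unit 302 can be sketched as below. This is only an illustration under stated assumptions: the application does not name a segmentation method, so a simple regular-expression tokenizer for English text stands in for it (Chinese text would need a dedicated word segmenter), and the stop-word list `STOP_WORDS` and the function names are hypothetical.

```python
import re

# Hypothetical stop-word list; a real deployment would load a fuller one.
STOP_WORDS = {"the", "a", "an", "is", "and", "of", "to", "in"}


def segment(text):
    """Split text into lowercase word tokens (stand-in for a text segmenter)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_feature_words(text, stop_words=STOP_WORDS):
    """Return the feature words: the segmented words minus stop words."""
    return [word for word in segment(text) if word not in stop_words]


if __name__ == "__main__":
    print(extract_feature_words("The engine is leaking oil"))
```

Repeated words are kept in the result because the later sampling steps count every occurrence of a feature word.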
In one possible implementation, the above second acquisition unit 302 is further configured to number the feature words in the above feature word set; second acquisition unit 302 is further configured to add the feature words to a dictionary and to number the feature words in the dictionary.
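Numbering feature words through a dictionary can be sketched as follows. The function names and the 0-based numbering are assumptions made for illustration; the application does not state a numbering origin for feature words.

```python
def build_dictionary(words):
    """Assign each distinct feature word a number in order of first appearance."""
    dictionary = {}
    for word in words:
        if word not in dictionary:
            dictionary[word] = len(dictionary)
    return dictionary


def encode(words, dictionary):
    """Replace feature words by their numbers, skipping words not in the dictionary."""
    return [dictionary[word] for word in words if word in dictionary]


if __name__ == "__main__":
    vocab = build_dictionary(["engine", "oil", "engine", "brake"])
    print(vocab)
    print(encode(["oil", "engine", "clutch"], vocab))
```

Skipping unknown words in `encode` is one possible design choice; it lets a document to be sorted be encoded against a dictionary built earlier from training documents.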
As shown in Fig. 4, the above device further includes:
Third acquisition unit 305, configured to obtain a training document set, where the training document set is a set of training documents and the training documents contain auto repair information;
Fourth acquisition unit 306, configured to obtain the training word set of the training document set, where the training word set is a set of training words and the training words are the feature words of the training documents;
Vocabulary probability calculation unit 307, configured to input the training word set and the maintenance theme into the LDA model and output the vocabulary probability.
In the embodiment of the present application, the above third acquisition unit 305 may also be used to number the training words of the training document set. The above fourth acquisition unit 306 may also be used to number the feature words in the training word set; fourth acquisition unit 306 may also be used to build a dictionary of the training word set and to number the feature words in the dictionary.
In the embodiment of the present application, the above vocabulary probability calculation unit 307 is specifically configured to execute the method in step S107 and determine the vocabulary probability of the above feature word set.
In the embodiment of the present application, the above theme probability calculation unit 303 is specifically configured to input the feature word set and the vocabulary probability into the LDA model and output the theme probability.
It can be seen that the above device for determining an auto repair document theme can obtain a document to be sorted and a maintenance theme, extract the feature word set of the document to be sorted, calculate a theme probability according to the feature word set and the vocabulary probability, and determine according to the theme probability whether the maintenance theme is the theme of the auto repair document. The device makes it possible to identify the theme of an auto repair document accurately, which improves the efficiency of distinguishing auto repair documents and saves the time of maintenance technicians.
Referring to Fig. 5, Fig. 5 is a structural schematic diagram of another device for determining an auto repair document theme provided by the embodiments of the present application. The device includes: at least one processor 501, such as a central processing unit (CPU), at least one network interface 502, user interface 503, memory 504, Database Unit 505, and at least one communication bus 506. Communication bus 506 may be a group of parallel data lines that can carry address, data and control signals, and is used to realize connection and communication between these components. User interface 503 may include a display screen (display), a keyboard (keyboard), and the like. Memory 504 may be a high-speed random access memory (RAM) or a non-volatile memory, for example at least one read-only memory (ROM). Memory 504 may optionally also be at least one storage device located remotely from the aforementioned processor 501. As shown in Fig. 5, memory 504, as a computer storage medium, may include an operating system, a network communication module, a user interface module and a data processing program.
Network interface 502 is mainly used to connect to a client for data communication; processor 501 may be used to call the data processing program stored in memory 504 and perform the following operations:
1) Receive, through network interface 502, the maintenance data sent by the client, and convert it into a document to be sorted.
2) Store the maintenance data received by network interface 502 in Database Unit 505, and save the maintenance data in Database Unit 505 as a document to be sorted.
4) Obtain the feature word set of the document to be sorted, which specifically includes: segmenting the document to be sorted into words by a text segmentation method to obtain the word set of the document to be sorted; and deleting the stop words in the word set to obtain the feature word set.
5) Calculate the theme probability according to the feature word set and the vocabulary probability, which specifically includes: inputting the above feature word set and the above vocabulary probability into the LDA model and outputting the above theme probability.
6) Obtain a training document set from Database Unit 505, where the training document set is a set of training documents and the training documents contain auto repair information; obtain the training word set of the training document set, where the training word set is the set of feature words of the training document set; input the training word set and the above maintenance theme into the LDA model and output the above vocabulary probability.
In the embodiment of the present application, the data processing program stored in memory 504 includes a program for text segmentation, and the above processor 501 calls this program to segment the document to be sorted and the training documents. In one possible implementation, processor 501 may also call the data processing program in memory 504 to number the feature word set and the training word set; processor 501 may also call the data processing program in memory 504 to build the dictionary of the training word set and the dictionary of the feature word set.
In the embodiment of the present application, the above user interface 503 includes a display screen and a keyboard and is used for interacting with the user; the network communication module in the above memory is used for network communication with a client or a server.
In the embodiment of the present application, the above processor 501 may call the data processing program stored in the above memory 504 to obtain the maintenance theme; optionally, processor 501 may also call the data processing program stored in memory 504 to merge maintenance themes of the same type into a maintenance theme set and number the maintenance themes in the set from 1 to Z, where Z is the length of the maintenance theme set.
In the embodiment of the present application, the above processor 501 may call the data processing program stored in the above memory 504 to input the above feature word set and the above vocabulary probability into the LDA model and output the above theme probability, which specifically includes the following operations:
1) Random initialization: for each word numbered w in the above feature word set, randomly assign a maintenance theme numbered z;
2) Substitute the above vocabulary probability into the parameters of the Gibbs sampling formula;
3) Scan the feature words in the feature word set; for each feature word w, sample the theme corresponding to w according to the Gibbs sampling formula, and update the maintenance theme z corresponding to that feature word according to the sampling result;
4) Repeat step 3 until the sampling result converges; convergence means that the difference between the results of two successive Gibbs sampling passes fluctuates within a preset range;
5) Count the themes corresponding to the feature words, and calculate the document-theme distribution parameter $\vec{\theta}_m$ of the LDA model according to the formula
$$\theta_{m,k} = \frac{n_m^{(k)} + \alpha_k}{\sum_{k=1}^{K}\left(n_m^{(k)} + \alpha_k\right)}$$
where $\theta_{m,k}$ is the probability that the k-th theme is the theme of document m (if there is only one document to be sorted, m is always 1); $n_m^{(k)}$ is the count of theme k in the document numbered m; $\alpha_k$ is the corresponding Dirichlet distribution parameter; K is the number of themes in the theme set;
6) Take the above document-theme distribution parameter $\vec{\theta}_m$ as the above theme probability.
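The six operations above can be sketched in Python for a single document. This is a hedged illustration rather than the applicant's implementation. The Gibbs sampling formula itself is not reproduced in this passage, so the sketch assumes the common form in which the probability of theme k for a word t is proportional to the fixed vocabulary probability `phi[k][t]` times the smoothed theme count of the document. A fixed number of sweeps (`n_sweeps`) replaces the convergence test, and all names are hypothetical.

```python
import random


def infer_theme_probability(word_ids, phi, alpha, n_sweeps=200, seed=0):
    """Estimate the document-theme distribution of one document.

    word_ids: numbered feature words of the document to be sorted.
    phi: vocabulary probability, phi[k][t] = probability of word t under theme k.
    alpha: list of Dirichlet parameters, one per theme.
    """
    rng = random.Random(seed)
    num_themes = len(phi)
    # Step 1: random initialization of a theme for every feature word.
    z = [rng.randrange(num_themes) for _ in word_ids]
    counts = [0] * num_themes
    for k in z:
        counts[k] += 1
    # Steps 2 to 4: repeated scans with phi held fixed (assumed sampling form).
    for _ in range(n_sweeps):
        for i, t in enumerate(word_ids):
            counts[z[i]] -= 1  # remove the current assignment of word i
            weights = [phi[k][t] * (counts[k] + alpha[k]) for k in range(num_themes)]
            z[i] = rng.choices(range(num_themes), weights=weights)[0]
            counts[z[i]] += 1
    # Steps 5 and 6: theta_k = (n_k + alpha_k) / sum over k of (n_k + alpha_k).
    total = sum(counts[k] + alpha[k] for k in range(num_themes))
    return [(counts[k] + alpha[k]) / total for k in range(num_themes)]


if __name__ == "__main__":
    phi = [[0.9, 0.1], [0.1, 0.9]]  # illustrative two-theme, two-word model
    print(infer_theme_probability([0] * 20, phi, alpha=[1.0, 1.0]))
```

Because `phi` stays fixed, only the theme counts of the document change between sweeps, which keeps the per-document cost small.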
The above processor 501 may also call the data processing program stored in the above memory 504 to input the above training word set and the above maintenance theme into the LDA model and output the above vocabulary probability, which specifically includes the following operations:
1) Random initialization: for each word numbered w in the training word set, randomly assign a maintenance theme numbered z;
2) Scan the training words in the training word set; for each training word w, sample the maintenance theme corresponding to w according to the Gibbs sampling formula, and update the maintenance theme z corresponding to that training word according to the sampling result;
3) Repeat step 2 until the sampling result converges; convergence means that the difference between the results of two successive Gibbs sampling passes fluctuates within a preset range;
4) Count the themes assigned to the words in the training word set, and calculate the value of the above theme-vocabulary distribution parameter $\vec{\varphi}_k$ according to the formula
$$\varphi_{k,t} = \frac{n_k^{(t)} + \beta_t}{\sum_{t=1}^{V}\left(n_k^{(t)} + \beta_t\right)}$$
where $n_k^{(t)}$ is the count of the feature word numbered t under the maintenance theme numbered k; $\varphi_{k,t}$ is the probability that the feature word numbered t semantically expresses the theme numbered k; $\beta_t$ is the corresponding Dirichlet distribution parameter; V is the length of the dictionary of the training word set;
5) Take the value of the above theme-vocabulary distribution parameter $\vec{\varphi}_k$ as the value of the above vocabulary probability.
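The training operations above can be sketched as a collapsed Gibbs sampler. This is a minimal illustration under assumptions, not the applicant's implementation: the sampling formula is not reproduced in this passage, so the sketch uses the standard collapsed form for LDA with symmetric parameters `alpha` and `beta`, runs a fixed number of sweeps instead of testing convergence, and uses hypothetical names throughout.

```python
import random


def train_vocabulary_probability(docs, num_themes, vocab_size,
                                 alpha=0.1, beta=0.01, n_sweeps=200, seed=0):
    """Return phi, where phi[k][t] estimates the probability of word t under theme k.

    docs: list of training documents, each a list of numbered training words.
    """
    rng = random.Random(seed)
    n_mk = [[0] * num_themes for _ in docs]                 # theme counts per document
    n_kt = [[0] * vocab_size for _ in range(num_themes)]    # word counts per theme
    n_k = [0] * num_themes                                  # total words per theme
    z = []
    # Step 1: random initialization of a theme for every training word.
    for m, doc in enumerate(docs):
        z.append([rng.randrange(num_themes) for _ in doc])
        for t, k in zip(doc, z[m]):
            n_mk[m][k] += 1
            n_kt[k][t] += 1
            n_k[k] += 1
    # Steps 2 and 3: repeated scans over all training words.
    for _ in range(n_sweeps):
        for m, doc in enumerate(docs):
            for i, t in enumerate(doc):
                k = z[m][i]
                n_mk[m][k] -= 1
                n_kt[k][t] -= 1
                n_k[k] -= 1
                weights = [
                    (n_kt[j][t] + beta) / (n_k[j] + vocab_size * beta)
                    * (n_mk[m][j] + alpha)
                    for j in range(num_themes)
                ]
                k = rng.choices(range(num_themes), weights=weights)[0]
                z[m][i] = k
                n_mk[m][k] += 1
                n_kt[k][t] += 1
                n_k[k] += 1
    # Steps 4 and 5: phi[k][t] = (n_kt + beta) / sum over t of (n_kt + beta).
    return [
        [(n_kt[k][t] + beta) / (n_k[k] + vocab_size * beta) for t in range(vocab_size)]
        for k in range(num_themes)
    ]


if __name__ == "__main__":
    docs = [[0, 0, 1], [2, 3, 3]]  # illustrative numbered training documents
    for row in train_vocabulary_probability(docs, num_themes=2, vocab_size=4):
        print(row)
```

Each row of the returned matrix is a probability distribution over the dictionary, so it can be passed directly as the fixed vocabulary probability when a new document is sorted.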
The above processor 501 may also be used to call the data processing program stored in memory 504 to determine whether the above maintenance theme is a theme of the document to be sorted.
It can be seen that the above device for determining an auto repair document theme can obtain a document to be sorted and a maintenance theme, extract the feature word set of the document to be sorted, calculate a theme probability according to the feature word set and the vocabulary probability, and determine according to the theme probability whether the maintenance theme is the theme of the auto repair document. The device makes it possible to identify the theme of an auto repair document accurately, which improves the efficiency of distinguishing auto repair documents and saves the time of maintenance technicians.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments can be completed by a program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and the storage medium includes read-only memory (ROM), random access memory (RAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium that can be used to carry or store data.
The evaluation method and equipment of an automobile service factory disclosed in the embodiments of the present application have been described in detail above. Specific examples are used herein to illustrate the principle and implementation of the present application, and the description of the above embodiments is only intended to help in understanding the method of the present application and its core idea. At the same time, for those skilled in the art, there will be changes in the specific implementation and the scope of application according to the idea of the present application. In conclusion, the content of this specification should not be construed as a limitation on the present application.
Claims (10)
1. A method for determining an auto repair document theme, characterized by comprising:
obtaining a document to be sorted and a maintenance theme, wherein the document to be sorted contains auto repair information and the maintenance theme is a theme relevant to auto repair;
obtaining the feature word set of the document to be sorted, wherein the feature word set is the set of feature words of the document to be sorted;
calculating a theme probability according to the feature word set and a vocabulary probability, wherein the vocabulary probability is the probability that each feature word in the feature word set semantically expresses the maintenance theme, and the theme probability is the probability that the document to be sorted corresponds to the maintenance theme;
determining, according to the theme probability, whether the maintenance theme is a theme of the document to be sorted.
2. The method according to claim 1, characterized in that the obtaining of the document to be sorted and the maintenance theme comprises:
receiving maintenance data to be sorted; and converting the maintenance data into the document to be sorted.
3. The method according to claim 1, characterized in that the obtaining of the feature word set of the document to be sorted comprises:
segmenting the document to be sorted into words by a text segmentation method to obtain the word set of the document to be sorted;
deleting the stop words in the word set to obtain the feature word set.
4. The method according to claim 1, characterized in that after the obtaining of the document to be sorted and the maintenance theme, and before the calculating of the theme probability according to the feature word set and the vocabulary probability, the method further comprises:
obtaining a training document set, wherein the training document set is a set of training documents and the training documents contain auto repair information;
obtaining the training word set of the training document set, wherein the training word set is the set of feature words of the training document set;
inputting the training word set and the maintenance theme into a latent Dirichlet allocation (LDA) model and outputting the vocabulary probability.
5. The method according to claim 1 or 4, characterized in that the calculating of the theme probability according to the feature word set and the vocabulary probability comprises:
inputting the feature word set and the vocabulary probability into a latent Dirichlet allocation (LDA) model and outputting the theme probability.
6. A device for determining an auto repair document theme, characterized by comprising:
a first acquisition unit, configured to obtain a document to be sorted and a maintenance theme, wherein the document to be sorted contains auto repair information and the maintenance theme is a theme relevant to auto repair;
a second acquisition unit, configured to obtain the feature word set of the document to be sorted, wherein the feature word set is the set of feature words of the document to be sorted;
a theme probability calculation unit, configured to calculate a theme probability according to the feature word set and a vocabulary probability, wherein the vocabulary probability is the probability that each feature word in the feature word set semantically expresses the maintenance theme, and the theme probability is the probability that the document to be sorted corresponds to the maintenance theme;
a determination unit, configured to determine, according to the theme probability, whether the maintenance theme is a theme of the document to be sorted.
7. The device according to claim 6, characterized by further comprising:
a third acquisition unit, configured to obtain a training document set, wherein the training document set is a set of training documents and the training documents contain auto repair information;
a fourth acquisition unit, configured to obtain the training word set of the training document set, wherein the training word set is the set of feature words of the training document set;
a vocabulary probability calculation unit, configured to input the training word set and the maintenance theme into a latent Dirichlet allocation (LDA) model and output the vocabulary probability.
8. The device according to claim 6 or 7, characterized in that the calculation unit is specifically configured to input the feature word set and the vocabulary probability into a latent Dirichlet allocation (LDA) model and output the theme probability.
9. A device for determining an auto repair document theme, characterized by comprising a processor, a memory, a Database Unit, a network interface, a communication bus and a user interface; the processor, the memory, the Database Unit, the network interface and the user interface are connected to each other through the communication bus; wherein the memory is used to store a computer program, the computer program comprises program instructions, and the processor is configured to call the program instructions to execute the method for determining an auto repair document theme according to any one of claims 1~6.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program comprises program instructions, and the program instructions, when executed by a processor, cause the processor to execute the method for determining an auto repair document theme according to any one of claims 1~5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811075837.7A CN109446318A (en) | 2018-09-14 | 2018-09-14 | A kind of method and relevant device of determining auto repair document subject matter |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811075837.7A CN109446318A (en) | 2018-09-14 | 2018-09-14 | A kind of method and relevant device of determining auto repair document subject matter |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109446318A (en) | 2019-03-08 |
Family ID: 65532568
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811075837.7A Pending CN109446318A (en) | 2018-09-14 | 2018-09-14 | A kind of method and relevant device of determining auto repair document subject matter |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109446318A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140278291A1 (en) * | 2013-03-14 | 2014-09-18 | Microsoft Corporation | Discovering functional groups |
CN107193892A (en) * | 2017-05-02 | 2017-09-22 | 东软集团股份有限公司 | A kind of document subject matter determines method and device |
US20180032600A1 (en) * | 2016-08-01 | 2018-02-01 | International Business Machines Corporation | Phenomenological semantic distance from latent dirichlet allocations (lda) classification |
CN107832298A (en) * | 2017-11-16 | 2018-03-23 | 北京百度网讯科技有限公司 | Method and apparatus for output information |
CN108399228A (en) * | 2018-02-12 | 2018-08-14 | 平安科技(深圳)有限公司 | Article sorting technique, device, computer equipment and storage medium |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110717038A (en) * | 2019-09-17 | 2020-01-21 | 腾讯科技(深圳)有限公司 | Object classification method and device |
CN113704471A (en) * | 2021-08-26 | 2021-11-26 | 唯品会(广州)软件有限公司 | Statement classification method, device, equipment and storage medium |
CN113704471B (en) * | 2021-08-26 | 2024-02-02 | 唯品会(广州)软件有限公司 | Sentence classification method, sentence classification device, sentence classification equipment and sentence classification storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190308 |