CN109508378A - A kind of sample data processing method and processing device - Google Patents
A kind of sample data processing method and processing device
- Publication number
- CN109508378A (application number CN201811421160.8A)
- Authority
- CN
- China
- Prior art keywords
- keyword
- words
- bag
- word
- element set
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The embodiments of the present application disclose a sample data processing method and device suitable for training machine learning models for short text classification. The method comprises: obtaining the word segmentation result of a short text sample after segmentation, and obtaining a keyword bag of words comprising N keywords; determining a first element set according to the word segmentation result and the keyword bag of words; obtaining a target word in the word segmentation result, and obtaining the similarity value between the target word and each keyword in the keyword bag of words; and, when the similarity value between the target word and a first keyword in the keyword bag of words is greater than a similarity threshold, updating the first element in the first element set according to the similarity value between the target word and the first keyword, to obtain a second element set. Each element in the second element set is used to construct a machine learning model for short text classification. With the embodiments of the present application, the performance of a machine learning model constructed using the element set can be improved.
Description
Technical field
This application relates to the field of computer technology, and in particular to a sample data processing method and device.
Background technique
With the development of applications such as microblogs, social networking sites and service hotlines, more and more information is presented in the form of short text, and its volume is growing explosively. Text classification can help people quickly and effectively obtain key information from massive data. The accuracy of text classification depends on the performance of the machine learning model, and the performance of the machine learning model in turn depends on the sample data.
Existing text sample data processing methods are mostly based on the keyword bag-of-words (Bag of Words) model. Such methods usually achieve good results on long text, but on short text they are usually ineffective and of low quality. The main reason is that, compared with long text, short text has sparse features and an ambiguous theme. First, because of the limited length of short text, there are few feature words, while the sample data generated with a keyword bag of words is high-dimensional, which increases the difficulty of text processing. Second, in long text, words related to the theme usually appear many times, so the main content of the whole text can be judged from them; in short text, by contrast, the main content cannot be judged from word frequency. For example, in the short text "inquiring about a badminton-themed restaurant", "badminton" and "restaurant" have the same word frequency, yet the theme of the text is clearly "restaurant", so in text classification it should be assigned to the "catering" category rather than the "sports" category. It can be seen that existing sample data processing methods cannot represent short text well.
In summary, because short text has the above characteristics of sparse features and an ambiguous theme, the performance of a machine learning model for text classification constructed from sample data obtained by existing sample data processing methods is poor, and the accuracy of text classification is low.
Summary of the invention
The embodiments of the present application provide a sample data processing method and device, which can increase the amount of information in short text sample data, improve the performance of a machine learning model constructed with the short text sample data, and thereby improve the accuracy of short text classification.
In a first aspect, an embodiment of the present application provides a sample data processing method, the method comprising:
obtaining the word segmentation result obtained after a short text sample is segmented, the word segmentation result comprising at least one word, and obtaining a keyword bag of words comprising N keywords;
determining a first element set according to the word segmentation result and the keyword bag of words, the first element set comprising N elements, where the value of each element in the first element set is the number of times the corresponding keyword in the keyword bag of words appears in the word segmentation result;
obtaining a target word in the word segmentation result, the target word being a word that exists in the word segmentation result but not in the keyword bag of words, and obtaining the similarity value between the target word and each keyword in the keyword bag of words;
if the similarity value between the target word and a first keyword in the keyword bag of words is greater than a similarity threshold, updating the first element corresponding to the first keyword in the first element set according to the similarity value between the target word and the first keyword, to obtain a second element set;
wherein each element in the second element set is used to construct a machine learning model for short text classification.
With reference to the first aspect, in a possible embodiment, obtaining the keyword bag of words comprising N keywords comprises: obtaining a training sample set for generating keyword bags of words; determining N keywords from the training sample set according to the term frequency-inverse document frequency algorithm; and generating the keyword bag of words according to the N keywords.
With reference to the first aspect, in a possible embodiment, the first element set is a first vector comprising N elements, and the first vector satisfies:
V1 = [x1, x2, ..., xN]
where V1 denotes the first vector, xn denotes the number of times the n-th keyword in the keyword bag of words appears in the word segmentation result, and n ranges over the natural numbers from 1 to N.
With reference to the first aspect, in a possible embodiment, obtaining the similarity value between the target word and each keyword in the keyword bag of words comprises: obtaining the word vector of the target word from a word vector database, and obtaining the word vector of each keyword in the keyword bag of words from the word vector database; and calculating the similarity value between the word vector of the target word and the word vector of each keyword.
If the similarity value between the target word and a first keyword in the keyword bag of words is greater than the similarity threshold, updating the first element corresponding to the first keyword in the first element set according to the similarity value between the target word and the first keyword to obtain a second element set comprises:
if the similarity value between the word vector of the target word and the word vector of the first keyword in the keyword bag of words is greater than the similarity threshold, calculating the product of the similarity value and the number of times the target word appears in the word segmentation result; and updating the first element corresponding to the first keyword in the first element set to the sum of the product and the first element, to obtain the second element set.
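A minimal sketch of this update step follows (names and the similarity function are hypothetical; the sketch reads the update as applying to every keyword whose similarity to the target word exceeds the threshold):

```python
def update_element_set(first_set, keyword_bag, target_word, target_count,
                       similarity, threshold):
    """Second-element-set update sketch: for every keyword whose similarity
    to the target word exceeds the threshold, add
    similarity x (occurrences of the target word) to that keyword's element.
    `similarity` is a hypothetical (word, keyword) -> float function."""
    second_set = list(first_set)
    for index, keyword in enumerate(keyword_bag):
        value = similarity(target_word, keyword)
        if value > threshold:
            # product of similarity value and target-word count, added in place
            second_set[index] += value * target_count
    return second_set
```

For a target word similar only to "weather", only the "weather" element grows; all other elements keep their original counts.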
With reference to the first aspect, in a possible embodiment, after the second element set is obtained, the method further comprises: obtaining M second keywords corresponding to M zero elements in the second element set; obtaining at least one third keyword in the keyword bag of words, where the number of times the third keyword appears in the word segmentation result is greater than or equal to 1; and replacing the zero elements in the second element set according to the similarity values between each of the M second keywords and each third keyword in the keyword bag of words, to obtain a third element set. Each element in the third element set is used to construct the machine learning model for short text classification.
With reference to the first aspect, in a possible embodiment, replacing the zero elements in the second element set according to the similarity values between each of the M second keywords and each third keyword in the keyword bag of words to obtain a third element set comprises:
obtaining the word vector of each of the M second keywords from the word vector database, and obtaining the word vector of each of the at least one third keyword in the keyword bag of words from the word vector database; obtaining the similarity value between the word vector of any second keyword i among the M second keywords and the word vector of each of the at least one third keyword; and, if the similarity value between the word vector of the second keyword i and the word vector of a third keyword m among the at least one third keyword is the largest, replacing the zero element corresponding to the second keyword i in the second element set with the product of the similarity value between the second keyword i and the third keyword m and the number of times the third keyword m appears in the word segmentation result, to obtain the third element set.
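The zero-element replacement just described can be sketched as follows (a sketch under stated assumptions; the `similarity` function and all names are hypothetical):

```python
def fill_zero_elements(counts, keyword_bag, similarity):
    """Third-element-set sketch: each zero element (a second keyword) is
    replaced by max-similarity x count of the most similar keyword that
    does occur in the segmentation result (a third keyword).
    `similarity` is a hypothetical (keyword, keyword) -> float function."""
    present = [m for m, c in enumerate(counts) if c >= 1]  # third keywords
    third_set = list(counts)
    for i, c in enumerate(counts):
        if c != 0 or not present:
            continue
        # third keyword m with the largest similarity to second keyword i
        m = max(present, key=lambda j: similarity(keyword_bag[i], keyword_bag[j]))
        third_set[i] = similarity(keyword_bag[i], keyword_bag[m]) * counts[m]
    return third_set
```

The non-zero elements are left untouched; only the zero elements are filled in, so no information from the second element set is lost.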
With reference to the first aspect, in a possible embodiment, the third element set is a third vector comprising N elements. For each third keyword the corresponding element keeps its count, and for each second keyword wi the corresponding element satisfies:
V3[i] = [cos(wm, wi)]max * xm
where V3 denotes the third vector, wm denotes the m-th keyword in the keyword bag of words (a third keyword), xm denotes the number of times the m-th keyword appears in the word segmentation result (xm >= 1), wi denotes the i-th keyword in the keyword bag of words (a second keyword), and [cos(wm, wi)]max indicates that the similarity value between the m-th keyword and the i-th keyword in the keyword bag of words is the largest.
In a second aspect, an embodiment of the present application provides a sample data processing device, the device comprising:
a first obtaining module, configured to obtain the word segmentation result obtained after a short text sample is segmented, the word segmentation result comprising at least one word, and to obtain a keyword bag of words comprising N keywords;
a determining module, configured to determine a first element set according to the word segmentation result and the keyword bag of words, the first element set comprising N elements, where the value of each element in the first element set is the number of times the corresponding keyword in the keyword bag of words appears in the word segmentation result;
a second obtaining module, configured to obtain a target word in the word segmentation result, the target word being a word that exists in the word segmentation result but not in the keyword bag of words, and to obtain the similarity value between the target word and each keyword in the keyword bag of words;
an updating module, configured to, when the similarity value between the target word and a first keyword in the keyword bag of words is greater than a similarity threshold, update the first element corresponding to the first keyword in the first element set according to the similarity value between the target word and the first keyword, to obtain a second element set;
wherein each element in the second element set is used to construct a machine learning model for short text classification.
With reference to the second aspect, in a possible embodiment, the first obtaining module comprises a first obtaining unit and a second obtaining unit. The first obtaining unit is configured to obtain the word segmentation result obtained after the short text sample is segmented, the word segmentation result comprising at least one word; the second obtaining unit is configured to obtain the keyword bag of words comprising N keywords. The second obtaining unit is specifically configured to: obtain a training sample set for generating keyword bags of words; determine N keywords from the training sample set according to the term frequency-inverse document frequency algorithm; and generate the keyword bag of words according to the N keywords.
With reference to the second aspect, in a possible embodiment, the first element set is a first vector comprising N elements, and the first vector satisfies:
V1 = [x1, x2, ..., xN]
where V1 denotes the first vector, xn denotes the number of times the n-th keyword in the keyword bag of words appears in the word segmentation result, and n ranges over the natural numbers from 1 to N.
With reference to the second aspect, in a possible embodiment, the second obtaining module comprises a third obtaining unit and a fourth obtaining unit. The third obtaining unit is configured to obtain the target word in the word segmentation result, the target word being a word that exists in the word segmentation result but not in the keyword bag of words; the fourth obtaining unit is configured to obtain the similarity value between the target word and each keyword in the keyword bag of words. The fourth obtaining unit is specifically configured to: obtain the word vector of the target word from a word vector database, and obtain the word vector of each keyword in the keyword bag of words from the word vector database; and calculate the similarity value between the word vector of the target word and the word vector of each keyword.
The updating module is specifically configured to: if the similarity value between the word vector of the target word and the word vector of the first keyword in the keyword bag of words is greater than the similarity threshold, calculate the product of the similarity value and the number of times the target word appears in the word segmentation result; and update the first element corresponding to the first keyword in the first element set to the sum of the product and the first element, to obtain the second element set.
With reference to the second aspect, in a possible embodiment, the device further comprises: a third obtaining module, configured to obtain M second keywords corresponding to M zero elements in the second element set; a fourth obtaining module, configured to obtain at least one third keyword in the keyword bag of words, where the number of times the third keyword appears in the word segmentation result is greater than or equal to 1; and a replacing module, configured to replace the zero elements in the second element set according to the similarity values between each of the M second keywords and each third keyword in the keyword bag of words, to obtain a third element set. Each element in the third element set is used to construct the machine learning model for short text classification.
With reference to the second aspect, in a possible embodiment, the replacing module is specifically configured to:
obtain the word vector of each of the M second keywords from the word vector database, and obtain the word vector of each of the at least one third keyword in the keyword bag of words from the word vector database; obtain the similarity value between the word vector of any second keyword i among the M second keywords and the word vector of each of the at least one third keyword; and, if the similarity value between the word vector of the second keyword i and the word vector of a third keyword m among the at least one third keyword is the largest, replace the zero element corresponding to the second keyword i in the second element set with the product of the similarity value between the second keyword i and the third keyword m and the number of times the third keyword m appears in the word segmentation result, to obtain the third element set.
With reference to the second aspect, in a possible embodiment, the third element set is a third vector comprising N elements. For each third keyword the corresponding element keeps its count, and for each second keyword wi the corresponding element satisfies:
V3[i] = [cos(wm, wi)]max * xm
where V3 denotes the third vector, wm denotes the m-th keyword in the keyword bag of words (a third keyword), xm denotes the number of times the m-th keyword appears in the word segmentation result (xm >= 1), wi denotes the i-th keyword in the keyword bag of words (a second keyword), and [cos(wm, wi)]max indicates that the similarity value between the m-th keyword and the i-th keyword in the keyword bag of words is the largest.
In a third aspect, an embodiment of the present application provides a terminal comprising a processor and a memory connected to each other, wherein the memory is configured to store a computer program that supports the terminal in executing the above method. The computer program comprises program instructions, and the processor is configured to call the program instructions to execute the sample data processing method of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program. The computer program comprises program instructions which, when executed by a processor, cause the processor to execute the sample data processing method of the first aspect.
In the embodiments of the present application, the word segmentation result of a short text sample is obtained together with a keyword bag of words comprising N keywords; a first element set is determined according to the word segmentation result and the keyword bag of words; a target word in the word segmentation result is obtained, together with the similarity value between the target word and each keyword in the keyword bag of words; and, when the similarity value between the target word and a first keyword in the keyword bag of words is greater than a similarity threshold, the first element corresponding to the first keyword in the first element set is updated according to the similarity value between the target word and the first keyword, to obtain a second element set. Each element in the second element set is used to construct a machine learning model for short text classification. This can increase the amount of information in the element set of the short text sample, improve the performance of the machine learning model constructed with the element set of the short text sample, and thereby improve the accuracy of short text classification.
Detailed description of the invention
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the accompanying drawings needed in the description of the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic flow diagram of a sample data processing method provided by an embodiment of the present application;
Fig. 2 is a schematic flow diagram of a keyword bag-of-words generation method provided by an embodiment of the present application;
Fig. 3 is a schematic diagram of the entries returned for the keyword "basketball";
Fig. 4 is another schematic flow diagram of a sample data processing method provided by an embodiment of the present application;
Fig. 5 is a schematic block diagram of a sample data processing device provided by an embodiment of the present application;
Fig. 6 is a schematic block diagram of a terminal provided by an embodiment of the present application.
Specific embodiment
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of this application.
It should be understood that the terms "first", "second", "third", etc. in the description, claims and drawings of this application are used to distinguish different objects, not to describe a particular order. In addition, the terms "comprise" and "have" and any variants thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that comprises a series of steps or units is not limited to the listed steps or units, but optionally also comprises steps or units that are not listed, or optionally also comprises other steps or units inherent to the process, method, product or device.
It should also be understood that reference herein to "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of this phrase at various places in the description do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive with other embodiments. Those skilled in the art will understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
It should be further understood that the term "and/or" used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
The sample data processing method and device provided by the embodiments of the present application are described below with reference to Fig. 1 to Fig. 6.
Referring to Fig. 1, which is a schematic flow diagram of a sample data processing method provided by an embodiment of the present application. As shown in Fig. 1, the sample data processing method may comprise the following steps:
S101: the terminal obtains the word segmentation result obtained after a short text sample is segmented, and obtains a keyword bag of words comprising N keywords.
In some possible embodiments, the terminal may obtain a short text sample input by a user, and may segment the short text sample using a segmentation method based on string matching, a machine learning algorithm based on statistics, or the like, to obtain the word segmentation result. The word segmentation result may comprise at least one word. The terminal may obtain, from a keyword bag-of-words database, any keyword bag of words comprising N keywords, where N may be an integer greater than 1. The keyword bag-of-words database may comprise keyword bags of words of multiple preset categories, each comprising N keywords. A keyword bag of words may be used to represent a set of words from a text in which word order, grammar and syntax are ignored. For example, for the short text sample "How is the weather today?", the word segmentation result after segmentation consists of the three words "today/weather/how".
Short text samples are all rather short, usually within 200 words, so the effective information they contain is also very limited; short text is usually characterized by sparsity, timeliness and irregularity. Segmentation methods based on string matching may include the forward maximum matching method, the reverse maximum matching method, the bidirectional matching method, etc.; machine learning algorithms based on statistics may include the hidden Markov model (HMM), conditional random fields (CRF), etc.
In some possible embodiments, the way the terminal obtains a keyword bag of words may be as shown in Fig. 2. Fig. 2 is a schematic flow diagram of a keyword bag-of-words generation method provided by an embodiment of the present application. As shown in Fig. 2, the keyword bag-of-words generation method of the embodiment of the present application may comprise the following steps:
S1011: obtain a training sample set for generating keyword bags of words.
In some possible embodiments, the terminal may search Baidu Zhidao (Baidu Knows) with preset keywords of different categories as queries, and may crawl the titles returned for the keywords of each category; for example, the weather category may use keywords such as "weather" and "light rain", and the sports category keywords such as "basketball", "baseball" and "running". The terminal may manually classify and label the titles crawled from Baidu Zhidao to obtain the training sample corresponding to each category, and may add the training sample corresponding to each category to the set of training samples to form the training sample set. A category may correspond to one training sample, and a training sample may comprise multiple titles (samples) of the same category.
For example, Fig. 3 is a schematic diagram of the entries returned for the keyword "basketball". Basketball is one of the keywords of the sports category; the keyword "basketball" is entered into Baidu Zhidao as a query, and the terminal crawls the first 50 titles returned for the keyword "basketball", such as the titles shown in Fig. 3: "How large is the diameter of a basketball", "What are the specifications of a basketball?", "What is the standard size of a basketball court?", etc.
S1012: determine N keywords from the training sample set according to the term frequency-inverse document frequency algorithm, and generate the keyword bag of words according to the N keywords.
In some possible embodiments, the terminal may use a segmentation tool (such as jieba or StanfordCoreNLP) to segment each sample (i.e. title) in each training sample of the above training sample set. For the training sample of each category, the terminal may use the term frequency-inverse document frequency (TF-IDF) algorithm to calculate the TF-IDF value of each word obtained after each sample in the training sample is segmented, sort the calculated TF-IDF values of the category from large to small to obtain a TF-IDF sequence, extract the N words corresponding to the first N TF-IDF values from the TF-IDF sequence, and then manually add or delete words among the N words to obtain the keyword bag of words comprising N keywords. N is an integer greater than 1. The terminal may preset a keyword bag-of-words database, which may be used to store the keyword bags of words of at least one category; each keyword bag of words in the database comprises N keywords.
The TF-IDF value of an entry w equals the product of TFw and IDFw, which satisfy:
TFw = (number of times entry w appears in a document) / (total number of entries in the document)
IDFw = log(total number of documents in the corpus / (number of documents containing entry w + 1))
An entry w may be used to denote any word obtained after a sample in a training sample is segmented; the corpus may be used to denote the set of all training samples, i.e. the training sample set; the number of documents may be used to denote the number of training samples, which may be equal to the number of categories of training samples. It should be noted that in the embodiments of the present application the total number of documents in the corpus is greater than 2, i.e. the training sample set contains training samples of at least 3 categories.
For example, suppose training samples of three categories, "weather", "credit card" and "air ticket", are crawled, and N = 100. Since the keyword bag of words of each category is generated in the same way, take the generation of the keyword bag of words of the "weather" category as an example. The terminal first uses the jieba or StanfordCoreNLP segmentation tool to segment all samples (titles) in the "weather" category, then calculates, according to the TF-IDF formula, the TF-IDF value of each word obtained after the samples of the "weather" category are segmented, sorts the calculated TF-IDF values of the "weather" category from large to small, and selects the 100 words with the largest TF-IDF values in the "weather" category as the keyword bag of words of the "weather" category. Here the corpus in TF-IDF consists of the training samples of the three categories "weather", "credit card" and "air ticket"; an entry w is any word obtained after the training sample of the "weather" category is segmented, and a document is a training sample. In this case the total number of documents in the corpus in IDFw is 3, namely the training samples of the three categories "weather", "credit card" and "air ticket", and the number of documents containing the entry w is 1, namely the training sample of the "weather" category.
Since a traditional keyword bag of words contains many keywords (for example, 1000 keywords in one bag) while short text is short (usually within 200 words), the dimension of the vectors generated for short text with a traditional keyword bag of words is too large. The terminal in the embodiments of the present application uses the TF-IDF algorithm to construct a smaller keyword bag of words (comprising 100 keywords), which largely avoids the problem of excessive dimension when vectors are generated with a keyword bag of words.
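The TF-IDF keyword selection above can be sketched as follows (a sketch assuming the common smoothed formulation given earlier; the manual add/delete step is omitted and all names are illustrative):

```python
import math
from collections import Counter

def top_n_keywords(category_docs, all_docs, n):
    """Select the n words with the highest TF-IDF score from one category's
    segmented samples. Both arguments are lists of token lists.
    TF = count / doc length; IDF = log(total docs / (docs with word + 1))."""
    scores = Counter()
    total = len(all_docs)
    for doc in category_docs:
        for word, count in Counter(doc).items():
            tf = count / len(doc)
            df = sum(1 for d in all_docs if word in d)
            idf = math.log(total / (df + 1))
            # keep the best score a word achieves in any one document
            scores[word] = max(scores[word], tf * idf)
    return [w for w, _ in scores.most_common(n)]
```

Words that occur in every category receive a low (even negative) IDF and are pushed out of the top N, which is the intended effect.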
S102, terminal determine the first element set according to word segmentation result and keyword bag of words.
In some possible embodiments, the terminal may count the number of times each keyword in the keyword bag of words occurs in the word segmentation result of the short text sample, and use each of these counts as an element of the first element set. The first element set may contain N elements.
In some possible embodiments, the first element set may be a first vector containing N elements. The first vector satisfies:
V1 = [x1, x2, ..., xN]
where V1 denotes the first vector and xn denotes the number of times the n-th keyword in the keyword bag of words occurs in the word segmentation result of the short text sample; n ranges over the natural numbers from 1 to N.
For example, suppose the keyword bag of words contains 5 keywords, namely "basketball / weather / football / running / today"; the first vector V1 then contains 5 elements. Suppose the second keyword "weather" and the fifth keyword "today" each occur once in the word segmentation result of the short text sample; the terminal then sets the second and fifth elements of V1 to 1. The remaining keywords in the bag (the first keyword "basketball", the third keyword "football" and the fourth keyword "running") occur 0 times in the word segmentation result, so the terminal sets the first, third and fourth elements of V1 to 0, giving V1 = [0, 1, 0, 0, 1].
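Step S102 reduces to counting keyword occurrences. A minimal sketch, using the five-keyword example above:

```python
def first_element_set(keyword_bag, tokens):
    """First vector V1: occurrences of each keyword in the segmentation result."""
    return [tokens.count(keyword) for keyword in keyword_bag]

bag = ["basketball", "weather", "football", "running", "today"]
tokens = ["weather", "today"]  # segmentation result of the short text sample
print(first_element_set(bag, tokens))  # → [0, 1, 0, 0, 1]
```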
S103: the terminal obtains the target words in the word segmentation result, and obtains the similarity value between each target word and each keyword in the keyword bag of words.
In some possible embodiments, the terminal may search the word segmentation result of the short text sample for target words. If the word segmentation result contains at least one target word, the terminal may take any target word in the result and compute the similarity value between it and each keyword in the keyword bag of words, for example a cosine value or a Euclidean distance. If the word segmentation result contains no target word, the terminal may take the first element set as the second element set. Here, a target word denotes a word that is present in the word segmentation result but absent from the keyword bag of words.
For example, suppose the word segmentation result of the short text sample contains 5 words, namely C1, C2, C3, C4 and C5. The terminal checks, for each of these words, whether an identical keyword exists in the keyword bag of words. If the bag contains no keyword identical to C2 or C5, the terminal marks C2 and C5 as target words. The terminal then takes any one of the target words C2 and C5, say C2, and computes the similarity value between C2 and each keyword in the keyword bag of words.
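Detecting the target words of step S103 is a set-membership check; a minimal sketch under the C1..C5 example above:

```python
def find_target_words(tokens, keyword_bag):
    """Target words: present in the segmentation result, absent from the bag."""
    bag = set(keyword_bag)
    targets = []
    for token in tokens:
        if token not in bag and token not in targets:
            targets.append(token)  # keep first-occurrence order, no duplicates
    return targets

tokens = ["C1", "C2", "C3", "C4", "C5"]
print(find_target_words(tokens, ["C1", "C3", "C4"]))  # → ['C2', 'C5']
```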
S104: if the similarity value between the target word and a first keyword in the keyword bag of words is greater than a similarity threshold, the terminal updates, according to that similarity value, the first element corresponding to the first keyword in the first element set, obtaining the second element set.
In some possible embodiments, after obtaining the similarity values between the target word and the keywords in the keyword bag of words, the terminal may check whether any of these similarity values exceeds a preset similarity threshold. If the similarity value between the target word and a first keyword in the bag exceeds the threshold, the target word and the first keyword are synonyms, and the terminal may use that similarity value to update the first element corresponding to the first keyword in the first element set, obtaining the second element set. If the similarity values between the target word and all keywords in the bag are less than or equal to the threshold, the keyword bag contains no synonym of that target word, and the terminal may take the next target word in the word segmentation result and compute its similarity values with the keywords in the bag. If, for every target word in the word segmentation result, all similarity values with the keywords are less than or equal to the threshold, the bag contains no synonym of any target word, and the terminal may take the first element set as the second element set. Each element of the second element set can be used to construct a machine learning model for short text classification.
Because the keyword bag of words in this embodiment of the application contains relatively few keywords, part of the information in the short text sample is lost when the first element set is determined from the keyword bag of words and the word segmentation result: some words in the word segmentation result do not appear in the keyword bag, i.e. the short text sample contains target words. And since each element of the first element set is the number of times a keyword occurs in the word segmentation result, the target words are not reflected in the first element set at all. This embodiment of the application therefore uses the similarity values between each target word and the keywords in the keyword bag of words to find the target word's synonym in the bag, and updates the element corresponding to that synonym in the first element set, obtaining the second element set. This increases the information content of the element set and compensates for the information lost by shrinking the keyword bag of words, so that a conventional machine learning model built for short text classification from the second element set performs better and classifies short texts more accurately.
In this embodiment of the application, the terminal obtains the word segmentation result obtained after a short text sample is segmented and a keyword bag of words containing N keywords, determines the first element set from the word segmentation result and the keyword bag of words, obtains the target words in the word segmentation result and the similarity value between each target word and each keyword in the keyword bag of words, and, when the similarity value between a target word and a first keyword in the keyword bag of words is greater than the similarity threshold, updates the first element corresponding to the first keyword in the first element set according to that similarity value, obtaining the second element set, each element of which is used to construct a machine learning model for short text classification. This increases the information content of the element set of the short text sample, improves the performance of the machine learning model constructed from the element set of the short text sample, and thus improves the accuracy of short text classification.
Referring to Fig. 4, another schematic flow diagram of the sample data processing method provided by the embodiments of the present application. As shown in Fig. 4, the sample data processing method may include the following steps:
S401: the terminal obtains the word segmentation result obtained after a short text sample is segmented, and obtains a keyword bag of words containing N keywords.
S402: the terminal determines the first element set from the word segmentation result and the keyword bag of words.
In some possible embodiments, steps S401-S402 may be implemented in the same way as steps S101-S102 of the embodiment shown in Fig. 1; details are not repeated here.
S403: the terminal obtains a target word in the word segmentation result.
S404: the terminal obtains the word vector of the target word from a word vector database, and obtains the word vector of each keyword in the keyword bag of words from the word vector database.
S405: the terminal computes the similarity value between the word vector of the target word and the word vector of each keyword.
S406: if the similarity value between the word vector of the target word and the word vector of a first keyword in the keyword bag of words is greater than the similarity threshold, the terminal computes the product of that similarity value and the number of times the target word occurs in the word segmentation result.
S407: the terminal updates the first element corresponding to the first keyword in the first element set to the sum of the product and the first element, obtaining the second element set.
In some possible embodiments, the terminal may search the word segmentation result of the short text sample for target words. If the word segmentation result contains at least one target word, the terminal may take any target word in the result, obtain the word vector of that target word from the word vector database, and obtain the word vector of each keyword in the keyword bag of words from the word vector database. The terminal may compute the similarity value (e.g. a cosine value or a Euclidean distance) between the word vector of the target word and the word vector of each keyword, and may check whether any of these similarity values exceeds a preset similarity threshold. If the similarity value between the word vector of the target word and the word vector of a first keyword in the bag exceeds the threshold, the target word and the first keyword are synonyms; the terminal may then compute the product of that similarity value and the number of times the target word occurs in the word segmentation result of the short text sample, and may update the first element corresponding to the first keyword in the first element set to the sum of the product and the first element, obtaining the second element set. Here, a target word denotes a word that is present in the word segmentation result but absent from the keyword bag of words.
In some possible embodiments, if the word segmentation result contains no target word, the terminal may take the first element set as the second element set. Likewise, if the similarity values between the word vector of the target word and the word vectors of all keywords in the bag are less than or equal to the similarity threshold, the keyword bag contains no synonym of that target word, and the terminal may take the next target word in the word segmentation result and compute the similarity values between its word vector and the word vectors of the keywords in the bag. If, for every target word in the word segmentation result, the similarity values between its word vector and the word vectors of the keywords are all less than or equal to the threshold, the bag contains no synonym of any target word, and the terminal may take the first element set as the second element set.
In some possible embodiments, the word vector database may be generated as follows: (1) the terminal crawls a large-scale unlabeled corpus (about 10 GB) from Wikipedia, segments this unlabeled corpus, and feeds the segmented corpus into a continuous bag-of-words (CBOW) model for training. (2) After the CBOW model is trained, the terminal may obtain the word vectors of all words in the corpus output by the CBOW model, and may store the word vectors of all words in the corpus into the word vector database.
In some possible embodiments, the second element set may be a second vector containing N elements. The second vector satisfies:
V2 = [x1, ..., xk + cos(wk, wj) · xj, ..., xN]
where V2 denotes the second vector, wj denotes a target word in the word segmentation result of the short text sample, wk denotes the k-th keyword in the keyword bag of words, i.e. the first keyword, and cos(wk, wj) denotes the cosine value between the word vector of the target word wj and the word vector of the first keyword wk; cos(wk, wj) is greater than the preset similarity threshold, e.g. 0.7, i.e. wj and wk are synonyms. The terminal updates the first element xk of the first vector V1 to xk + cos(wk, wj) · xj, where xj denotes the number of times the target word occurs in the word segmentation result of the short text sample and xk denotes the number of times the first keyword occurs in the word segmentation result of the short text sample.
For example, suppose the word segmentation result of the short text sample contains 2 target words, wj1 and wj2, the keyword bag of words contains 10 keywords w1, w2, w3, ..., w10, and the preset similarity threshold is 0.7. The terminal computes the cosine values between the word vector of the target word wj1 and the word vectors of the 10 keywords w1, w2, w3, ..., w10. Suppose the cosine value between the word vectors of wj1 and the fourth keyword w4 is cos(w4, wj1) = 0.8, which is greater than the preset similarity threshold 0.7; the terminal computes the product of the cosine value cos(w4, wj1) and the number of times xj1 that wj1 occurs in the word segmentation result, i.e. 0.8 · xj1, and updates the fourth element x4 of the first vector V1 to the sum of the fourth element x4 and the product, i.e. x4 + 0.8 · xj1. The terminal then computes the cosine values between the word vector of the target word wj2 and the word vectors of the 10 keywords w1, w2, w3, ..., w10. Suppose the cosine values between the word vector of wj2 and the word vectors of the third keyword w3, the sixth keyword w6 and the seventh keyword w7 all exceed the similarity threshold 0.7, i.e. cos(w3, wj2), cos(w6, wj2) and cos(w7, wj2) are all greater than 0.7; the terminal correspondingly updates the third element x3, the sixth element x6 and the seventh element x7 of the first vector V1.
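Steps S404 to S407 can be sketched as follows. The word vectors are hypothetical 2-dimensional values chosen so that only the keyword `w1` crosses the 0.7 threshold; a real system would read them from the word vector database.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def second_element_set(v1, keyword_bag, tokens, vectors, threshold=0.7):
    """Add cos(keyword, target) * occurrence count of the target word to each
    keyword element whose similarity with the target exceeds the threshold."""
    bag = set(keyword_bag)
    v2 = [float(x) for x in v1]
    for target in {t for t in tokens if t not in bag}:
        if target not in vectors:
            continue
        for k, keyword in enumerate(keyword_bag):
            if keyword not in vectors:
                continue
            sim = cosine(vectors[keyword], vectors[target])
            if sim > threshold:
                v2[k] += sim * tokens.count(target)
    return v2

vectors = {"w1": [1.0, 0.0], "w2": [0.0, 1.0], "wj": [1.0, 0.2]}
result = second_element_set([0, 0], ["w1", "w2"], ["wj", "wj"], vectors)
print(result)
```

Here the target word `wj` occurs twice and is a synonym of `w1` only, so the first element becomes cos(w1, wj) · 2 ≈ 1.96 while the second stays 0.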
S408: the terminal obtains the M second keywords corresponding to the M zero elements in the second element set.
In some possible embodiments, the terminal may check whether the second element set contains any zero elements. If the second element set contains zero elements, the terminal may obtain the M zero elements of the second element set and take the keyword corresponding to each zero element as a second keyword, obtaining M second keywords. If the second element set contains no zero element, the terminal may take the second element set as the third element set. Here, M may be an integer greater than or equal to 1. A zero element of the second element set indicates that some keyword in the keyword bag of words occurs 0 times in the word segmentation result and that the synonyms of that keyword also occur 0 times in the word segmentation result; that is, a second keyword is a keyword that does not appear in the word segmentation result and whose synonyms do not appear in the word segmentation result either.
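Collecting the second keywords of step S408 is a scan for zero elements; a minimal sketch:

```python
def second_keywords(v2, keyword_bag):
    """Keywords whose element in the second element set is 0."""
    return [keyword for value, keyword in zip(v2, keyword_bag) if value == 0]

bag = ["basketball", "weather", "football", "running", "today"]
print(second_keywords([0, 1, 0, 0, 1.8], bag))
# → ['basketball', 'football', 'running']
```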
In some possible embodiments, the terminal may take any keyword from the keyword bag of words and check whether every keyword in the bag has already been processed. For a keyword that has not yet been processed, the terminal checks whether the number of times it occurs in the word segmentation result is 0; if so, the terminal computes the similarity values between that keyword and each word in the word segmentation result. If the similarity values between the keyword and all words in the word segmentation result are less than or equal to the preset similarity threshold, the word segmentation result contains no synonym of the keyword, and the keyword may be marked as a second keyword. If the keyword bag of words contains no second keyword, the terminal may take the second element set as the third element set.
S409: the terminal obtains at least one third keyword in the keyword bag of words.
S410: the terminal replaces the zero elements of the second element set according to the similarity values between each of the M second keywords and each third keyword in the keyword bag of words, obtaining the third element set.
In some possible embodiments, the terminal may obtain at least one third keyword from the keyword bag of words, a third keyword being a keyword that occurs at least once in the word segmentation result. The terminal may compute the similarity value (e.g. a cosine value or a Euclidean distance) between each of the M second keywords and each third keyword in the keyword bag of words, and may replace the zero elements of the second element set according to these similarity values, obtaining the third element set. Each element of the third element set is used to construct a machine learning model for short text classification. The basic principle of a machine learning model is to multiply each element of the element set by a weight and finally output the probabilities that the element set belongs to the different categories. If the element set contains a zero element, the product of that zero element and any weight is still 0, and the probability output by the machine learning model may then be 0, so a model for short text classification built from an element set containing zero elements is not very accurate. This embodiment of the application therefore replaces the zero elements of the second element set, so that the third element set contains no zero element, which improves the accuracy of the machine learning model while preserving the information content of the element set, and thus improves the accuracy of short text classification.
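The zero-element problem described above can be seen in a one-line linear score: an element of 0 contributes nothing to the weighted sum, no matter how large its learned weight is (the weights here are arbitrary illustrative values):

```python
weights = [0.9, -1.0, 3.0]   # hypothetical learned weights
features = [0, 2, 1]         # element set containing a zero element
score = sum(w * x for w, x in zip(weights, features))
print(score)  # → 1.0 ; the first weight had no effect because 0 * 0.9 == 0
```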
In some possible embodiments, after obtaining the at least one third keyword, the terminal may obtain from the word vector database the word vector of each of the M second keywords, and may obtain from the word vector database the word vector of each third keyword in the keyword bag of words. For any second keyword i among the M second keywords, the terminal may obtain the similarity value (e.g. a cosine value or a Euclidean distance) between the word vector of the second keyword i and the word vector of each third keyword. If the similarity value between the word vector of the second keyword i and the word vector of a third keyword m among the at least one third keyword is the largest, the terminal may obtain the number of times the third keyword m occurs in the word segmentation result, compute the product of that similarity value and that number, and replace the zero element corresponding to the second keyword i in the second element set with the product. The third element set is obtained after all M zero elements corresponding to the M second keywords in the second element set have been replaced. Each element of the third element set is used to construct a machine learning model for short text classification.
In some possible embodiments, the third element set may be a third vector containing N elements. The third vector satisfies:
V3 = [x1, ..., [cos(wm, wi)]max · xm, ..., xN]
where V3 denotes the third vector, wm denotes the m-th keyword in the keyword bag of words, i.e. a third keyword, xm denotes the number of times the m-th keyword in the keyword bag of words occurs in the word segmentation result of the short text sample, with xm ≥ 1, wi denotes the i-th keyword in the keyword bag of words, i.e. a second keyword whose zero element is replaced by [cos(wm, wi)]max · xm, and [cos(wm, wi)]max indicates that the similarity value between the m-th keyword and the i-th keyword in the keyword bag of words is the largest.
For example, suppose the fifth element of the second element set is 0 and the keyword corresponding to the fifth element is w5 (i.e. a second keyword). Suppose the keyword bag of words contains 3 third keywords, namely w3, w6 and w12. The terminal computes the cosine values between the word vector of the second keyword w5 and the word vectors of the third keywords w3, w6 and w12. Suppose cos(w3, w5) = 0.3, cos(w6, w5) = 0.7 and cos(w12, w5) = 0.5. Since the cosine value between the second keyword w5 and the third keyword w6 is the largest, the terminal obtains the number of times x6 that the third keyword w6 occurs in the word segmentation result, where x6 ≥ 1, computes the product of the cosine value cos(w6, w5) and this number, i.e. 0.7 · x6, and replaces the fifth element of the second element set (0) with the product 0.7 · x6.
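The replacement of steps S409-S410 can be sketched as follows; the word vectors are hypothetical values arranged so that, as in the example above, the second keyword is most similar to one particular third keyword.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def third_element_set(v2, keyword_bag, tokens, vectors):
    """Replace each 0 element with max-similarity * count of the most similar
    third keyword (a keyword occurring in the segmentation result)."""
    counts = {kw: tokens.count(kw) for kw in keyword_bag}
    thirds = [kw for kw in keyword_bag if counts[kw] >= 1 and kw in vectors]
    v3 = [float(x) for x in v2]
    for i, (value, kw) in enumerate(zip(v2, keyword_bag)):
        if value == 0 and kw in vectors and thirds:
            best = max(thirds, key=lambda t: cosine(vectors[kw], vectors[t]))
            v3[i] = cosine(vectors[kw], vectors[best]) * counts[best]
    return v3

vectors = {"w5": [1.0, 0.0], "w3": [0.0, 1.0], "w6": [1.0, 1.0]}
tokens = ["w3", "w6", "w6"]
result = third_element_set([1, 0, 2], ["w3", "w5", "w6"], tokens, vectors)
print(result)
```

With these values the zero element of `w5` becomes cos(w5, w6) · 2 ≈ 1.41, and the nonzero elements are left untouched.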
In this embodiment of the application, the terminal updates the first element set into the second element set by computing the similarity values between the word vectors of the target words in the word segmentation result of the short text sample and the word vectors of the keywords in the keyword bag of words, and then replaces the zero elements of the second element set according to the similarity values between the word vectors of the second keywords and the word vectors of the third keywords in the keyword bag of words, obtaining the third element set. Each element of the third element set is used to construct a machine learning model for short text classification. This improves the accuracy of the machine learning model while preserving the information content of the element set, and thus improves the accuracy of short text classification.
Referring to Fig. 5, a schematic block diagram of the sample data processing apparatus provided by the embodiments of the present application. As shown in Fig. 5, the sample data processing apparatus in the embodiments of the present application includes:
a first obtaining module 10, configured to obtain the word segmentation result obtained after a short text sample is segmented, the word segmentation result containing at least one word, and to obtain a keyword bag of words containing N keywords;
a determining module 20, configured to determine a first element set from the word segmentation result and the keyword bag of words, the first element set containing N elements, the value of each element being the number of times the corresponding keyword in the keyword bag of words occurs in the word segmentation result;
a second obtaining module 30, configured to obtain the target word in the word segmentation result, the target word being a word that is present in the word segmentation result but absent from the keyword bag of words, and to obtain the similarity value between the target word and each keyword in the keyword bag of words;
an update module 40, configured to, when the similarity value between the target word and a first keyword in the keyword bag of words is greater than a similarity threshold, update the first element corresponding to the first keyword in the first element set according to that similarity value, obtaining a second element set;
where each element of the second element set is used to construct a machine learning model for short text classification.
In some possible embodiments, the first obtaining module 10 includes a first obtaining unit 101 and a second obtaining unit 102. The first obtaining unit 101 is configured to obtain the word segmentation result obtained after a short text sample is segmented, the word segmentation result containing at least one word; the second obtaining unit 102 is configured to obtain the keyword bag of words containing N keywords.
The second obtaining unit 102 is specifically configured to: obtain a training sample set for generating the keyword bag of words; determine the N keywords from the training sample set according to the term frequency-inverse document frequency (TF-IDF) algorithm; and generate the keyword bag of words from the N keywords.
In some possible embodiments, the first element set is a first vector containing N elements, and the first vector satisfies:
V1 = [x1, x2, ..., xN]
where V1 denotes the first vector and xn denotes the number of times the n-th keyword in the keyword bag of words occurs in the word segmentation result; n ranges over the natural numbers from 1 to N.
In some possible embodiments, the second obtaining module 30 includes a third obtaining unit 301 and a fourth obtaining unit 302. The third obtaining unit 301 is configured to obtain the target word in the word segmentation result, the target word being a word that is present in the word segmentation result but absent from the keyword bag of words; the fourth obtaining unit 302 is configured to obtain the similarity value between the target word and each keyword in the keyword bag of words.
The fourth obtaining unit 302 is specifically configured to: obtain the word vector of the target word from a word vector database, and obtain the word vector of each keyword in the keyword bag of words from the word vector database; and compute the similarity value between the word vector of the target word and the word vector of each keyword. The update module 40 is specifically configured to: if the similarity value between the word vector of the target word and the word vector of a first keyword in the keyword bag of words is greater than the similarity threshold, compute the product of that similarity value and the number of times the target word occurs in the word segmentation result; and update the first element corresponding to the first keyword in the first element set to the sum of the product and the first element, obtaining the second element set.
In some possible embodiments, the apparatus further includes a third obtaining module 50, a fourth obtaining module 60 and a replacement module 70.
The third obtaining module 50 is configured to obtain the M second keywords corresponding to the M zero elements in the second element set; the fourth obtaining module 60 is configured to obtain at least one third keyword in the keyword bag of words, a third keyword occurring at least once in the word segmentation result; the replacement module 70 is configured to replace the zero elements of the second element set according to the similarity values between each of the M second keywords and each third keyword in the keyword bag of words, obtaining a third element set. Each element of the third element set is used to construct a machine learning model for short text classification.
In some possible embodiments, the replacement module 70 is specifically configured to: obtain from a word vector database the word vector of each of the M second keywords, and obtain from the word vector database the word vector of each third keyword among the at least one third keyword in the keyword bag of words; obtain the similarity value between the word vector of any second keyword i among the M second keywords and the word vector of each third keyword; and, if the similarity value between the word vector of the second keyword i and the word vector of a third keyword m among the at least one third keyword is the largest, replace the zero element corresponding to the second keyword i in the second element set with the product of the similarity value between the second keyword i and the third keyword m and the number of times the third keyword m occurs in the word segmentation result, obtaining the third element set.
In some possible embodiments, the third element set is a third vector containing N elements, and the third vector satisfies:
V3 = [x1, ..., [cos(wm, wi)]max · xm, ..., xN]
where V3 denotes the third vector, wm denotes the m-th keyword in the keyword bag of words, i.e. a third keyword, xm denotes the number of times the m-th keyword in the keyword bag of words occurs in the word segmentation result, with xm ≥ 1, wi denotes the i-th keyword in the keyword bag of words, i.e. a second keyword, and [cos(wm, wi)]max indicates that the similarity value between the m-th keyword and the i-th keyword in the keyword bag of words is the largest.
In a specific implementation, the sample data processing apparatus may, through the above modules, execute the implementations provided by the steps of the implementations shown in Fig. 1, Fig. 2 or Fig. 4 and realize the functions realized in the above embodiments; for details, refer to the corresponding descriptions of the steps in the method embodiments shown in Fig. 1, Fig. 2 or Fig. 4, which are not repeated here.
In this embodiment of the application, the sample data processing apparatus obtains the word segmentation result obtained after a short text sample is segmented and a keyword bag of words containing N keywords, determines the first element set from the word segmentation result and the keyword bag of words, obtains the target word in the word segmentation result and the similarity value between the target word and each keyword in the keyword bag of words, and, when the similarity value between the target word and a first keyword in the keyword bag of words is greater than the similarity threshold, updates the first element corresponding to the first keyword in the first element set according to that similarity value, obtaining the second element set, each element of which is used to construct a machine learning model for short text classification. This increases the information content of the element set of the short text sample, improves the performance of the machine learning model constructed from the element set of the short text sample, and thus improves the accuracy of short text classification.
Referring to Fig. 6, a schematic block diagram of the terminal provided by the embodiments of the present application. As shown in Fig. 6, the terminal in the embodiments of the present application may include one or more processors 601 and a memory 602, connected through a bus 603. The memory 602 is configured to store a computer program containing program instructions, and the processor 601 is configured to execute the program instructions stored in the memory 602. The processor 601 is configured to call the program instructions to execute:
obtaining the word segmentation result obtained after a short text sample is segmented, the word segmentation result containing at least one word, and obtaining a keyword bag of words containing N keywords;
determining a first element set from the word segmentation result and the keyword bag of words, the first element set containing N elements, the value of each element being the number of times the corresponding keyword in the keyword bag of words occurs in the word segmentation result;
obtaining the target word in the word segmentation result, the target word being a word that is present in the word segmentation result but absent from the keyword bag of words, and obtaining the similarity value between the target word and each keyword in the keyword bag of words;
if the similarity value between the target word and a first keyword in the keyword bag of words is greater than a similarity threshold, updating the first element corresponding to the first keyword in the first element set according to that similarity value, obtaining a second element set;
where each element of the second element set is used to construct a machine learning model for short text classification.
It should be understood that, in the embodiments of the present application, the processor 601 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or any conventional processor.
The memory 602 may include read-only memory and random access memory, and provides instructions and data to the processor 601. A part of the memory 602 may also include non-volatile random access memory; for example, the memory 602 may also store information of the device type.
In a specific implementation, the processor 601 described in the embodiments of this application may execute the implementation of the sample data processing method provided by the embodiments of this application, the implementation of the keyword bag-of-words generation method provided by the embodiments of this application, or the implementation of the sample data processing apparatus described in the embodiments of this application; details are not repeated here.
The embodiments of this application also provide a computer-readable storage medium storing a computer program that comprises program instructions. When executed by a processor, the program instructions implement the sample data processing method shown in Fig. 1 and Fig. 4, or the keyword bag-of-words generation method shown in Fig. 2; for details, refer to the descriptions of the embodiments shown in Fig. 1, Fig. 2, or Fig. 4, which are not repeated here.
The above computer-readable storage medium may be an internal storage unit of the sample data processing apparatus or electronic device described in any of the foregoing embodiments, such as a hard disk or memory of the electronic device. The computer-readable storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card mounted on the electronic device. Further, the computer-readable storage medium may include both an internal storage unit of the electronic device and an external storage device. The computer-readable storage medium stores the computer program and other programs and data required by the electronic device, and may also be used to temporarily store data that has been output or will be output.
Those of ordinary skill in the art will appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms of function. Whether these functions are implemented in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled artisans may implement the described functions differently for each particular application, but such implementations should not be considered beyond the scope of this application.
This application is described with reference to flowcharts and/or block diagrams of the methods, apparatuses, and computer program products of its embodiments. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be realized by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment produce an apparatus for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing equipment to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus, the instruction apparatus realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, so that a series of operational steps is executed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although this application has been described in conjunction with specific features and embodiments, it is apparent that various modifications and combinations can be made without departing from its spirit and scope. Accordingly, the specification and drawings are merely an exemplary illustration of the application as defined by the appended claims, and are considered to cover any and all modifications, variations, combinations, or equivalents within the scope of the application. Obviously, those skilled in the art can make various modifications and variations to this application without departing from its spirit and scope; if such modifications and variations fall within the scope of the claims of this application and their technical equivalents, this application is also intended to include them.
Claims (10)
1. A sample data processing method, comprising:
obtaining the word segmentation result produced by segmenting a short text sample, and obtaining a keyword bag of words comprising N keywords, wherein the word segmentation result comprises at least one word;
determining a first element set according to the word segmentation result and the keyword bag of words, wherein the first element set comprises N elements, and the value of each element in the first element set is the number of times the corresponding keyword in the keyword bag of words occurs in the word segmentation result;
obtaining a target word in the word segmentation result, and obtaining the similarity value between the target word and each keyword in the keyword bag of words, wherein the target word is a word that exists in the word segmentation result and does not exist in the keyword bag of words;
if the similarity value between the target word and a first keyword in the keyword bag of words is greater than a similarity threshold, updating the first element corresponding to the first keyword in the first element set according to the similarity value between the target word and the first keyword, to obtain a second element set;
wherein each element in the second element set is used to construct a machine learning model for short text classification.
2. The method according to claim 1, wherein obtaining the keyword bag of words comprising N keywords comprises:
obtaining a training sample set for generating the keyword bag of words;
determining N keywords from the training sample set according to a term frequency-inverse document frequency (TF-IDF) algorithm, and generating the keyword bag of words from the N keywords.
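Claim 2 selects the N keywords by TF-IDF. A minimal pure-Python sketch follows; the scoring details (taking each word's maximum per-document TF-IDF score) are illustrative assumptions, not taken from the patent, which does not specify the exact weighting:

```python
import math
from collections import Counter

def top_n_keywords(docs, n):
    """Pick the n highest-scoring words, by TF-IDF, from pre-segmented
    training documents (each doc is a list of words)."""
    df = Counter()                       # document frequency per word
    for doc in docs:
        df.update(set(doc))
    scores = Counter()
    for doc in docs:
        tf = Counter(doc)
        for word, count in tf.items():
            # TF-IDF: term frequency in this doc x inverse document frequency.
            idf = math.log(len(docs) / df[word])
            scores[word] = max(scores[word], (count / len(doc)) * idf)
    return [w for w, _ in scores.most_common(n)]
```

Words that appear in every document get an IDF of zero and are naturally excluded, which matches the intuition behind using TF-IDF to pick discriminative keywords for the bag.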
3. The method according to claim 1 or 2, wherein the first element set is a first vector comprising N elements, and the first vector satisfies:
V1 = [c_{w_1}, c_{w_2}, ..., c_{w_N}]
wherein V1 denotes the first vector, c_{w_n} denotes the number of times the n-th keyword in the keyword bag of words occurs in the word segmentation result, and the value range of n is the natural numbers from 1 to N.
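The first vector of claim 3 is a plain keyword-count lookup, which can be sketched as (the function name and tokenized inputs are illustrative assumptions):

```python
from collections import Counter

def first_vector(tokens, keywords):
    """Element n is the count of the n-th bag keyword in the word
    segmentation result; keywords absent from the text count as 0."""
    counts = Counter(tokens)
    return [counts[k] for k in keywords]
```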
4. The method according to any one of claims 1-3, wherein obtaining the similarity value between the target word and each keyword in the keyword bag of words comprises:
obtaining the word vector of the target word from a word vector database, and obtaining the word vector of each keyword in the keyword bag of words from the word vector database;
calculating the similarity value between the word vector of the target word and the word vector of each keyword;
and wherein, if the similarity value between the target word and the first keyword in the keyword bag of words is greater than the similarity threshold, updating the first element corresponding to the first keyword in the first element set according to the similarity value between the target word and the first keyword to obtain the second element set comprises:
if the similarity value between the word vector of the target word and the word vector of the first keyword in the keyword bag of words is greater than the similarity threshold, calculating the product of that similarity value and the number of times the target word occurs in the word segmentation result;
updating the first element corresponding to the first keyword in the first element set to the sum of that product and the first element, to obtain the second element set.
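Claim 4's update rule can be sketched as follows, assuming cosine similarity over word vectors (the claims only say "similarity value"; cosine is suggested by the cos(·,·) notation of claim 7 but remains an assumption here):

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def updated_element(first_element, target_vec, keyword_vec, target_count,
                    threshold=0.7):
    """Claim 4's rule: if cos(target, keyword) > threshold, the new element
    is the old element plus (similarity x target-word occurrence count)."""
    sim = cosine(target_vec, keyword_vec)
    if sim > threshold:
        return first_element + sim * target_count
    return first_element
```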
5. The method according to any one of claims 1-4, wherein after obtaining the second element set, the method further comprises:
obtaining the M second keywords corresponding to the M zero-valued elements in the second element set;
obtaining at least one third keyword in the keyword bag of words, wherein the number of times a third keyword occurs in the word segmentation result is greater than or equal to 1;
replacing the zero-valued elements in the second element set according to the similarity values between each of the M second keywords and each third keyword in the keyword bag of words, to obtain a third element set;
wherein each element in the third element set is used to construct a machine learning model for short text classification.
6. The method according to claim 5, wherein replacing the zero-valued elements in the second element set according to the similarity values between each of the M second keywords and each third keyword in the keyword bag of words to obtain the third element set comprises:
obtaining the word vector of each of the M second keywords from a word vector database, and obtaining the word vector of each of the at least one third keyword in the keyword bag of words from the word vector database;
obtaining the similarity value between the word vector of any second keyword i among the M second keywords and the word vector of each of the at least one third keyword;
if the similarity value between the word vector of the second keyword i and the word vector of a third keyword m among the at least one third keyword is the largest, replacing the zero-valued element corresponding to the second keyword i in the second element set with the product of the similarity value between the second keyword i and the third keyword m and the number of times the third keyword m occurs in the word segmentation result, to obtain the third element set.
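The zero-element replacement of claims 5 and 6 can be sketched as follows. This is illustrative only: the `word_vectors` lookup, the helper names, and the use of cosine similarity are assumptions.

```python
import numpy as np

def third_element_set(second, tokens, keywords, word_vectors):
    """Each zero element (its 'second keyword' is absent from the text) is
    replaced by: (max similarity to any present 'third keyword') x (that
    third keyword's occurrence count in the segmentation result)."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Third keywords: bag keywords occurring at least once in the text.
    present = [(k, tokens.count(k)) for k in keywords if tokens.count(k) >= 1]
    third = list(second)
    for idx, value in enumerate(second):
        if value != 0:
            continue                      # only zero-valued elements change
        w_i = keywords[idx]               # the second keyword for this slot
        if w_i not in word_vectors:
            continue
        candidates = [(cos(word_vectors[w_i], word_vectors[k]), c)
                      for k, c in present if k in word_vectors]
        if not candidates:
            continue
        sim, count = max(candidates)      # third keyword with max similarity
        third[idx] = sim * count
    return third
```

The effect is that a short text mentioning only "cat" still yields a non-zero feature for "dog" (a semantically close bag keyword) while leaving unrelated keywords near zero, which is what makes the resulting vector denser for short-text classification.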
7. The method according to claim 6, wherein the third element set is a third vector comprising N elements, and the third vector satisfies:
for each element of V3 corresponding to a second keyword w_i, its value is [cos(w_m, w_i)]_max * c_{w_m};
wherein V3 denotes the third vector; w_m denotes the m-th keyword in the keyword bag of words, being a third keyword; c_{w_m} denotes the number of times the m-th keyword in the keyword bag of words occurs in the word segmentation result; w_i denotes the i-th keyword in the keyword bag of words, being a second keyword; and [cos(w_m, w_i)]_max indicates that the similarity value between the m-th keyword and the i-th keyword in the keyword bag of words is the maximum.
8. A sample data processing apparatus, comprising:
a first obtaining module, configured to obtain the word segmentation result produced by segmenting a short text sample, and to obtain a keyword bag of words comprising N keywords, the word segmentation result comprising at least one word;
a determining module, configured to determine a first element set according to the word segmentation result and the keyword bag of words, the first element set comprising N elements, wherein the value of each element in the first element set is the number of times the corresponding keyword in the keyword bag of words occurs in the word segmentation result;
a second obtaining module, configured to obtain a target word in the word segmentation result, and to obtain the similarity value between the target word and each keyword in the keyword bag of words, the target word being a word that exists in the word segmentation result and does not exist in the keyword bag of words;
an updating module, configured to, when the similarity value between the target word and a first keyword in the keyword bag of words is greater than a similarity threshold, update the first element corresponding to the first keyword in the first element set according to the similarity value between the target word and the first keyword, to obtain a second element set;
wherein each element in the second element set is used to construct a machine learning model for short text classification.
9. A terminal, comprising a processor and a memory connected to each other, wherein the memory is configured to store a computer program comprising program instructions, and the processor is configured to call the program instructions to execute the method according to any one of claims 1-7.
10. A computer-readable storage medium, wherein the computer storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to execute the method according to any one of claims 1-7.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811421160.8A CN109508378B (en) | 2018-11-26 | 2018-11-26 | Sample data processing method and device |
PCT/CN2019/088803 WO2020107835A1 (en) | 2018-11-26 | 2019-05-28 | Sample data processing method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811421160.8A CN109508378B (en) | 2018-11-26 | 2018-11-26 | Sample data processing method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109508378A true CN109508378A (en) | 2019-03-22 |
CN109508378B CN109508378B (en) | 2023-07-14 |
Family
ID=65750624
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811421160.8A Active CN109508378B (en) | 2018-11-26 | 2018-11-26 | Sample data processing method and device |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN109508378B (en) |
WO (1) | WO2020107835A1 (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113312481A (en) * | 2021-05-27 | 2021-08-27 | 中国平安人寿保险股份有限公司 | Text classification method, device and equipment based on block chain and storage medium |
CN114548261A (en) * | 2022-02-18 | 2022-05-27 | 北京百度网讯科技有限公司 | Data processing method, data processing device, electronic equipment and storage medium |
CN117009519A (en) * | 2023-07-19 | 2023-11-07 | 上交所技术有限责任公司 | Enterprise leaning industry method based on word bag model |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100235343A1 (en) * | 2009-03-13 | 2010-09-16 | Microsoft Corporation | Predicting Interestingness of Questions in Community Question Answering |
US20140317074A1 (en) * | 2013-04-23 | 2014-10-23 | Microsoft Corporation | Automatic Taxonomy Construction From Keywords |
CN104199959A (en) * | 2014-09-18 | 2014-12-10 | 浪潮软件集团有限公司 | Text classification method for Internet tax-related data |
CN104462244A (en) * | 2014-11-19 | 2015-03-25 | 武汉大学 | Smart city heterogeneous data sharing method based on meta model |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102622378A (en) * | 2011-01-30 | 2012-08-01 | 北京千橡网景科技发展有限公司 | Method and device for detecting events from text flow |
US20160253597A1 (en) * | 2015-02-27 | 2016-09-01 | Xerox Corporation | Content-aware domain adaptation for cross-domain classification |
CN105488023B (en) * | 2015-03-20 | 2019-01-11 | 广州爱九游信息技术有限公司 | A kind of text similarity appraisal procedure and device |
CN105045875B (en) * | 2015-07-17 | 2018-06-12 | 北京林业大学 | Personalized search and device |
CN107103012A (en) * | 2016-01-28 | 2017-08-29 | 阿里巴巴集团控股有限公司 | Recognize method, device and the server of violated webpage |
CN109508378B (en) * | 2018-11-26 | 2023-07-14 | 平安科技(深圳)有限公司 | Sample data processing method and device |
- 2018-11-26: CN CN201811421160.8A patent/CN109508378B/en, active Active
- 2019-05-28: WO PCT/CN2019/088803 patent/WO2020107835A1/en, active Application Filing
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020107835A1 (en) * | 2018-11-26 | 2020-06-04 | 平安科技(深圳)有限公司 | Sample data processing method and device |
CN111353050A (en) * | 2019-12-27 | 2020-06-30 | 北京合力亿捷科技股份有限公司 | Word stock construction method and tool in vertical field of telecommunication customer service |
CN111625468A (en) * | 2020-06-05 | 2020-09-04 | 中国银行股份有限公司 | Test case duplicate removal method and device |
CN111625468B (en) * | 2020-06-05 | 2024-04-16 | 中国银行股份有限公司 | Test case duplicate removal method and device |
CN113011533A (en) * | 2021-04-30 | 2021-06-22 | 平安科技(深圳)有限公司 | Text classification method and device, computer equipment and storage medium |
CN113011533B (en) * | 2021-04-30 | 2023-10-24 | 平安科技(深圳)有限公司 | Text classification method, apparatus, computer device and storage medium |
CN113779959A (en) * | 2021-08-31 | 2021-12-10 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Small sample text data mixing enhancement method |
CN117370809A (en) * | 2023-11-02 | 2024-01-09 | 快朵儿(广州)云科技有限公司 | Artificial intelligence model construction method, system and storage medium based on deep learning |
CN117370809B (en) * | 2023-11-02 | 2024-04-12 | 快朵儿(广州)云科技有限公司 | Artificial intelligence model construction method, system and storage medium based on deep learning |
Also Published As
Publication number | Publication date |
---|---|
CN109508378B (en) | 2023-07-14 |
WO2020107835A1 (en) | 2020-06-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109508378A (en) | A kind of sample data processing method and processing device | |
CN108121700A (en) | A kind of keyword extracting method, device and electronic equipment | |
CN108197109A (en) | A kind of multilingual analysis method and device based on natural language processing | |
CN103207860B (en) | The entity relation extraction method and apparatus of public sentiment event | |
CN105045875B (en) | Personalized search and device | |
CN108288067A (en) | Training method, bidirectional research method and the relevant apparatus of image text Matching Model | |
CN105243129A (en) | Commodity property characteristic word clustering method | |
CN106570144A (en) | Method and apparatus for recommending information | |
CN112148889A (en) | Recommendation list generation method and device | |
CN106202042A (en) | A kind of keyword abstraction method based on figure | |
CN106598949B (en) | A kind of determination method and device of word to text contribution degree | |
CN103116588A (en) | Method and system for personalized recommendation | |
CN108228758A (en) | A kind of file classification method and device | |
CN111316296A (en) | Structure of learning level extraction model | |
CN110134792A (en) | Text recognition method, device, electronic equipment and storage medium | |
CN107291825A (en) | With the search method and system of money commodity in a kind of video | |
CN110008309A (en) | A kind of short phrase picking method and device | |
CN106445906A (en) | Generation method and apparatus for medium-and-long phrase in domain lexicon | |
CN104199838B (en) | A kind of user model constructing method based on label disambiguation | |
CN105893362A (en) | A method for acquiring knowledge point semantic vectors and a method and a system for determining correlative knowledge points | |
CN108228612B (en) | Method and device for extracting network event keywords and emotional tendency | |
CN113722478B (en) | Multi-dimensional feature fusion similar event calculation method and system and electronic equipment | |
CN109614626A (en) | Keyword Automatic method based on gravitational model | |
CN107169061A (en) | A kind of text multi-tag sorting technique for merging double information sources | |
CN106445907A (en) | Domain lexicon generation method and apparatus |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||