CN109508378A - Sample data processing method and device - Google Patents

Sample data processing method and device

Info

Publication number
CN109508378A
CN109508378A (application CN201811421160.8A)
Authority
CN
China
Prior art keywords
keyword
words
bag
word
element set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811421160.8A
Other languages
Chinese (zh)
Other versions
CN109508378B (en)
Inventor
周涛涛
周宝
陈远旭
王健宗
肖京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201811421160.8A
Publication of CN109508378A
Priority to PCT/CN2019/088803 (WO2020107835A1)
Application granted
Publication of CN109508378B
Legal status: Active
Anticipated expiration

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An embodiment of the present application discloses a sample data processing method and device suitable for training machine learning models for short text classification. The method comprises: obtaining the word segmentation result of a short text sample after segmentation, and obtaining a keyword bag of words containing N keywords; determining a first element set according to the segmentation result and the keyword bag of words; obtaining a target word in the segmentation result, and obtaining the similarity value between the target word and each keyword in the keyword bag of words; and, when the similarity value between the target word and a first keyword in the keyword bag of words is greater than a similarity threshold, updating the first element in the first element set according to the similarity value between the target word and the first keyword, obtaining a second element set. Each element in the second element set is used to construct a machine learning model for short text classification. The embodiments of the present application can improve the performance of machine learning models built with the element set.

Description

Sample data processing method and device
Technical field
This application relates to the field of computer technology, and in particular to a sample data processing method and device.
Background technique
With the development of applications such as microblogs, social networking sites, and hotlines, more and more information is presented in the form of short text, and its volume is growing explosively. Text classification can help people obtain key information from massive data quickly and effectively, and the accuracy of text classification depends on the performance of the machine learning model, which in turn depends on the sample data.
Existing text sample data processing methods are mostly based on the keyword bag-of-words (Bag of Words) model. This approach usually performs well on long text but poorly on short text. The main reason is that, compared with long text, short text has sparse features and an indefinite topic. First, because of the length limitation of short text, it contains few feature words, while the sample data generated with a keyword bag of words is high-dimensional, which increases the difficulty of text processing. Second, in long text, words related to the topic usually appear many times, so the main content of the whole article can be judged; in short text, however, the main content cannot be judged from word frequency. For example, in the short text "inquiring about badminton-themed restaurants", "badminton" and "restaurant" have the same word frequency, yet the topic of the text is clearly "restaurant", so in text classification it should be assigned to the "catering" category rather than the "sports" category. It can be seen that existing sample data processing methods cannot represent short text well.
In summary, because short text is feature-sparse and topic-indefinite as described above, the machine learning models for text classification built from sample data obtained with existing sample data processing methods perform poorly, and text classification accuracy is low.
Summary of the invention
The embodiments of the present application provide a sample data processing method and device that can increase the amount of information in short text sample data, improve the performance of machine learning models built from such data, and thereby improve the accuracy of short text classification.
In a first aspect, an embodiment of the present application provides a sample data processing method, the method comprising:
obtaining the word segmentation result of a short text sample after segmentation, the segmentation result containing at least one word, and obtaining a keyword bag of words containing N keywords;
determining a first element set according to the segmentation result and the keyword bag of words, the first element set containing N elements, where the value of each element in the first element set is the number of times the corresponding keyword in the keyword bag of words appears in the segmentation result;
obtaining a target word in the segmentation result, and obtaining the similarity value between the target word and each keyword in the keyword bag of words, a target word being a word that exists in the segmentation result but not in the keyword bag of words;
if the similarity value between the target word and a first keyword in the keyword bag of words is greater than a similarity threshold, updating the first element corresponding to the first keyword in the first element set according to the similarity value between the target word and the first keyword, obtaining a second element set;
wherein each element in the second element set is used to construct a machine learning model for short text classification.
With reference to the first aspect, in one possible embodiment, obtaining the keyword bag of words containing N keywords comprises: obtaining a training sample set for generating keyword bags of words; determining N keywords from the training sample set according to the term frequency-inverse document frequency algorithm; and generating the keyword bag of words from the N keywords.
With reference to the first aspect, in one possible embodiment, the first element set is a first vector containing N elements, the first vector satisfying:

V_1 = [c_1, c_2, ..., c_N]

where V_1 denotes the first vector, c_n denotes the number of times the n-th keyword in the keyword bag of words appears in the segmentation result, and n is a natural number from 1 to N.
With reference to the first aspect, in one possible embodiment, obtaining the similarity value between the target word and each keyword in the keyword bag of words comprises: obtaining the word vector of the target word from a word vector database, and obtaining the word vector of each keyword in the keyword bag of words from the word vector database; and computing the similarity value between the word vector of the target word and the word vector of each keyword.
Correspondingly, if the similarity value between the target word and the first keyword in the keyword bag of words is greater than the similarity threshold, updating the first element corresponding to the first keyword in the first element set according to that similarity value to obtain the second element set comprises:
if the similarity value between the word vector of the target word and the word vector of the first keyword in the keyword bag of words is greater than the similarity threshold, computing the product of that similarity value and the number of times the target word appears in the segmentation result; and updating the first element corresponding to the first keyword in the first element set to the sum of that product and the first element, obtaining the second element set.
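As a sketch of the update in this embodiment (the threshold, names, and stub similarity function are illustrative assumptions, not the claimed implementation): each element whose keyword is similar enough to the target word is increased by the similarity value times the target word's occurrence count.

```python
def update_element_set(elements, keyword_bag, target_word, target_count, sim, threshold=0.8):
    """Second element set: for every bag keyword whose similarity to the target
    word exceeds the threshold, add sim * target_count to its element."""
    updated = list(elements)
    for idx, keyword in enumerate(keyword_bag):
        s = sim(target_word, keyword)
        if s > threshold:
            updated[idx] += s * target_count
    return updated

# stub similarity standing in for word-vector cosine similarity
pairs = {("hoop", "basketball"): 0.9}
sim = lambda a, b: pairs.get((a, b), 0.1)
print(update_element_set([0, 1, 0], ["basketball", "weather", "today"], "hoop", 1, sim))
# → [0.9, 1, 0]
```

Elements whose keywords fall below the threshold are left untouched, so only strong near-synonyms of the target word contribute information.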
With reference to the first aspect, in one possible embodiment, after the second element set is obtained, the method further comprises: obtaining the M second keywords corresponding to the M zero elements in the second element set; obtaining at least one third keyword in the keyword bag of words, a third keyword being a keyword that appears in the segmentation result at least once; and replacing the zero elements in the second element set according to the similarity values between each of the M second keywords and each third keyword in the keyword bag of words, obtaining a third element set. Each element in the third element set is used to construct a machine learning model for short text classification.
With reference to the first aspect, in one possible embodiment, replacing the zero elements in the second element set according to the similarity values between each of the M second keywords and each third keyword in the keyword bag of words to obtain the third element set comprises:
obtaining from a word vector database the word vector of each of the M second keywords, and obtaining from the word vector database the word vector of each of the at least one third keyword in the keyword bag of words; obtaining the similarity value between the word vector of any second keyword i among the M second keywords and the word vector of each of the at least one third keyword; and, if the similarity value between the word vector of the second keyword i and the word vector of a third keyword m among the at least one third keyword is the largest, replacing the zero element corresponding to the second keyword i in the second element set with the product of the similarity value between the second keyword i and the third keyword m and the number of times the third keyword m appears in the segmentation result, obtaining the third element set.
With reference to the first aspect, in one possible embodiment, the third element set is a third vector containing N elements, the third vector satisfying:

V_3 = [e_1, e_2, ..., e_N], where e_m = c_m when the m-th keyword w_m of the keyword bag of words is a third keyword (c_m >= 1), and e_i = [cos(w_m, w_i)]_max * c_m when the i-th keyword w_i is a second keyword;

here V_3 denotes the third vector, w_m denotes an m-th keyword of the keyword bag of words that is a third keyword, c_m denotes the number of times w_m appears in the segmentation result, w_i denotes an i-th keyword of the keyword bag of words that is a second keyword, and [cos(w_m, w_i)]_max indicates that the similarity value between the m-th keyword and the i-th keyword is the largest over the third keywords.
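The zero-element replacement described in this embodiment can be sketched as follows; the stub similarity values are invented for illustration and stand in for word-vector cosine similarity.

```python
def fill_zero_elements(elements, keyword_bag, sim):
    """Third element set: replace each zero element (second keyword) with
    [cos(w_m, w_i)]_max * count(w_m), where w_m ranges over the third
    keywords, i.e. bag keywords whose element (count) is at least 1."""
    third = [j for j, count in enumerate(elements) if count >= 1]
    out = list(elements)
    for i, count in enumerate(elements):
        if count == 0 and third:
            # pick the third keyword most similar to this second keyword
            m = max(third, key=lambda j: sim(keyword_bag[i], keyword_bag[j]))
            out[i] = sim(keyword_bag[i], keyword_bag[m]) * elements[m]
    return out

# stub similarity: "basketball" is closer to "weather" than to "today" (invented)
pairs = {("basketball", "weather"): 0.2, ("basketball", "today"): 0.1}
sim = lambda a, b: pairs.get((a, b), 0.0)
print(fill_zero_elements([0, 1, 1], ["basketball", "weather", "today"], sim))
# → [0.2, 1, 1]
```

Every zero element thus ends up carrying a weighted trace of the most similar keyword that actually occurred in the text, which is what increases the information content of the sparse short-text vector.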
In a second aspect, an embodiment of the present application provides a sample data processing device, the device comprising:
a first obtaining module, configured to obtain the word segmentation result of a short text sample after segmentation, the segmentation result containing at least one word, and to obtain a keyword bag of words containing N keywords;
a determining module, configured to determine a first element set according to the segmentation result and the keyword bag of words, the first element set containing N elements, where the value of each element in the first element set is the number of times the corresponding keyword in the keyword bag of words appears in the segmentation result;
a second obtaining module, configured to obtain a target word in the segmentation result and the similarity value between the target word and each keyword in the keyword bag of words, a target word being a word that exists in the segmentation result but not in the keyword bag of words;
an updating module, configured to, when the similarity value between the target word and a first keyword in the keyword bag of words is greater than a similarity threshold, update the first element corresponding to the first keyword in the first element set according to the similarity value between the target word and the first keyword, obtaining a second element set;
wherein each element in the second element set is used to construct a machine learning model for short text classification.
With reference to the second aspect, in one possible embodiment, the first obtaining module comprises a first obtaining unit and a second obtaining unit. The first obtaining unit is configured to obtain the word segmentation result of the short text sample after segmentation, the segmentation result containing at least one word. The second obtaining unit is configured to obtain the keyword bag of words containing N keywords, and is specifically configured to: obtain a training sample set for generating keyword bags of words; determine N keywords from the training sample set according to the term frequency-inverse document frequency algorithm; and generate the keyword bag of words from the N keywords.
With reference to the second aspect, in one possible embodiment, the first element set is a first vector containing N elements, the first vector satisfying:

V_1 = [c_1, c_2, ..., c_N]

where V_1 denotes the first vector, c_n denotes the number of times the n-th keyword in the keyword bag of words appears in the segmentation result, and n is a natural number from 1 to N.
With reference to the second aspect, in one possible embodiment, the second obtaining module comprises a third obtaining unit and a fourth obtaining unit. The third obtaining unit is configured to obtain the target word in the segmentation result, a target word being a word that exists in the segmentation result but not in the keyword bag of words. The fourth obtaining unit is configured to obtain the similarity value between the target word and each keyword in the keyword bag of words, and is specifically configured to: obtain the word vector of the target word from a word vector database, and obtain the word vector of each keyword in the keyword bag of words from the word vector database; and compute the similarity value between the word vector of the target word and the word vector of each keyword.
The updating module is specifically configured to: if the similarity value between the word vector of the target word and the word vector of the first keyword in the keyword bag of words is greater than the similarity threshold, compute the product of that similarity value and the number of times the target word appears in the segmentation result; and update the first element corresponding to the first keyword in the first element set to the sum of that product and the first element, obtaining the second element set.
With reference to the second aspect, in one possible embodiment, the device further comprises: a third obtaining module, configured to obtain the M second keywords corresponding to the M zero elements in the second element set; a fourth obtaining module, configured to obtain at least one third keyword in the keyword bag of words, a third keyword being a keyword that appears in the segmentation result at least once; and a replacing module, configured to replace the zero elements in the second element set according to the similarity values between each of the M second keywords and each third keyword in the keyword bag of words, obtaining a third element set. Each element in the third element set is used to construct a machine learning model for short text classification.
With reference to the second aspect, in one possible embodiment, the replacing module is specifically configured to:
obtain from a word vector database the word vector of each of the M second keywords, and obtain from the word vector database the word vector of each of the at least one third keyword in the keyword bag of words; obtain the similarity value between the word vector of any second keyword i among the M second keywords and the word vector of each of the at least one third keyword; and, if the similarity value between the word vector of the second keyword i and the word vector of a third keyword m among the at least one third keyword is the largest, replace the zero element corresponding to the second keyword i in the second element set with the product of the similarity value between the second keyword i and the third keyword m and the number of times the third keyword m appears in the segmentation result, obtaining the third element set.
With reference to the second aspect, in one possible embodiment, the third element set is a third vector containing N elements, the third vector satisfying:

V_3 = [e_1, e_2, ..., e_N], where e_m = c_m when the m-th keyword w_m of the keyword bag of words is a third keyword (c_m >= 1), and e_i = [cos(w_m, w_i)]_max * c_m when the i-th keyword w_i is a second keyword;

here V_3 denotes the third vector, w_m denotes an m-th keyword of the keyword bag of words that is a third keyword, c_m denotes the number of times w_m appears in the segmentation result, w_i denotes an i-th keyword of the keyword bag of words that is a second keyword, and [cos(w_m, w_i)]_max indicates that the similarity value between the m-th keyword and the i-th keyword is the largest over the third keywords.
In a third aspect, an embodiment of the present application provides a terminal comprising a processor and a memory connected to each other, wherein the memory is configured to store a computer program supporting the terminal in executing the above method, the computer program comprising program instructions, and the processor is configured to call the program instructions to execute the sample data processing method of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, the computer program comprising program instructions which, when executed by a processor, cause the processor to execute the sample data processing method of the first aspect.
In the embodiments of the present application, the word segmentation result of a short text sample after segmentation and a keyword bag of words containing N keywords are obtained; a first element set is determined according to the segmentation result and the keyword bag of words; a target word in the segmentation result and the similarity value between the target word and each keyword in the keyword bag of words are obtained; and, when the similarity value between the target word and a first keyword in the keyword bag of words is greater than a similarity threshold, the first element corresponding to the first keyword in the first element set is updated according to the similarity value between the target word and the first keyword, obtaining a second element set whose elements are used to construct a machine learning model for short text classification. This increases the amount of information in the element set of the short text sample, improves the performance of the machine learning model built with the element set of the short text sample, and thereby improves the accuracy of short text classification.
Brief description of the drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the accompanying drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; for a person of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic flow diagram of the sample data processing method provided by an embodiment of the present application;
Fig. 2 is a schematic flow diagram of the keyword bag-of-words generation method provided by an embodiment of the present application;
Fig. 3 is a schematic diagram of the entries returned for the keyword "basketball";
Fig. 4 is another schematic flow diagram of the sample data processing method provided by an embodiment of the present application;
Fig. 5 is a schematic block diagram of the sample data processing device provided by an embodiment of the present application;
Fig. 6 is a schematic block diagram of the terminal provided by an embodiment of the present application.
Detailed description of the embodiments
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
It should be understood that the terms "first", "second", "third", and the like in the description, claims, and drawings of the present application are used to distinguish different objects rather than to describe a particular order. In addition, the terms "include" and "have" and any variations thereof are intended to cover a non-exclusive inclusion. For example, a process, method, system, product, or device that comprises a series of steps or units is not limited to the listed steps or units, but optionally also comprises steps or units that are not listed, or optionally also comprises other steps or units inherent to the process, method, product, or device.
It should also be understood that the term "embodiment" herein means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of the phrase in various places in the description do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
It should further be understood that the term "and/or" used in the description of the present application and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes these combinations.
The sample data processing method and device provided by the embodiments of the present application are described below with reference to Fig. 1 to Fig. 6.
Referring to Fig. 1, which is a schematic flow diagram of the sample data processing method provided by an embodiment of the present application. As shown in Fig. 1, the sample data processing method may include the following steps:
S101: the terminal obtains the word segmentation result of a short text sample after segmentation, and obtains a keyword bag of words containing N keywords.
In some possible embodiments, the terminal may obtain a short text sample input by a user and segment it with a string-matching-based segmentation method or a statistics-based machine learning algorithm, obtaining a word segmentation result that contains at least one word. The terminal may obtain from a keyword bag-of-words database any keyword bag of words containing N keywords, where N may be an integer greater than 1. The keyword bag-of-words database may contain keyword bags of words of a plurality of preset categories, each keyword bag of words containing N keywords. A keyword bag of words can be used to represent the set of words of a text with word order, grammar, and syntax ignored. For example, for the short text sample "How is the weather today?", the segmentation result after segmentation is the three words "today / weather / how".
The message length of a short text sample is short, usually within 200 words, so the effective information it contains is also very limited; short text is typically sparse, real-time, and irregular. String-matching-based segmentation methods may include forward maximum matching, reverse maximum matching, bidirectional matching, and the like; statistics-based machine learning algorithms may include the hidden Markov model (HMM), conditional random fields (CRF), and the like.
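As an illustration of the string-matching segmentation methods mentioned here, the sketch below implements forward maximum matching with a hypothetical vocabulary; it is a minimal toy, not the segmenter of the embodiments (a production tool such as jieba would normally be used).

```python
def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation: at each position, take the longest
    window (up to max_len characters) found in the vocabulary; fall back to
    a single character when nothing matches."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + j]
            if j == 1 or candidate in vocab:
                words.append(candidate)
                i += j
                break
    return words

vocab = {"今天", "天气", "怎么样"}
print(forward_max_match("今天天气怎么样", vocab))  # ['今天', '天气', '怎么样']
```

The example reproduces the "today / weather / how" segmentation of the sample short text above.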
In some possible embodiments, the way the terminal obtains the keyword bag of words may refer to Fig. 2, which is a schematic flow diagram of the keyword bag-of-words generation method provided by an embodiment of the present application. As shown in Fig. 2, the keyword bag-of-words generation method may include the following steps:
S1011: obtain a training sample set for generating keyword bags of words.
In some possible embodiments, the terminal may search predetermined keywords of different categories on Baidu Zhidao and crawl the titles returned for each keyword, for example weather-category keywords such as "weather" and "light rain", or sports-category keywords such as "basketball", "baseball", and "running". The titles crawled from Baidu Zhidao may be manually classified and labeled, yielding a training sample for each category. The terminal may add the training sample of each category to a set of training samples to form the training sample set. One category corresponds to one training sample, and a training sample may contain multiple titles (samples) of the same category.
For example, Fig. 3 is a schematic diagram of the entries returned for the keyword "basketball". "Basketball" is one of the keywords of the sports category; after the keyword "basketball" is entered on Baidu Zhidao, the terminal crawls the first 50 titles returned, such as the titles shown in Fig. 3: "How large is the diameter of a basketball", "What are the specifications of a basketball?", and "What is the standard size of a basketball court?".
S1012: determine N keywords from the training sample set according to the term frequency-inverse document frequency algorithm, and generate the keyword bag of words from the N keywords.
In some possible embodiments, the terminal may use a segmentation tool (such as jieba or StanfordCoreNLP) to segment each sample (i.e., each title) in each training sample of the training sample set. For the training sample of each category, the terminal may use the term frequency-inverse document frequency (TF-IDF) algorithm to compute the TF-IDF value of each word obtained after segmenting the samples of that training sample, sort the computed TF-IDF values of the category in descending order, extract from this ranking the N words corresponding to the top N TF-IDF values, and then optionally add or delete words manually to obtain the keyword bag of words containing N keywords, N being an integer greater than 1. The terminal may maintain a preset keyword bag-of-words database for storing the keyword bags of words of at least one category, each keyword bag of words in the database containing N keywords.
The TF-IDF value equals the product of TF_w and IDF_w, which respectively satisfy:

TF_w = (number of occurrences of the entry w in the document) / (total number of entries in the document)

IDF_w = log( (total number of documents in the corpus) / (number of documents containing the entry w + 1) )

The entry w may be used to denote any word obtained after segmenting the samples of a training sample; the corpus may be used to denote the set of all training samples, i.e., the training sample set; and the number of documents may be used to denote the number of training samples, which can equal the number of categories of training samples. It should be noted that in the embodiments of the present application the total number of documents in the corpus is greater than 2, that is, the training sample set contains training samples of at least 3 categories.
For example, suppose training samples of the three categories "weather", "credit card", and "air ticket" are crawled, and N = 100. Since the keyword bags of words of all categories are generated in the same way, take the keyword bag of words of the "weather" category as an example. The terminal first segments all samples (titles) of the "weather" category with the jieba or StanfordCoreNLP segmentation tool, then computes the TF-IDF value of each word obtained after segmentation according to the TF-IDF formula, sorts the computed TF-IDF values of the "weather" category in descending order, and selects the 100 words with the largest TF-IDF values as the keyword bag of words of the "weather" category. Here the corpus in TF-IDF consists of the training samples of the three categories "weather", "credit card", and "air ticket"; the entry w is any word obtained after segmenting the training sample of the "weather" category; a document is a training sample. The total number of documents in IDF_w is 3, namely the training samples of the three categories, and the number of documents containing the entry w is 1, i.e., the training sample of the "weather" category.
Traditional keyword bags of words contain many keywords, for example 1000 keywords per bag, while short text is short (usually within 200 words), so for short text the vectors generated with a traditional keyword bag of words have too many dimensions. The terminal in the embodiments of the present application uses the TF-IDF algorithm to build a smaller keyword bag of words (e.g., 100 keywords), largely avoiding the excessive dimensionality of vectors generated from keyword bags of words.
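The per-category keyword selection of S1011 and S1012 can be sketched as follows, assuming the add-one-smoothed IDF variant of the formula; the three-category toy corpus is invented for illustration.

```python
import math
from collections import Counter

def top_n_keywords(category_tokens, corpus_docs, n):
    """Rank every word of one category's document by TF-IDF and keep the top n.

    category_tokens: all segmented words of the target category's training sample.
    corpus_docs:     one set of distinct words per category (each training
                     sample is treated as one document of the corpus).
    """
    counts = Counter(category_tokens)
    total_docs = len(corpus_docs)
    scores = {}
    for word, count in counts.items():
        tf = count / len(category_tokens)                  # term frequency
        df = sum(1 for doc in corpus_docs if word in doc)  # document frequency
        idf = math.log(total_docs / (df + 1))              # smoothed IDF
        scores[word] = tf * idf
    return sorted(scores, key=scores.get, reverse=True)[:n]

# toy corpus: one merged "document" per category
weather = ["weather", "sunny", "weather", "rain"]
credit = ["credit", "card", "repay"]
ticket = ["ticket", "weather"]
docs = [set(weather), set(credit), set(ticket)]
print(top_n_keywords(weather, docs, 2))
```

Because "weather" also occurs in the air-ticket document, its IDF is log(3/3) = 0 and it is outranked by the category-specific words, which mirrors the selectivity argument made here.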
S102: the terminal determines a first element set according to the word segmentation result and the keyword bag of words.
In some possible embodiments, the terminal may count the number of times each keyword in the keyword bag of words occurs in the word segmentation result of the short text sample, and take the number of occurrences of each keyword as an element of the first element set. The first element set may contain N elements.
In some possible embodiments, the first element set may be a primary vector containing N elements. The primary vector satisfies:

V_1 = [x_{w_1}, x_{w_2}, ..., x_{w_N}]

where V_1 denotes the primary vector and x_{w_n} denotes the number of times the n-th keyword in the keyword bag of words occurs in the word segmentation result of the short text sample; n ranges over the natural numbers from 1 to N.
For example, suppose the keyword bag of words contains 5 keywords, namely "basketball / weather / football / running / today"; the primary vector V_1 then contains 5 elements. Suppose the second keyword "weather" and the fifth keyword "today" each occur once in the word segmentation result of the short text sample; the terminal sets the second and fifth elements of V_1 to 1. The other keywords of the bag (the first keyword "basketball", the third keyword "football", and the fourth keyword "running") occur 0 times in the word segmentation result, so the terminal sets the first, third, and fourth elements of V_1 to 0; thus V_1 = [0, 1, 0, 0, 1].
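The counting step that produces V_1 can be sketched in a few lines. This is a minimal illustration; the function name and the English token list standing in for the segmented sample are assumptions.

```python
from collections import Counter

def first_element_set(keyword_bag, tokens):
    """Count how often each keyword of the bag occurs in the segmentation result."""
    counts = Counter(tokens)
    return [counts[k] for k in keyword_bag]

bag = ["basketball", "weather", "football", "running", "today"]
tokens = ["weather", "today", "is", "nice"]  # hypothetical segmentation result
v1 = first_element_set(bag, tokens)
# v1 == [0, 1, 0, 0, 1], matching the example above
```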
S103: the terminal obtains the target words in the word segmentation result, and obtains the similarity value between each target word and each keyword in the keyword bag of words.
In some possible embodiments, the terminal may search the word segmentation result of the short text sample for target words. If at least one target word exists in the word segmentation result, the terminal may take any target word in the result and compute the similarity value (such as a cosine value or a Euclidean distance) between that target word and each keyword in the keyword bag of words. If no target word exists in the word segmentation result, the terminal may take the first element set as the second element set. A target word is a word that exists in the word segmentation result but not in the keyword bag of words.
For example, suppose the word segmentation result of a short text sample contains 5 words, namely C1, C2, C3, C4, and C5. The terminal checks, for each of C1, C2, C3, C4, and C5, whether an identical keyword exists in the keyword bag of words. If the bag contains no keyword identical to C2 or C5, the terminal takes C2 and C5 as target words. The terminal then takes any target word among C2 and C5, say C2, and computes the similarity value between target word C2 and each keyword in the keyword bag of words.
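The target-word lookup from this example can be sketched as follows. This is a minimal illustration under the assumption that the segmentation result is an ordered token list; duplicate target words are collapsed.

```python
def find_target_words(tokens, keyword_bag):
    """Return the words of the segmentation result that are absent from the bag."""
    bag = set(keyword_bag)
    seen = set()
    targets = []
    for w in tokens:
        if w not in bag and w not in seen:  # a target word, not yet recorded
            seen.add(w)
            targets.append(w)
    return targets

tokens = ["C1", "C2", "C3", "C4", "C5"]
bag = ["C1", "C3", "C4"]
# find_target_words(tokens, bag) == ["C2", "C5"]
```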
S104: if the similarity value between a target word and a first keyword in the keyword bag of words is greater than a similarity threshold, the terminal updates, according to the similarity value between the target word and the first keyword, the first element of the first element set corresponding to the first keyword, obtaining a second element set.
In some possible embodiments, after obtaining the similarity value between the target word and each keyword in the keyword bag of words, the terminal may check whether any of these similarity values exceeds a preset similarity threshold. If the similarity value between the target word and a first keyword in the keyword bag exceeds the threshold, the target word and the first keyword are synonyms, and the terminal may use that similarity value to update the first element of the first element set corresponding to the first keyword, obtaining the second element set. If the similarity values between the target word and all keywords in the bag are less than or equal to the threshold, the keyword bag contains no synonym of the target word, and the terminal may obtain the similarity values between the next target word in the word segmentation result and each keyword in the bag. If, for every target word in the word segmentation result, all similarity values with the keywords of the bag are less than or equal to the threshold, the keyword bag contains no synonym of any target word, and the terminal may take the first element set as the second element set. Each element of the second element set can be used to construct a machine learning model for short text classification.
Because the keyword bag of words in this embodiment of the present application contains relatively few keywords, part of the information of the short text sample is lost when the first element set is determined from the keyword bag and the word segmentation result of the short text sample: some words of the word segmentation result do not appear in the keyword bag, i.e., the short text sample contains target words. And because each element of the first element set is the number of times a keyword of the bag occurs in the word segmentation result, the target words of the word segmentation result are not reflected in the first element set. This embodiment therefore uses the similarity values between a target word and the keywords of the bag to find the target word's synonym in the keyword bag, and updates the element of the first element set corresponding to that synonym, obtaining the second element set. This increases the amount of information in the first element set and compensates for the information lost by shrinking the keyword bag; a conventional machine learning model for short text classification built from the second element set then performs better, making short text classification more accurate.
In this embodiment of the present application, the terminal obtains the word segmentation result of a short text sample and a keyword bag of words containing N keywords, determines a first element set from the word segmentation result and the keyword bag, obtains the target words in the word segmentation result and the similarity value between each target word and each keyword in the bag, and, when the similarity value between a target word and a first keyword in the bag exceeds a similarity threshold, updates the first element of the first element set corresponding to the first keyword according to that similarity value, obtaining a second element set whose elements are used to construct a machine learning model for short text classification. This increases the amount of information in the element set of the short text sample, improves the performance of the machine learning model built from that element set, and thus improves the accuracy of short text classification.
Referring to Fig. 4, which is another schematic flow diagram of the sample data processing method provided by an embodiment of the present application. As shown in Fig. 4, the method may include the following steps:
S401: the terminal obtains the word segmentation result obtained after a short text sample is segmented, and obtains a keyword bag of words containing N keywords.
S402: the terminal determines a first element set according to the word segmentation result and the keyword bag of words.
In some possible embodiments, steps S401-S402 of this embodiment may be implemented as steps S101-S102 of the embodiment shown in Fig. 1, which are not repeated here.
S403: the terminal obtains the target words in the word segmentation result.
S404: the terminal obtains the word vector of a target word from a word vector database, and obtains the word vector of each keyword in the keyword bag of words from the word vector database.
S405: the terminal computes the similarity value between the word vector of the target word and the word vector of each keyword.
S406: if the similarity value between the word vector of the target word and the word vector of a first keyword in the keyword bag of words is greater than a similarity threshold, the terminal computes the product of that similarity value and the number of times the target word occurs in the word segmentation result.
S407: the terminal updates the first element of the first element set corresponding to the first keyword to the sum of the product and the first element, obtaining a second element set.
In some possible embodiments, the terminal may search the word segmentation result of the short text sample for target words. If at least one target word exists in the word segmentation result, the terminal may take any target word in the result, obtain its word vector from a word vector database, and obtain the word vector of each keyword of the keyword bag of words from the same word vector database. The terminal may compute the similarity value (such as a cosine value or a Euclidean distance) between the word vector of the target word and the word vector of each keyword, and check whether any of these similarity values exceeds a preset similarity threshold. If the similarity value between the word vector of the target word and the word vector of a first keyword in the bag exceeds the threshold, the target word and the first keyword are synonyms; the terminal then computes the product of that similarity value and the number of times the target word occurs in the word segmentation result of the short text sample, and updates the first element of the first element set corresponding to the first keyword to the sum of the product and the first element, obtaining the second element set. A target word is a word that exists in the word segmentation result but not in the keyword bag of words.
In some possible embodiments, if no target word exists in the word segmentation result, the terminal may take the first element set as the second element set. Alternatively, if the similarity values between the word vector of the target word and the word vectors of all keywords in the bag are less than or equal to the similarity threshold, the keyword bag contains no synonym of the target word, and the terminal may obtain the similarity values between the word vector of the next target word in the word segmentation result and the word vectors of the keywords in the bag. If, for every target word in the word segmentation result, all similarity values between its word vector and the word vectors of the keywords are less than or equal to the threshold, the keyword bag contains no synonym of any target word, and the terminal may take the first element set as the second element set.
In some possible embodiments, the word vector database may be generated as follows: (1) the terminal crawls a large-scale unlabeled corpus (roughly 10 GB) from Wikipedia, segments the unlabeled corpus, and feeds the segmented corpus into a continuous bag-of-words (CBOW) model for training; (2) after the CBOW model is trained, the terminal may obtain the word vector of every word of the corpus from the output of the CBOW model and store the word vectors of all words of the corpus in the word vector database.
In some possible embodiments, the second element set may be a secondary vector containing N elements. The secondary vector satisfies:

V_2 = [x_{w_1}, ..., x_{w_k} + cos(w_k, w_j) * x_{w_j}, ..., x_{w_N}]

where V_2 denotes the secondary vector, w_j denotes a target word in the word segmentation result of the short text sample, w_k denotes the k-th keyword in the keyword bag of words, i.e., the first keyword, and cos(w_k, w_j) denotes the cosine value between the word vector of the target word w_j and the word vector of the first keyword w_k in the keyword bag; cos(w_k, w_j) is greater than the preset similarity threshold, e.g., 0.7, i.e., w_j and w_k are synonyms. The terminal updates the first element x_{w_k} of the primary vector V_1 to x_{w_k} + cos(w_k, w_j) * x_{w_j}, where x_{w_j} denotes the number of times the target word occurs in the word segmentation result of the short text sample and x_{w_k} denotes the number of times the first keyword of the keyword bag occurs in the word segmentation result of the short text sample.
For example, suppose the word segmentation result of a short text sample contains 2 target words, namely w_j1 and w_j2, the keyword bag of words contains 10 keywords, namely w_1, w_2, w_3, ..., w_10, and the preset similarity threshold is 0.7. The terminal computes the cosine value between the word vector of target word w_j1 and the word vector of each of the 10 keywords w_1, w_2, w_3, ..., w_10. Suppose the cosine value between the word vectors of w_j1 and the fourth keyword w_4 is cos(w_4, w_j1) = 0.8, greater than the preset similarity threshold 0.7; the terminal computes the product cos(w_4, w_j1) * x_{w_j1} of the cosine value and the number of occurrences x_{w_j1} of w_j1 in the word segmentation result, and updates the fourth element x_{w_4} of the primary vector V_1 to the sum x_{w_4} + 0.8 * x_{w_j1} of the fourth element and the product. The terminal likewise computes the cosine value between the word vector of target word w_j2 and the word vector of each of the 10 keywords w_1, w_2, w_3, ..., w_10. Suppose the cosine values between the word vector of w_j2 and the word vectors of the third keyword w_3, the sixth keyword w_6, and the seventh keyword w_7 all exceed the similarity threshold 0.7, i.e., cos(w_3, w_j2), cos(w_6, w_j2), and cos(w_7, w_j2) are all greater than 0.7; the terminal then updates the third element x_{w_3}, the sixth element x_{w_6}, and the seventh element x_{w_7} of V_1 in the same way.
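Steps S403-S407 as a whole can be sketched as follows. The cosine helper, function names, and toy two-dimensional vectors are illustrative assumptions; in the embodiment the vectors would come from the word vector database.

```python
import math

def cosine(a, b):
    """Cosine similarity between two word vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def second_element_set(v1, keyword_bag, tokens, vectors, threshold=0.7):
    """Fold each target word into V1 via its above-threshold keyword synonyms."""
    v2 = list(v1)
    targets = set(tokens) - set(keyword_bag)      # words absent from the bag
    for wj in targets:
        count = tokens.count(wj)                  # x_{w_j}: occurrences of the target word
        for k, wk in enumerate(keyword_bag):
            sim = cosine(vectors[wj], vectors[wk])
            if sim > threshold:                   # wj and wk are treated as synonyms
                v2[k] += sim * count              # x_{w_k} + cos(w_k, w_j) * x_{w_j}
    return v2
```

With toy vectors in which "sunny" coincides with the keyword "weather", the element for "weather" absorbs the count of the target word "sunny".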
S408: the terminal obtains the M second keywords corresponding to the M zero elements of the second element set.
In some possible embodiments, the terminal may check whether the second element set contains zero elements. If the second element set contains zero elements, the terminal may obtain the M zero elements of the second element set and take the keyword corresponding to each zero element as a second keyword, obtaining M second keywords. If the second element set contains no zero element, the terminal may take the second element set as the third element set. M may be an integer greater than or equal to 1. A zero element of the second element set indicates that some keyword of the keyword bag of words occurs 0 times in the word segmentation result and that the synonyms of that keyword also occur 0 times there, i.e., neither the second keyword nor any synonym of the second keyword appears in the word segmentation result.
In some possible embodiments, the terminal may take any keyword from the keyword bag of words and check whether every keyword of the bag has already been taken. For a keyword of the bag that has not yet been taken, the terminal checks whether the number of times that keyword occurs in the word segmentation result is 0; if so, it computes the similarity value between the keyword and each word of the word segmentation result, and when all of these similarity values are less than or equal to the preset similarity threshold, the word segmentation result contains no synonym of the keyword and the keyword may be taken as a second keyword. If no second keyword exists in the keyword bag of words, the second element set may be taken as the third element set.
S409: the terminal obtains at least one third keyword from the keyword bag of words.
S410: the terminal replaces the zero elements of the second element set according to the similarity value between each of the M second keywords and each third keyword of the keyword bag of words, obtaining a third element set.
In some possible embodiments, the terminal may obtain at least one third keyword from the keyword bag of words, where a third keyword occurs at least once in the word segmentation result. The terminal may compute the similarity value (such as a cosine value or a Euclidean distance) between each of the M second keywords and each third keyword of the keyword bag, replace the zero elements of the second element set according to these similarity values, and obtain a third element set whose elements are used to construct a machine learning model for short text classification. The basic principle of a machine learning model is to multiply each element of the element set by a weight and finally output the probabilities that the element set belongs to the different categories. If the element set contains a zero element, the product of that zero element and any weight is still 0, and the output probability of the machine learning model may then be 0; a machine learning model for short text classification built from an element set containing zero elements is therefore not very accurate. By replacing the zero elements of the second element set, this embodiment ensures that the third element set contains no zero element, which improves the accuracy of the machine learning model while preserving the information content of the element set, and thus improves the accuracy of short text classification.
In some possible embodiments, after obtaining the at least one third keyword, the terminal may obtain from the word vector database the word vector of each of the M second keywords and the word vector of each third keyword of the at least one third keyword of the keyword bag of words. For any second keyword i among the M second keywords, the terminal may obtain the similarity value (such as a cosine value or a Euclidean distance) between the word vector of second keyword i and the word vector of each third keyword. If the similarity value between the word vector of second keyword i and the word vector of third keyword m is the largest among the at least one third keyword, the terminal may obtain the number of times third keyword m occurs in the word segmentation result, compute the product of that number and the similarity value between the word vectors of second keyword i and third keyword m, and replace the zero element of the second element set corresponding to second keyword i with the product. After the terminal has replaced all M zero elements corresponding to the M second keywords of the second element set, a third element set is obtained, each element of which is used to construct a machine learning model for short text classification.
In some possible embodiments, the third element set may be a third vector containing N elements. The third vector satisfies:

V_3 = [..., [cos(w_m, w_i)]_max * x_{w_m}, ...]

where V_3 denotes the third vector, w_m denotes the m-th keyword in the keyword bag of words, i.e., a third keyword, x_{w_m} denotes the number of times the m-th keyword of the keyword bag occurs in the word segmentation result of the short text sample, with x_{w_m} greater than or equal to 1, w_i denotes the i-th keyword in the keyword bag of words, i.e., a second keyword, and [cos(w_m, w_i)]_max indicates that the similarity value between the m-th keyword and the i-th keyword of the keyword bag of words is the largest.
For example, suppose the fifth element of the second element set is 0 and the keyword corresponding to the fifth element is w_5 (a second keyword). Suppose the keyword bag of words contains 3 third keywords, namely w_3, w_6, and w_12. The terminal computes the cosine value between the word vector of the second keyword w_5 and the word vector of each third keyword w_3, w_6, and w_12. Suppose cos(w_3, w_5) = 0.3, cos(w_6, w_5) = 0.7, and cos(w_12, w_5) = 0.5. Since the cosine value between the second keyword w_5 and the third keyword w_6 is the largest, the terminal obtains the number x_{w_6} of times the third keyword w_6 occurs in the word segmentation result, computes the product cos(w_6, w_5) * x_{w_6} = 0.7 * x_{w_6} of the cosine value and that number, and replaces the fifth element x_{w_5} of the second element set with the product 0.7 * x_{w_6}.
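The zero-element replacement of steps S408-S410 can be sketched as follows. It assumes V2 still holds raw occurrence counts, so nonzero entries identify the third keywords; the cosine helper and toy vectors are illustrative.

```python
import math

def cosine(a, b):
    """Cosine similarity between two word vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def third_element_set(v2, keyword_bag, vectors):
    """Replace every 0 element with [cos(w_m, w_i)]_max * x_{w_m}."""
    v3 = list(v2)
    third = [m for m, x in enumerate(v2) if x > 0]   # third keywords: occur at least once
    for i, x in enumerate(v2):
        if x == 0 and third:
            # third keyword whose word vector is closest to second keyword w_i
            m = max(third, key=lambda j: cosine(vectors[keyword_bag[j]], vectors[keyword_bag[i]]))
            v3[i] = cosine(vectors[keyword_bag[m]], vectors[keyword_bag[i]]) * v2[m]
    return v3
```

With toy vectors in which w_6 is closest to w_5 and occurs twice, the zero element for w_5 becomes cos(w_6, w_5) times 2.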
In this embodiment of the present application, the terminal computes the similarity value between the word vector of each target word in the word segmentation result of a short text sample and the word vector of each keyword in the keyword bag of words, and updates the first element set accordingly to obtain a second element set. It then replaces the zero elements of the second element set according to the similarity values between the word vectors of the second keywords of the keyword bag and the word vectors of the third keywords of the keyword bag, obtaining a third element set whose elements are used to construct a machine learning model for short text classification. This improves the accuracy of the machine learning model while preserving the information content of the element set, and thus improves the accuracy of short text classification.
Referring to Fig. 5, which is a schematic block diagram of the sample data processing apparatus provided by an embodiment of the present application. As shown in Fig. 5, the sample data processing apparatus in this embodiment includes:
a first obtaining module 10, configured to obtain the word segmentation result obtained after a short text sample is segmented, and to obtain a keyword bag of words containing N keywords, the word segmentation result containing at least one word;
a determining module 20, configured to determine a first element set according to the word segmentation result and the keyword bag of words, the first element set containing N elements, the value of each element of the first element set being the number of times a keyword of the keyword bag of words occurs in the word segmentation result;
a second obtaining module 30, configured to obtain the target words in the word segmentation result and to obtain the similarity value between each target word and each keyword in the keyword bag of words, a target word being a word that exists in the word segmentation result but not in the keyword bag of words;
an update module 40, configured to, when the similarity value between a target word and a first keyword in the keyword bag of words is greater than a similarity threshold, update the first element of the first element set corresponding to the first keyword according to the similarity value between the target word and the first keyword, obtaining a second element set;
wherein each element of the second element set is used to construct a machine learning model for short text classification.
In some possible embodiments, the first obtaining module 10 includes a first obtaining unit 101 and a second obtaining unit 102. The first obtaining unit 101 is configured to obtain the word segmentation result obtained after a short text sample is segmented, the word segmentation result containing at least one word; the second obtaining unit 102 is configured to obtain a keyword bag of words containing N keywords.
The second obtaining unit 102 is specifically configured to: obtain a training sample set for generating the keyword bag of words; determine N keywords from the training sample set according to the term frequency-inverse document frequency (TF-IDF) algorithm; and generate the keyword bag of words from the N keywords.
In some possible embodiments, the first element set is a primary vector containing N elements, and the primary vector satisfies:

V_1 = [x_{w_1}, x_{w_2}, ..., x_{w_N}]

where V_1 denotes the primary vector and x_{w_n} denotes the number of times the n-th keyword in the keyword bag of words occurs in the word segmentation result; n ranges over the natural numbers from 1 to N.
In some possible embodiments, the second obtaining module 30 includes a third obtaining unit 301 and a fourth obtaining unit 302. The third obtaining unit 301 is configured to obtain the target words in the word segmentation result, a target word being a word that exists in the word segmentation result but not in the keyword bag of words; the fourth obtaining unit 302 is configured to obtain the similarity value between each target word and each keyword in the keyword bag of words.
The fourth obtaining unit 302 is specifically configured to: obtain the word vector of the target word from a word vector database, and obtain the word vector of each keyword in the keyword bag of words from the word vector database; and compute the similarity value between the word vector of the target word and the word vector of each keyword. The update module 40 is specifically configured to: if the similarity value between the word vector of the target word and the word vector of a first keyword in the keyword bag of words is greater than the similarity threshold, compute the product of that similarity value and the number of times the target word occurs in the word segmentation result; and update the first element of the first element set corresponding to the first keyword to the sum of the product and the first element, obtaining the second element set.
In some possible embodiments, the apparatus further includes a third obtaining module 50, a fourth obtaining module 60, and a replacement module 70.
The third obtaining module 50 is configured to obtain the M second keywords corresponding to the M zero elements of the second element set; the fourth obtaining module 60 is configured to obtain at least one third keyword from the keyword bag of words, a third keyword occurring at least once in the word segmentation result; the replacement module 70 is configured to replace the zero elements of the second element set according to the similarity value between each of the M second keywords and each third keyword of the keyword bag of words, obtaining a third element set. Each element of the third element set is used to construct a machine learning model for short text classification.
In some possible embodiments, the replacement module 70 is specifically configured to:
obtain from a word vector database the word vector of each of the M second keywords, and obtain from the word vector database the word vector of each third keyword of the at least one third keyword in the keyword bag of words; obtain the similarity value between the word vector of any second keyword i among the M second keywords and the word vector of each third keyword of the at least one third keyword; and, if the similarity value between the word vector of second keyword i and the word vector of third keyword m among the at least one third keyword is the largest, replace the zero element of the second element set corresponding to second keyword i with the product of the similarity value between second keyword i and third keyword m and the number of times third keyword m occurs in the word segmentation result, obtaining the third element set.
In some possible embodiments, the third element set is a third vector containing N elements, and the third vector satisfies:

V_3 = [..., [cos(w_m, w_i)]_max * x_{w_m}, ...]

where V_3 denotes the third vector, w_m denotes the m-th keyword in the keyword bag of words, i.e., a third keyword, x_{w_m} denotes the number of times the m-th keyword of the keyword bag occurs in the word segmentation result, w_i denotes the i-th keyword in the keyword bag of words, i.e., a second keyword, and [cos(w_m, w_i)]_max indicates that the similarity value between the m-th keyword and the i-th keyword of the keyword bag of words is the largest.
In specific implementations, the sample data processing apparatus may, through the above modules, execute the implementations provided for the steps of the implementations shown in Fig. 1, Fig. 2, or Fig. 4, realizing the functions realized in the above embodiments; for details, refer to the corresponding descriptions of the steps of the method embodiments shown in Fig. 1, Fig. 2, or Fig. 4, which are not repeated here.
In this embodiment of the present application, the sample data processing apparatus obtains the word segmentation result of a short text sample and a keyword bag of words containing N keywords, determines a first element set from the word segmentation result and the keyword bag, obtains the target words in the word segmentation result and the similarity value between each target word and each keyword in the bag, and, when the similarity value between a target word and a first keyword in the bag exceeds a similarity threshold, updates the first element of the first element set corresponding to the first keyword according to that similarity value, obtaining a second element set whose elements are used to construct a machine learning model for short text classification. This increases the amount of information in the element set of the short text sample, improves the performance of the machine learning model built from that element set, and thus improves the accuracy of short text classification.
Referring to Fig. 6, which is a schematic block diagram of the terminal provided by an embodiment of the present application. As shown in Fig. 6, the terminal in this embodiment may include one or more processors 601 and a memory 602, connected by a bus 603. The memory 602 is configured to store a computer program comprising program instructions, and the processor 601 is configured to execute the program instructions stored in the memory 602. The processor 601 is configured to invoke the program instructions to:
obtain the word segmentation result obtained after a short text sample is segmented, and obtain a keyword bag of words containing N keywords, the word segmentation result containing at least one word;
determine a first element set according to the word segmentation result and the keyword bag of words, the first element set containing N elements, the value of each element of the first element set being the number of times a keyword of the keyword bag of words occurs in the word segmentation result;
obtain the target words in the word segmentation result, and obtain the similarity value between each target word and each keyword in the keyword bag of words, a target word being a word that exists in the word segmentation result but not in the keyword bag of words;
If the similarity value of the first keyword in the target word and the keyword bag of words is greater than similarity threshold, basis It is first yuan corresponding that the target word with the similarity value of first keyword updates first keyword in first element set Element obtains second element set;
Wherein, each element in the second element set is used to construct the machine learning model for short text classification.
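As a rough illustration only — the embodiments do not prescribe any programming language, and the function names, toy similarity function, and threshold below are inventions of this sketch — the construction of the first and second element sets described above might look like:

```python
# Illustrative sketch of the claimed pipeline; all names and the threshold
# value are assumptions, not part of the embodiments.
from typing import Callable, List

def build_first_element_set(tokens: List[str], keyword_bag: List[str]) -> List[float]:
    # First element set: the n-th element counts how often the n-th keyword
    # of the keyword bag of words occurs in the word segmentation result.
    return [float(tokens.count(kw)) for kw in keyword_bag]

def update_with_target_words(tokens: List[str],
                             keyword_bag: List[str],
                             first_set: List[float],
                             similarity: Callable[[str, str], float],
                             threshold: float = 0.5) -> List[float]:
    # Target words exist in the segmentation result but not in the keyword bag.
    # For each target word, every keyword whose similarity value exceeds the
    # threshold has its element increased by similarity * occurrence count,
    # yielding the second element set.
    second_set = list(first_set)
    target_words = {t for t in tokens if t not in keyword_bag}
    for tw in target_words:
        for i, kw in enumerate(keyword_bag):
            sim = similarity(tw, kw)
            if sim > threshold:
                second_set[i] += sim * tokens.count(tw)
    return second_set
```

Because "dog" is absent from the keyword bag but similar to "puppy", its count contributes similarity-weighted mass to the "puppy" element rather than being discarded, which is the information gain the embodiments describe.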
It should be understood that, in the embodiments of the present application, the processor 601 may be a central processing unit (Central Processing Unit, CPU); the processor may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory 602 may include read-only memory and random access memory, and provides instructions and data to the processor 601. A part of the memory 602 may also include non-volatile random access memory. For example, the memory 602 may also store information about the device type.
In a specific implementation, the processor 601 described in the embodiments of the present application may execute the implementation of the sample data processing method provided by the embodiments of the present application, may also execute the implementation of the keyword bag-of-words generation method provided by the embodiments of the present application, and may also execute the implementation of the sample data processing apparatus described in the embodiments of the present application; details are not repeated here.
The embodiments of the present application also provide a computer-readable storage medium storing a computer program, the computer program comprising program instructions. When executed by a processor, the program instructions implement the sample data processing method shown in Fig. 1 or Fig. 4, or the keyword bag-of-words generation method shown in Fig. 2. For details, please refer to the descriptions of the embodiments shown in Fig. 1, Fig. 2, or Fig. 4, which are not repeated here.
The above computer-readable storage medium may be an internal storage unit of the sample data processing apparatus or of the electronic device described in any of the foregoing embodiments, such as a hard disk or memory of the electronic device. The computer-readable storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the electronic device. Further, the computer-readable storage medium may include both an internal storage unit of the electronic device and an external storage device. The computer-readable storage medium is used to store the computer program and other programs and data required by the electronic device, and may also be used to temporarily store data that has been output or is to be output.
Those of ordinary skill in the art may appreciate that the units and algorithm steps described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present application.
The present application is described with reference to flowcharts and/or block diagrams of the methods, apparatuses, and computer program products of the embodiments of the present application. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing equipment produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing equipment to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus, the instruction apparatus implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing equipment, such that a series of operational steps are performed on the computer or other programmable equipment to produce computer-implemented processing, so that the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although the present application has been described in conjunction with specific features and embodiments thereof, it is apparent that various modifications and combinations can be made without departing from the spirit and scope of the present application. Accordingly, the specification and drawings are merely exemplary illustrations of the present application as defined by the appended claims, and are deemed to cover any and all modifications, variations, combinations, or equivalents within the scope of the present application. Obviously, those skilled in the art can make various modifications and variations to the present application without departing from its spirit and scope. Thus, if these modifications and variations of the present application fall within the scope of the claims of the present application and their technical equivalents, the present application is also intended to include them.

Claims (10)

1. A sample data processing method, characterized by comprising:
obtaining the word segmentation result produced by segmenting a short text sample, and obtaining a keyword bag of words containing N keywords, the word segmentation result containing at least one word;
determining a first element set according to the word segmentation result and the keyword bag of words, the first element set containing N elements, where the value of each element in the first element set is the number of times the corresponding keyword in the keyword bag of words occurs in the word segmentation result;
obtaining a target word in the word segmentation result, and obtaining the similarity value between the target word and each keyword in the keyword bag of words, the target word being a word that exists in the word segmentation result but not in the keyword bag of words;
if the similarity value between the target word and a first keyword in the keyword bag of words is greater than a similarity threshold, updating the first element corresponding to the first keyword in the first element set according to the similarity value between the target word and the first keyword, to obtain a second element set;
wherein each element in the second element set is used to construct a machine learning model for short text classification.
2. The method according to claim 1, characterized in that obtaining the keyword bag of words containing N keywords comprises:
obtaining a training sample set used to generate the keyword bag of words;
determining the N keywords from the training sample set according to a term frequency-inverse document frequency (TF-IDF) algorithm, and generating the keyword bag of words from the N keywords.
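For illustration, a minimal TF-IDF keyword selection of the kind this claim describes could be sketched as follows. The scoring details below (per-document maximum, add-one smoothing) are assumptions of the sketch — the claim only names the term frequency-inverse document frequency algorithm:

```python
# Hand-rolled TF-IDF keyword selection in the spirit of claim 2; the exact
# TF-IDF variant used here is an assumption, not taken from the patent.
import math
from collections import Counter
from typing import List

def top_n_keywords(corpus: List[List[str]], n: int) -> List[str]:
    # Document frequency of each term over the tokenized training sample set.
    df = Counter()
    for doc in corpus:
        df.update(set(doc))
    num_docs = len(corpus)
    # Score each term by its best TF-IDF over all documents.
    scores = Counter()
    for doc in corpus:
        tf = Counter(doc)
        for term, count in tf.items():
            tfidf = (count / len(doc)) * math.log((1 + num_docs) / (1 + df[term]))
            scores[term] = max(scores[term], tfidf)
    # The N highest-scoring terms form the keyword bag of words.
    return [term for term, _ in scores.most_common(n)]
```

Terms that appear in every document receive an IDF of zero and are pushed out of the keyword bag, which matches TF-IDF's purpose of favoring discriminative terms.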
3. The method according to claim 1 or 2, characterized in that the first element set is a first vector containing N elements, and the first vector satisfies:
V1 = (c_{w1}, c_{w2}, ..., c_{wN})
wherein V1 denotes the first vector, c_{wn} denotes the number of times the n-th keyword in the keyword bag of words occurs in the word segmentation result, and the value range of n is the natural numbers from 1 to N.
4. The method according to any one of claims 1-3, characterized in that obtaining the similarity value between the target word and each keyword in the keyword bag of words comprises:
obtaining the word vector of the target word from a word vector database, and obtaining the word vector of each keyword in the keyword bag of words from the word vector database;
calculating the similarity value between the word vector of the target word and the word vector of each keyword;
and in that, if the similarity value between the target word and the first keyword in the keyword bag of words is greater than the similarity threshold, updating the first element corresponding to the first keyword in the first element set according to the similarity value between the target word and the first keyword to obtain the second element set comprises:
if the similarity value between the word vector of the target word and the word vector of the first keyword in the keyword bag of words is greater than the similarity threshold, calculating the product of the similarity value and the number of times the target word occurs in the word segmentation result;
updating the first element corresponding to the first keyword in the first element set to the sum of the product and the first element, to obtain the second element set.
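A minimal sketch of this per-keyword update — a similarity value between word vectors, then adding similarity × occurrence-count to the first element. Cosine similarity is assumed as the similarity measure (matching the [cos(...)] notation used later in claim 7), and the function names are illustrative:

```python
# Cosine similarity between word vectors plus the claim 4 update rule;
# the names and the choice of cosine similarity are assumptions.
import math
from typing import List

def cosine(u: List[float], v: List[float]) -> float:
    # Similarity value between two word vectors from a word vector database.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def updated_element(first_element: float, sim: float, target_count: int) -> float:
    # New element = old element + (similarity value * number of times the
    # target word occurs in the word segmentation result).
    return first_element + sim * target_count
```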
5. The method according to any one of claims 1-4, characterized in that, after obtaining the second element set, the method further comprises:
obtaining the M second keywords corresponding to M zero-valued elements in the second element set;
obtaining at least one third keyword in the keyword bag of words, the number of times the third keyword occurs in the word segmentation result being greater than or equal to 1;
replacing the zero-valued elements in the second element set according to the similarity value between each of the M second keywords and each third keyword in the keyword bag of words, to obtain a third element set;
wherein each element in the third element set is used to construct the machine learning model for short text classification.
6. The method according to claim 5, characterized in that replacing the zero-valued elements in the second element set according to the similarity value between each of the M second keywords and each third keyword in the keyword bag of words, to obtain the third element set, comprises:
obtaining the word vector of each of the M second keywords from the word vector database, and obtaining the word vector of each of the at least one third keyword in the keyword bag of words from the word vector database;
obtaining the similarity value between the word vector of any second keyword i among the M second keywords and the word vector of each of the at least one third keyword;
if the similarity value between the word vector of the second keyword i and the word vector of a third keyword m among the at least one third keyword is the largest, replacing the zero-valued element corresponding to the second keyword i in the second element set with the product of the similarity value between the second keyword i and the third keyword m and the number of times the third keyword m occurs in the word segmentation result, to obtain the third element set.
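The zero-element replacement of claims 5-6 can be sketched as below. Names are hypothetical, and the similarity callback stands in for the word-vector cosine similarity obtained from the word vector database:

```python
# Zero-element replacement: each keyword whose element is 0 (a "second
# keyword") borrows mass from the most similar keyword that does occur in
# the segmentation result (a "third keyword").
from typing import Callable, List

def fill_zero_elements(second_set: List[float],
                       keyword_bag: List[str],
                       tokens: List[str],
                       similarity: Callable[[str, str], float]) -> List[float]:
    third_set = list(second_set)
    # Third keywords: keywords occurring at least once in the segmentation result.
    thirds = [(kw, tokens.count(kw)) for kw in keyword_bag if tokens.count(kw) >= 1]
    for i, kw in enumerate(keyword_bag):
        # Second keywords are those whose element in the second set is 0.
        if second_set[i] == 0 and thirds:
            # Pick the third keyword with the largest similarity value and
            # replace the 0 with that similarity * its occurrence count.
            best_kw, best_count = max(thirds, key=lambda t: similarity(kw, t[0]))
            third_set[i] = similarity(kw, best_kw) * best_count
    return third_set
```

This removes zeros from the feature vector of a short text, which is the sparsity problem the third element set is meant to mitigate.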
7. The method according to claim 6, characterized in that the third element set is a third vector containing N elements, and, for each element corresponding to a second keyword, the third vector satisfies:
v_i = [cos(w_m, w_i)]_max · c_{wm}
wherein V3 denotes the third vector, w_m denotes the m-th keyword in the keyword bag of words, which is a third keyword, c_{wm} denotes the number of times the m-th keyword occurs in the word segmentation result, with c_{wm} ≥ 1, w_i denotes the i-th keyword in the keyword bag of words, which is a second keyword, and [cos(w_m, w_i)]_max indicates that the similarity value between the m-th keyword and the i-th keyword in the keyword bag of words is the largest; the remaining elements of V3 are taken from the second element set.
8. A sample data processing apparatus, characterized by comprising:
a first obtaining module, configured to obtain the word segmentation result produced by segmenting a short text sample, and to obtain a keyword bag of words containing N keywords, the word segmentation result containing at least one word;
a determining module, configured to determine a first element set according to the word segmentation result and the keyword bag of words, the first element set containing N elements, where the value of each element in the first element set is the number of times the corresponding keyword in the keyword bag of words occurs in the word segmentation result;
a second obtaining module, configured to obtain a target word in the word segmentation result, and to obtain the similarity value between the target word and each keyword in the keyword bag of words, the target word being a word that exists in the word segmentation result but not in the keyword bag of words;
an updating module, configured to, when the similarity value between the target word and a first keyword in the keyword bag of words is greater than a similarity threshold, update the first element corresponding to the first keyword in the first element set according to the similarity value between the target word and the first keyword, to obtain a second element set;
wherein each element in the second element set is used to construct a machine learning model for short text classification.
9. A terminal, characterized by comprising a processor and a memory connected to each other, wherein the memory is configured to store a computer program comprising program instructions, and the processor is configured to call the program instructions to execute the method according to any one of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to execute the method according to any one of claims 1-7.
CN201811421160.8A 2018-11-26 2018-11-26 Sample data processing method and device Active CN109508378B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201811421160.8A CN109508378B (en) 2018-11-26 2018-11-26 Sample data processing method and device
PCT/CN2019/088803 WO2020107835A1 (en) 2018-11-26 2019-05-28 Sample data processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811421160.8A CN109508378B (en) 2018-11-26 2018-11-26 Sample data processing method and device

Publications (2)

Publication Number Publication Date
CN109508378A true CN109508378A (en) 2019-03-22
CN109508378B CN109508378B (en) 2023-07-14

Family

ID=65750624

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811421160.8A Active CN109508378B (en) 2018-11-26 2018-11-26 Sample data processing method and device

Country Status (2)

Country Link
CN (1) CN109508378B (en)
WO (1) WO2020107835A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020107835A1 (en) * 2018-11-26 2020-06-04 平安科技(深圳)有限公司 Sample data processing method and device
CN111353050A (en) * 2019-12-27 2020-06-30 北京合力亿捷科技股份有限公司 Word stock construction method and tool in vertical field of telecommunication customer service
CN111625468A (en) * 2020-06-05 2020-09-04 中国银行股份有限公司 Test case duplicate removal method and device
CN113011533A (en) * 2021-04-30 2021-06-22 平安科技(深圳)有限公司 Text classification method and device, computer equipment and storage medium
CN113779959A (en) * 2021-08-31 2021-12-10 西南电子技术研究所(中国电子科技集团公司第十研究所) Small sample text data mixing enhancement method
CN117370809A (en) * 2023-11-02 2024-01-09 快朵儿(广州)云科技有限公司 Artificial intelligence model construction method, system and storage medium based on deep learning

Families Citing this family (3)

Publication number Priority date Publication date Assignee Title
CN113312481A (en) * 2021-05-27 2021-08-27 中国平安人寿保险股份有限公司 Text classification method, device and equipment based on block chain and storage medium
CN114548261A (en) * 2022-02-18 2022-05-27 北京百度网讯科技有限公司 Data processing method, data processing device, electronic equipment and storage medium
CN117009519A (en) * 2023-07-19 2023-11-07 上交所技术有限责任公司 Enterprise leaning industry method based on word bag model

Citations (4)

Publication number Priority date Publication date Assignee Title
US20100235343A1 (en) * 2009-03-13 2010-09-16 Microsoft Corporation Predicting Interestingness of Questions in Community Question Answering
US20140317074A1 (en) * 2013-04-23 2014-10-23 Microsoft Corporation Automatic Taxonomy Construction From Keywords
CN104199959A (en) * 2014-09-18 2014-12-10 浪潮软件集团有限公司 Text classification method for Internet tax-related data
CN104462244A (en) * 2014-11-19 2015-03-25 武汉大学 Smart city heterogeneous data sharing method based on meta model

Family Cites Families (6)

Publication number Priority date Publication date Assignee Title
CN102622378A (en) * 2011-01-30 2012-08-01 北京千橡网景科技发展有限公司 Method and device for detecting events from text flow
US20160253597A1 (en) * 2015-02-27 2016-09-01 Xerox Corporation Content-aware domain adaptation for cross-domain classification
CN105488023B (en) * 2015-03-20 2019-01-11 广州爱九游信息技术有限公司 A kind of text similarity appraisal procedure and device
CN105045875B (en) * 2015-07-17 2018-06-12 北京林业大学 Personalized search and device
CN107103012A (en) * 2016-01-28 2017-08-29 阿里巴巴集团控股有限公司 Recognize method, device and the server of violated webpage
CN109508378B (en) * 2018-11-26 2023-07-14 平安科技(深圳)有限公司 Sample data processing method and device


Cited By (9)

Publication number Priority date Publication date Assignee Title
WO2020107835A1 (en) * 2018-11-26 2020-06-04 平安科技(深圳)有限公司 Sample data processing method and device
CN111353050A (en) * 2019-12-27 2020-06-30 北京合力亿捷科技股份有限公司 Word stock construction method and tool in vertical field of telecommunication customer service
CN111625468A (en) * 2020-06-05 2020-09-04 中国银行股份有限公司 Test case duplicate removal method and device
CN111625468B (en) * 2020-06-05 2024-04-16 中国银行股份有限公司 Test case duplicate removal method and device
CN113011533A (en) * 2021-04-30 2021-06-22 平安科技(深圳)有限公司 Text classification method and device, computer equipment and storage medium
CN113011533B (en) * 2021-04-30 2023-10-24 平安科技(深圳)有限公司 Text classification method, apparatus, computer device and storage medium
CN113779959A (en) * 2021-08-31 2021-12-10 西南电子技术研究所(中国电子科技集团公司第十研究所) Small sample text data mixing enhancement method
CN117370809A (en) * 2023-11-02 2024-01-09 快朵儿(广州)云科技有限公司 Artificial intelligence model construction method, system and storage medium based on deep learning
CN117370809B (en) * 2023-11-02 2024-04-12 快朵儿(广州)云科技有限公司 Artificial intelligence model construction method, system and storage medium based on deep learning

Also Published As

Publication number Publication date
CN109508378B (en) 2023-07-14
WO2020107835A1 (en) 2020-06-04

Similar Documents

Publication Publication Date Title
CN109508378A (en) A kind of sample data processing method and processing device
CN108121700A (en) A kind of keyword extracting method, device and electronic equipment
CN108197109A (en) A kind of multilingual analysis method and device based on natural language processing
CN103207860B (en) The entity relation extraction method and apparatus of public sentiment event
CN105045875B (en) Personalized search and device
CN108288067A (en) Training method, bidirectional research method and the relevant apparatus of image text Matching Model
CN105243129A (en) Commodity property characteristic word clustering method
CN106570144A (en) Method and apparatus for recommending information
CN112148889A (en) Recommendation list generation method and device
CN106202042A (en) A kind of keyword abstraction method based on figure
CN106598949B (en) A kind of determination method and device of word to text contribution degree
CN103116588A (en) Method and system for personalized recommendation
CN108228758A (en) A kind of file classification method and device
CN111316296A (en) Structure of learning level extraction model
CN110134792A (en) Text recognition method, device, electronic equipment and storage medium
CN107291825A (en) With the search method and system of money commodity in a kind of video
CN110008309A (en) A kind of short phrase picking method and device
CN106445906A (en) Generation method and apparatus for medium-and-long phrase in domain lexicon
CN104199838B (en) A kind of user model constructing method based on label disambiguation
CN105893362A (en) A method for acquiring knowledge point semantic vectors and a method and a system for determining correlative knowledge points
CN108228612B (en) Method and device for extracting network event keywords and emotional tendency
CN113722478B (en) Multi-dimensional feature fusion similar event calculation method and system and electronic equipment
CN109614626A (en) Keyword Automatic method based on gravitational model
CN107169061A (en) A kind of text multi-tag sorting technique for merging double information sources
CN106445907A (en) Domain lexicon generation method and apparatus

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant