CN108304387A - Method, apparatus, server group and storage medium for recognizing noise words in text - Google Patents
Method, apparatus, server group and storage medium for recognizing noise words in text
- Publication number
- CN108304387A CN108304387A CN201810195233.XA CN201810195233A CN108304387A CN 108304387 A CN108304387 A CN 108304387A CN 201810195233 A CN201810195233 A CN 201810195233A CN 108304387 A CN108304387 A CN 108304387A
- Authority
- CN
- China
- Prior art keywords
- word
- text
- noise
- training
- identified
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
Abstract
This application provides a method, apparatus, server group and storage medium for recognizing noise words in text. The method includes: obtaining a text to be recognized; converting each word in the text to be recognized into a word vector in turn, to obtain a word vector set corresponding to the text to be recognized; and inputting the word vector set corresponding to the text to be recognized into a pre-established noise word recognition model, to obtain a recognition result, output by the model, for the noise words in the text to be recognized, where the noise word recognition model is trained with word vector sets corresponding to training texts annotated with noise words as training samples. The method provided by this application can recognize a text to be recognized through the pre-established noise word recognition model; because the model is trained on training texts annotated with noise words, it can identify noise words in the text to be recognized.
Description
Technical field
The present invention relates to the field of artificial intelligence, and in particular to a method, apparatus, server group and storage medium for recognizing noise words in text.
Background
Natural language processing is one of the most important subfields of artificial intelligence and is the technical core of today's popular translation systems, dialogue systems and question answering systems. The lack of standardization in text generated in the real world is one of the main factors affecting natural language processing performance, and the lack of standardization caused by noise words is especially notable.
Here, a noise word is a word that is not in the stop-word list but is meaningless in the current context. Unlike stop words, which are relatively fixed, noise words are not fixed: a noise word in one text may not be a noise word in another. For example, the number 12 in "12 the 5th middle school" is a meaningless noise word there, while the 12 in "mid-December" is not. This makes noise words difficult to recognize.
Summary of the invention
In view of this, the present invention provides a method, apparatus, server group and storage medium for recognizing noise words in text, to solve the problem in the prior art that noise words are difficult to recognize. The technical solution is as follows:
A method for recognizing noise words in text, including:
Obtaining a text to be recognized;
Converting each word in the text to be recognized into a word vector in turn, to obtain a word vector set corresponding to the text to be recognized;
Inputting the word vector set corresponding to the text to be recognized into a pre-established noise word recognition model, to obtain a recognition result, output by the noise word recognition model, for the noise words in the text to be recognized, where the noise word recognition model is trained with word vector sets corresponding to training texts annotated with noise words as training samples.
Optionally, the text to be recognized includes a target word;
the training texts include the target word;
the noise word recognition model is trained with, as training samples, word vector sets corresponding to training texts in which the target word is annotated as a noise word, and word vector sets corresponding to training texts in which the target word is annotated as a non-noise word;
and the recognition result indicates whether the target word in the text to be recognized is a noise word.
Optionally, the process of pre-establishing the noise word recognition model includes:
Obtaining multiple texts annotated with noise words, to form a training text set;
Converting each word in each training text in the training text set into a word vector in turn, to obtain a word vector set corresponding to the training text, where the distance between different word vectors characterizes the relevance between their corresponding words;
Training a recurrent neural network with the word vector sets corresponding to the training texts as input, and using the trained recurrent neural network as the noise word recognition model.
Optionally, converting each word in each training text in the training text set into a word vector in turn includes:
Processing each word in each training text in the training text set into vector data in turn, and converting the vector data into word vectors, to obtain the word vector set corresponding to the training text.
Optionally, the method for recognizing noise words in text further includes:
Obtaining the mapping relationship between the vector data corresponding to each word appearing in the training text set and the corresponding word vector;
and converting each word in the text to be recognized into a word vector in turn includes:
Converting each word in the text to be recognized into vector data in turn as target vector data, and converting the target vector data into word vectors based on the mapping relationship between the vector data corresponding to each word appearing in the training text set and the corresponding word vector.
An apparatus for recognizing noise words in text, including: a text-to-be-recognized acquisition module, a text-to-be-recognized conversion module and a noise recognition module;
the text-to-be-recognized acquisition module is configured to obtain a text to be recognized;
the text-to-be-recognized conversion module is configured to convert each word in the text to be recognized into a word vector in turn, to obtain a word vector set corresponding to the text to be recognized;
the noise recognition module is configured to input the word vector set corresponding to the text to be recognized into a pre-established noise word recognition model, to obtain a recognition result, output by the noise word recognition model, for the noise words in the text to be recognized, where the noise word recognition model is trained with word vector sets corresponding to training texts annotated with noise words as training samples.
Optionally, the text to be recognized includes a target word;
the training texts include the target word;
the noise word recognition model is trained with, as training samples, word vector sets corresponding to training texts in which the target word is annotated as a noise word, and word vector sets corresponding to training texts in which the target word is annotated as a non-noise word;
and the recognition result indicates whether the target word in the text to be recognized is a noise word.
Optionally, the apparatus for recognizing noise words in text further includes: a training text acquisition module, a training text conversion module and a training module;
the training text acquisition module is configured to obtain multiple texts annotated with noise words, to form a training text set;
the training text conversion module is configured to convert each word in each training text in the training text set into a word vector in turn, to obtain a word vector set corresponding to the training text, where the distance between different word vectors characterizes the relevance between their corresponding words;
the training module is configured to train a recurrent neural network with the word vector sets corresponding to the training texts as input, and to use the trained recurrent neural network as the noise word recognition model.
A server group, including a memory and a processor;
the memory is configured to store a program;
the processor is configured to execute the program, to perform the following operations:
obtaining a text to be recognized;
converting each word in the text to be recognized into a word vector in turn, to obtain a word vector set corresponding to the text to be recognized;
inputting the word vector set corresponding to the text to be recognized into a pre-established noise word recognition model, to obtain a recognition result, output by the noise word recognition model, for the noise words in the text to be recognized, where the noise word recognition model is trained with word vector sets corresponding to training texts annotated with noise words as training samples.
A readable storage medium on which a computer program is stored, where, when the computer program is executed by a processor, the steps of the above method for recognizing noise words in text are implemented.
The above technical solution has the following beneficial effects:
The method, apparatus, server group and storage medium provided by the present invention first obtain a text to be recognized, then convert each word in the text to be recognized into a word vector in turn to obtain a word vector set corresponding to the text to be recognized, and finally input that word vector set into a pre-established noise word recognition model to obtain the recognition result for the noise words in the text to be recognized. Because the noise word recognition model is trained with word vector sets corresponding to training texts annotated with noise words as training samples, noise words can be identified in the text to be recognized through the model. The method provided by the present invention does not require the user to have strong domain knowledge; the user only needs to annotate the training texts at the initial stage of model training. It is therefore simple to implement, and the recognition accuracy is high.
Description of the drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the accompanying drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention; for those of ordinary skill in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is a schematic flowchart of a method for recognizing noise words in text provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of an implementation of pre-establishing the noise word recognition model provided by an embodiment of the present invention;
Fig. 3 is another schematic flowchart of the method for recognizing noise words in text provided by an embodiment of the present invention;
Fig. 4 is a schematic flowchart of another implementation of pre-establishing the noise word recognition model provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of an apparatus for recognizing noise words in text provided by an embodiment of the present invention;
Fig. 6 is a schematic structural diagram of a server group provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present invention.
An embodiment of the present invention provides a method for recognizing noise words in text. Referring to Fig. 1, which shows a schematic flowchart of the method, the method may include:
Step S101: Obtain a text to be recognized.
Step S102: Convert each word in the text to be recognized into a word vector in turn, to obtain a word vector set corresponding to the text to be recognized.
Step S103: Input the word vector set corresponding to the text to be recognized into the pre-established noise word recognition model, to obtain the recognition result, output by the noise word recognition model, for the noise words in the text to be recognized.
Here, the noise word recognition model is trained with word vector sets corresponding to training texts annotated with noise words as training samples.
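Steps S101 to S103 amount to a simple inference pipeline: look up a vector for each word, then feed the vector sequence to a trained model. The sketch below illustrates only this flow; the character-to-vector table and the threshold "model" are toy stand-ins invented for illustration, not the trained recurrent network described later in this document.

```python
# Minimal sketch of steps S101-S103: text -> word vectors -> model -> noise flags.
# The vector table and the "model" are toy stand-ins, not a trained network.

def text_to_vectors(text, vector_table):
    """S102: convert each word (character) into its word vector, in order."""
    return [vector_table[ch] for ch in text]

def toy_noise_model(vectors):
    """S103 stand-in: flag a token as noise if its vector norm is below 0.5."""
    return [sum(x * x for x in v) ** 0.5 < 0.5 for v in vectors]

# Hypothetical 2-d vectors for a three-character text to be recognized.
table = {"a": [1.0, 0.0], "b": [0.1, 0.2], "c": [0.0, 1.0]}
text = "abc"                      # S101: obtain the text to be recognized
flags = toy_noise_model(text_to_vectors(text, table))
print(flags)                      # one True/False noise flag per character
```

A real implementation would replace `toy_noise_model` with the recurrent network trained in steps S201 to S203 below.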
Referring to Fig. 2, which shows a schematic flowchart of one possible process for pre-establishing the noise word recognition model in this embodiment, the process may include:
Step S201: Obtain multiple texts annotated with noise words, to form a training text set.
Specifically, multiple texts are obtained first. They can be, but are not limited to being, selected from an existing corpus or crawled from the web by a web crawler. The noise words in each text are then annotated, yielding multiple texts annotated with noise words. Each text annotated with noise words is a training text, and together these texts form the training text set. Preferably, texts from different fields are obtained, so that the established noise word recognition model adapts to different application fields.
Step S202: Convert each word in each training text in the training text set into a word vector in turn, to obtain a word vector set corresponding to the training text.
Here, the distance between different word vectors characterizes the relevance between their corresponding words. For example, if the training text set contains many hospital-related training texts such as "第一人民医院" (First People's Hospital) and "第二中心医院" (Second Central Hospital), then after vector conversion the word vector for "医" and the word vector for "院" will be relatively close, i.e. the relevance between "医" and "院" is strong, while "人" and "中" do not co-occur frequently, so the word vector for "人" and the word vector for "中" will be relatively far apart, i.e. the relevance between "人" and "中" is weak.
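The idea that vector distance encodes relevance can be made concrete with a small sketch. The 2-d vectors below are made-up values chosen for illustration, not the output of any real training run:

```python
# Toy illustration: closer word vectors = stronger relevance between words.
import math

def distance(u, v):
    """Euclidean distance between two word vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Hypothetical vectors: "医" and "院" co-occur often, so training would place
# them close together; "人" and "中" rarely co-occur, so they end up far apart.
vec = {"医": [0.9, 0.1], "院": [0.8, 0.2], "人": [0.1, 0.9], "中": [0.9, 0.9]}

print(distance(vec["医"], vec["院"]))  # small -> strong relevance
print(distance(vec["人"], vec["中"]))  # large -> weak relevance
```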
Specifically, converting each word in each training text in the training text set into a word vector in turn may include: processing each word in each training text in the training text set into vector data in turn, and converting the vector data into word vectors, to obtain the word vector set corresponding to the training text.
Step S203: Train a recurrent neural network with the word vector sets corresponding to the training texts as input, and use the trained recurrent neural network as the noise word recognition model.
Here, the recurrent neural network can be, but is not limited to, a neural network model with a memory function such as an RNN, LSTM or GRU.
In the method for recognizing noise words in text provided by this embodiment of the present invention, a text to be recognized is obtained first, then each word in the text to be recognized is converted into a word vector in turn to obtain a word vector set corresponding to the text to be recognized, and finally that word vector set is input into the pre-established noise word recognition model to obtain the recognition result for the noise words in the text to be recognized. Because the noise word recognition model is trained with word vector sets corresponding to training texts annotated with noise words as training samples, noise words can be identified in the text to be recognized through the model. The method can directly analyze the full text of the text to be recognized and determine whether it contains noise words. The method does not require the user to have strong domain knowledge; the user only needs to annotate the training texts at the initial stage of model training, so it is simple to implement and the recognition accuracy is high. In addition, since the training texts are selected from multiple different fields, the method is applicable to multiple different fields, i.e. its scope of application is wide.
Referring to Fig. 3, which shows another schematic flowchart of the method for recognizing noise words in text provided by an embodiment of the present invention, the method may include:
Step S301: Obtain a text to be recognized that includes a target word.
Step S302: Convert each word in the text to be recognized into a word vector in turn, to obtain a word vector set corresponding to the text to be recognized.
Step S303: Input the word vector set corresponding to the text to be recognized into the pre-established noise word recognition model, to obtain a recognition result, output by the model, indicating whether the target word in the text to be recognized is a noise word.
Here, the noise word recognition model is trained on word vector sets corresponding to training texts, where the training texts include the target word. Specifically, the model is trained with, as training samples, word vector sets corresponding to training texts in which the target word is annotated as a noise word, and word vector sets corresponding to training texts in which the target word is annotated as a non-noise word.
Referring to Fig. 4, which shows a schematic flowchart of one possible process for pre-establishing the noise word recognition model in this embodiment, the process may include:
Step S401: Obtain texts in which the target word has been annotated, to form a training text set.
Specifically, multiple texts that include the target word are obtained first. They can be, but are not limited to being, selected from an existing corpus or crawled from the web by a web crawler. The target word in each text is then annotated as either a noise word or a non-noise word, yielding multiple texts in which the target word has been annotated; these texts form the training text set. Preferably, multiple texts that include the target word and belong to different fields are obtained, so that the established noise word recognition model can adapt to different application fields.
Step S402: Convert each word in each training text in the training text set into a word vector in turn, to obtain a word vector set corresponding to the training text.
Here, the distance between different word vectors characterizes the relevance between their corresponding words.
Specifically, converting each word in each training text in the training text set into a word vector in turn may include: processing each word in each training text in the training text set into vector data in turn, and converting the vector data into word vectors, to obtain the word vector set corresponding to the training text.
In one possible implementation, all the words appearing in the training text set can be one-hot encoded, completing the conversion of text data into vector data that a computer can process. Note that one-hot encoding gives each word appearing in the training text set a unique code.
Specifically, if there are N kinds of words in total in the training text set, each word can be represented by a vector of N-1 dimensions: all positions of the vector for the first word are 0, the vector for the second word has a 1 in the first position, the vector for the third word has a 1 in the second position, and so on.
Illustratively, suppose the training text set contains two sentences: "123第一人民医院" ("123 First People's Hospital") and "解放一路" ("Jiefang 1st Road"). The training text set then contains 12 kinds of words, namely "1", "2", "3", "第", "一", "人", "民", "医", "院", "解", "放" and "路", and the codes corresponding to these 12 words are, in turn:
"1" is encoded as: [0,0,0,0,0,0,0,0,0,0,0]
"2" is encoded as: [1,0,0,0,0,0,0,0,0,0,0]
"3" is encoded as: [0,1,0,0,0,0,0,0,0,0,0]
"第" is encoded as: [0,0,1,0,0,0,0,0,0,0,0]
"一" is encoded as: [0,0,0,1,0,0,0,0,0,0,0]
……
"放" is encoded as: [0,0,0,0,0,0,0,0,0,1,0]
"路" is encoded as: [0,0,0,0,0,0,0,0,0,0,1]
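The encoding scheme above can be sketched in a few lines. The helper name `build_codes` is illustrative, not from the patent; the sketch assumes the (N-1)-dimensional variant the text describes, where the first word maps to the all-zero vector:

```python
# Sketch of the (N-1)-dimensional encoding described above: the first word maps
# to the all-zero vector, and the k-th word (k >= 2) has a 1 at position k-1.

def build_codes(words):
    """Assign each distinct word an (N-1)-dimensional code, N = number of words."""
    dim = len(words) - 1
    codes = {}
    for k, w in enumerate(words):          # k = 0 for the first word
        vec = [0] * dim
        if k > 0:
            vec[k - 1] = 1
        codes[w] = vec
    return codes

words = ["1", "2", "3", "第", "一", "人", "民", "医", "院", "解", "放", "路"]
codes = build_codes(words)
print(codes["1"])   # all zeros
print(codes["路"])  # 1 in the last position
```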
After each word has been processed into vector data, a method such as word2vec can be used to convert each piece of vector data into a word vector. Through the above process, the vector data and word vectors corresponding to all the words appearing in the training text set can be obtained. On this basis, for each training sample, the word vectors corresponding to all the words it contains can be determined, and those word vectors form a word vector set; in this way the word vector set corresponding to each training sample is obtained.
In addition, in order to convert the words in a subsequent text to be recognized into word vectors, the mapping relationship between the vector data corresponding to all the words appearing in the training text set and the corresponding word vectors can be stored. Specifically, converting each word in the text to be recognized into a word vector in turn in step S102 may include: converting each word in the text to be recognized into vector data in turn as target vector data, and converting the target vector data into word vectors based on the above correspondence between vector data and word vectors; that is, the word vector corresponding to the target vector data is looked up in the correspondence between vector data and word vectors.
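The stored mapping amounts to a lookup table from a word's vector data to its trained word vector. A minimal sketch, with made-up 2-d word vectors standing in for real word2vec output:

```python
# Sketch of the stored mapping: vector data (one-hot code) -> trained word vector.
# The word vectors here are made-up 2-d values, not real word2vec output.

# Mapping built at training time (keys are one-hot codes stored as tuples,
# since Python lists are not hashable).
code_to_vector = {
    (0, 0): [0.9, 0.1],   # hypothetical code/vector for the first word
    (1, 0): [0.2, 0.8],   # second word
    (0, 1): [0.5, 0.5],   # third word
}

def to_word_vectors(target_codes):
    """Convert the target vector data of a text to be recognized into word vectors."""
    return [code_to_vector[tuple(c)] for c in target_codes]

text_codes = [[0, 1], [0, 0]]        # codes for a two-word text to be recognized
print(to_word_vectors(text_codes))   # the corresponding word vectors, in order
```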
Step S403: Train a recurrent neural network with the word vector sets corresponding to the training texts as input, and use the trained recurrent neural network as the noise word recognition model.
Here, the recurrent neural network can be, but is not limited to, a neural network model with a memory function such as an RNN, LSTM or GRU.
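To make the "memory function" concrete, the following is a minimal forward pass of a plain (Elman-style) RNN over a sequence of word vectors, emitting one noise score per word. All weights are made-up constants rather than trained parameters; a real model would learn them with a framework such as TensorFlow or PyTorch, possibly using an LSTM or GRU cell instead.

```python
# Minimal Elman RNN forward pass: one noise score per word vector in a sequence.
# Weights are fixed toy values; in the patent's setting they would be learned.
import math

W_xh = [[0.5, -0.3], [0.1, 0.7]]   # input -> hidden
W_hh = [[0.2, 0.0], [0.0, 0.2]]    # hidden -> hidden (the "memory")
w_out = [1.0, -1.0]                # hidden -> noise score

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_noise_scores(word_vectors):
    h = [0.0, 0.0]
    scores = []
    for x in word_vectors:
        pre = [a + b for a, b in zip(matvec(W_xh, x), matvec(W_hh, h))]
        h = [math.tanh(p) for p in pre]                 # hidden state carries context
        z = sum(w * hi for w, hi in zip(w_out, h))
        scores.append(1.0 / (1.0 + math.exp(-z)))       # sigmoid: score in (0, 1)
    return scores

scores = rnn_noise_scores([[1.0, 0.0], [0.0, 1.0]])
print(scores)   # one score per word, each between 0 and 1
```

Because the hidden state `h` is carried from word to word, the score for each word depends on the words before it, which is what lets the model judge the same word differently in different contexts.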
In the method for recognizing noise words in text provided by this embodiment, a text to be recognized that includes a target word is obtained first, then each word in the text to be recognized is converted into a word vector in turn to obtain a word vector set corresponding to the text to be recognized, and finally that word vector set is input into the pre-established noise word recognition model to obtain the model's recognition result. Because the noise word recognition model is trained with, as training samples, word vector sets corresponding to training texts in which the target word is annotated as a noise word, and word vector sets corresponding to training texts in which the target word is annotated as a non-noise word, the model can recognize whether the target word in the text to be recognized is a noise word.
The method provided by this embodiment can analyze the target word in the text to be recognized and determine whether it is a noise word. The method does not require strong domain knowledge; the user only needs to annotate the training texts at the initial stage of model training, so it is simple to implement. Moreover, since the noise word recognition model judges whether the target word is a noise word based on the context in which the target word appears, the recognition accuracy is high. In addition, since the training texts are selected from multiple different fields, the method is applicable to multiple different fields, i.e. its scope of application is wide.
An embodiment of the present invention further provides an apparatus for recognizing noise words in text. Referring to Fig. 5, which shows a schematic structural diagram of the apparatus, the apparatus may include: a text-to-be-recognized acquisition module 501, a text-to-be-recognized conversion module 502 and a noise recognition module 503. Here:
the text-to-be-recognized acquisition module 501 is configured to obtain a text to be recognized;
the text-to-be-recognized conversion module 502 is configured to convert each word in the text to be recognized into a word vector in turn, to obtain a word vector set corresponding to the text to be recognized;
the noise recognition module 503 is configured to input the word vector set corresponding to the text to be recognized into a pre-established noise word recognition model, to obtain the recognition result, output by the noise word recognition model, for the noise words in the text to be recognized.
Here, the noise word recognition model is trained with word vector sets corresponding to training texts annotated with noise words as training samples.
The apparatus for recognizing noise words in text provided by this embodiment analyzes the text to be recognized using the pre-established noise word recognition model and determines whether the text to be recognized contains noise words. The apparatus does not require the user to have strong domain knowledge; the user only needs to annotate the training texts at the initial stage of model training, so it is simple to implement and the recognition accuracy is high. In addition, since the training texts are selected from multiple different fields, the apparatus is applicable to multiple different fields, i.e. its scope of application is wide.
In one possible implementation, the text to be recognized obtained by the text-to-be-recognized acquisition module 501 in the above embodiment includes a target word, and correspondingly the training texts also include the target word. The noise word recognition model is trained with, as training samples, word vector sets corresponding to training texts in which the target word is annotated as a noise word, and word vector sets corresponding to training texts in which the target word is annotated as a non-noise word. The recognition result output by the noise recognition module 503 indicates whether the target word in the text to be recognized is a noise word.
In one possible implementation, the apparatus for recognizing noise words in text provided by the above embodiment may further include: a training text acquisition module, a training text conversion module and a training module. Here:
the training text acquisition module is configured to obtain multiple texts annotated with noise words, to form a training text set;
the training text conversion module is configured to convert each word in each training text in the training text set into a word vector in turn, to obtain a word vector set corresponding to the training text, where the distance between different word vectors characterizes the relevance between their corresponding words;
the training module is configured to train a recurrent neural network with the word vector sets corresponding to the training texts as input, and to use the trained recurrent neural network as the noise word recognition model.
Here, the training text conversion module is specifically configured to process each word in each training text in the training text set into vector data in turn, and to convert the vector data into word vectors, to obtain the word vector set corresponding to the training text.
The apparatus for recognizing noise words in text provided by the above embodiment may further include a mapping relationship acquisition module, configured to obtain the mapping relationship between the vector data corresponding to each word appearing in the training text set and the corresponding word vector.
The text-to-be-recognized conversion module 502 is then specifically configured to convert each word in the text to be recognized into vector data in turn as target vector data, and to convert the target vector data into word vectors based on the mapping relationship between the vector data corresponding to each word appearing in the training text set and the corresponding word vectors.
The embodiment of the present invention additionally provides a kind of server group, which may include:Memory 601 and processor
602。
The memory 601 is configured to store a program;
the processor 602 is configured to execute the program to perform the following operations:
obtaining a text to be identified;
converting each word in the text to be identified into a word vector in turn, to obtain a word vector set corresponding to the text to be identified; and
inputting the word vector set corresponding to the text to be identified into a pre-established noise word identification model, and obtaining a recognition result of noise words in the text to be identified output by the noise word identification model, wherein the noise word identification model is trained using, as training samples, word vector sets corresponding to training texts annotated with noise words.
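The processor's operations can be sketched end to end as follows. A toy stand-in scores each word by the mean of its vector; in the patent's design the pre-established noise word identification model (the trained recurrent network) would do the scoring, and the whitespace tokenizer and threshold below are assumptions.

```python
import numpy as np

def stub_model(word_vectors, threshold=0.5):
    """Toy stand-in for the pre-established noise word identification
    model: flags a word as noise when the mean of its word vector
    exceeds a threshold. A real deployment would call the trained RNN."""
    return [float(np.mean(v)) > threshold for v in word_vectors]

def recognize_noise_words(text, embedding):
    """Mirror the processor's operations: obtain the text, convert
    each word to a word vector, run the identification model."""
    words = text.split()                                      # obtain / tokenize
    vectors = [embedding.get(w, np.zeros(2)) for w in words]  # convert
    flags = stub_model(vectors)                               # identify
    return [w for w, is_noise in zip(words, flags) if is_noise]

# Hypothetical embeddings for demonstration only.
embedding = {"buy": np.array([0.9, 0.9]),
             "hello": np.array([0.0, 0.1])}
```

With these hypothetical embeddings, `recognize_noise_words("hello buy", embedding)` flags only `"buy"`, illustrating that the recognition result is produced per word of the text to be identified.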
An embodiment of the present invention further provides a readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the steps of the method for recognizing noise words in a text provided by any of the above embodiments are implemented.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for identical or similar parts the embodiments may refer to one another.
In the several embodiments provided in this application, it should be understood that the disclosed method, apparatus and device may be implemented in other ways. For example, the apparatus embodiments described above are merely exemplary; the division of the units is only a division of logical functions, and in actual implementation there may be other division manners. For example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings, direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some communication interfaces, devices or units, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments. In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the present invention, in essence, or the part contributing to the prior art, or a part of the technical solutions, may be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A method for recognizing noise words in a text, comprising:
obtaining a text to be identified;
converting each word in the text to be identified into a word vector in turn, to obtain a word vector set corresponding to the text to be identified; and
inputting the word vector set corresponding to the text to be identified into a pre-established noise word identification model, and obtaining a recognition result of noise words in the text to be identified output by the noise word identification model, wherein the noise word identification model is trained using, as training samples, word vector sets corresponding to training texts annotated with noise words.
2. The method for recognizing noise words in a text according to claim 1, wherein the text to be identified includes a target word;
the training texts include the target word;
the noise word identification model is trained using, as training samples, word vector sets corresponding to training texts in which the target word is annotated as a noise word and word vector sets corresponding to training texts in which the target word is annotated as a non-noise word; and
the recognition result of noise words in the text to be identified indicates whether the target word is a noise word.
3. The method for recognizing noise words in a text according to claim 1, wherein the process of pre-establishing the noise word identification model comprises:
obtaining multiple texts annotated with noise words to form a training text set;
converting each word in a training text in the training text set into a word vector in turn, to obtain a word vector set corresponding to the training text, wherein the distance between different word vectors characterizes the relevance between their corresponding words; and
training a recurrent neural network with the word vector set corresponding to the training text as input, and using the trained recurrent neural network as the noise word identification model.
4. The method for recognizing noise words in a text according to claim 3, wherein converting each word in a training text in the training text set into a word vector in turn comprises:
processing each word in the training text in the training text set into vector data in turn, and converting the vector data into word vectors, to obtain the word vector set corresponding to the training text.
5. The method for recognizing noise words in a text according to claim 4, further comprising:
obtaining mapping relations between the vector data corresponding to each word occurring in the training text set and the corresponding word vectors;
wherein converting each word in the text to be identified into a word vector in turn comprises:
converting each word in the text to be identified into vector data in turn as target vector data, and converting the target vector data into word vectors based on the mapping relations between the vector data corresponding to each word occurring in the training text set and the corresponding word vectors.
6. A device for recognizing noise words in a text, comprising: a to-be-identified text acquisition module, a to-be-identified text conversion module, and a noise identification module; wherein
the to-be-identified text acquisition module is configured to obtain a text to be identified;
the to-be-identified text conversion module is configured to convert each word in the text to be identified into a word vector in turn, to obtain a word vector set corresponding to the text to be identified; and
the noise identification module is configured to input the word vector set corresponding to the text to be identified into a pre-established noise word identification model, and to obtain a recognition result of noise words in the text to be identified output by the noise word identification model, wherein the noise word identification model is trained using, as training samples, word vector sets corresponding to training texts annotated with noise words.
7. The device for recognizing noise words in a text according to claim 6, wherein the text to be identified includes a target word;
the training texts include the target word;
the noise word identification model is trained using, as training samples, word vector sets corresponding to training texts in which the target word is annotated as a noise word and word vector sets corresponding to training texts in which the target word is annotated as a non-noise word; and
the recognition result of noise words in the text to be identified indicates whether the target word is a noise word.
8. The device for recognizing noise words in a text according to claim 6, further comprising: a training text acquisition module, a training text conversion module, and a training module; wherein
the training text acquisition module is configured to obtain multiple texts annotated with noise words to form a training text set;
the training text conversion module is configured to convert each word in a training text in the training text set into a word vector in turn, to obtain a word vector set corresponding to the training text, wherein the distance between different word vectors characterizes the relevance between their corresponding words; and
the training module is configured to train a recurrent neural network with the word vector set corresponding to the training text as input, and to use the trained recurrent neural network as the noise word identification model.
9. A server group, comprising: a memory and a processor;
the memory is configured to store a program;
the processor is configured to execute the program to perform the following operations:
obtaining a text to be identified;
converting each word in the text to be identified into a word vector in turn, to obtain a word vector set corresponding to the text to be identified; and
inputting the word vector set corresponding to the text to be identified into a pre-established noise word identification model, and obtaining a recognition result of noise words in the text to be identified output by the noise word identification model, wherein the noise word identification model is trained using, as training samples, word vector sets corresponding to training texts annotated with noise words.
10. A readable storage medium on which a computer program is stored, wherein, when the computer program is executed by a processor, the steps of the method for recognizing noise words in a text according to any one of claims 1 to 5 are implemented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810195233.XA CN108304387B (en) | 2018-03-09 | 2018-03-09 | Method, device, server group and storage medium for recognizing noise words in text |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108304387A true CN108304387A (en) | 2018-07-20 |
CN108304387B CN108304387B (en) | 2021-06-15 |
Family
ID=62849431
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810195233.XA Active CN108304387B (en) | 2018-03-09 | 2018-03-09 | Method, device, server group and storage medium for recognizing noise words in text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108304387B (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109147767A (en) * | 2018-08-16 | 2019-01-04 | 平安科技(深圳)有限公司 | Digit recognition method, device, computer equipment and storage medium in voice |
CN109271526A (en) * | 2018-08-14 | 2019-01-25 | 阿里巴巴集团控股有限公司 | Method for text detection, device, electronic equipment and computer readable storage medium |
CN110196909A (en) * | 2019-05-14 | 2019-09-03 | 北京来也网络科技有限公司 | Text denoising method and device based on intensified learning |
CN111079854A (en) * | 2019-12-27 | 2020-04-28 | 联想(北京)有限公司 | Information identification method, device and storage medium |
US11379660B2 (en) * | 2019-06-27 | 2022-07-05 | International Business Machines Corporation | Deep learning approach to computing spans |
CN111079854B (en) * | 2019-12-27 | 2024-04-23 | 联想(北京)有限公司 | Information identification method, equipment and storage medium |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030135356A1 (en) * | 2002-01-16 | 2003-07-17 | Zhiwei Ying | Method and apparatus for detecting prosodic phrase break in a text to speech (TTS) system |
CN104462378A (en) * | 2014-12-09 | 2015-03-25 | 北京国双科技有限公司 | Data processing method and device for text recognition |
CN105956179A (en) * | 2016-05-30 | 2016-09-21 | 上海智臻智能网络科技股份有限公司 | Data filtering method and apparatus |
CN106202330A (en) * | 2016-07-01 | 2016-12-07 | 北京小米移动软件有限公司 | The determination methods of junk information and device |
CN106407971A (en) * | 2016-09-14 | 2017-02-15 | 北京小米移动软件有限公司 | Text recognition method and device |
CN106815192A (en) * | 2015-11-27 | 2017-06-09 | 北京国双科技有限公司 | Model training method and device and sentence emotion identification method and device |
CN106919554A (en) * | 2016-10-27 | 2017-07-04 | 阿里巴巴集团控股有限公司 | The recognition methods of invalid word and device in document |
CN107122416A (en) * | 2017-03-31 | 2017-09-01 | 北京大学 | A kind of Chinese event abstracting method |
US20170293597A1 (en) * | 2016-04-07 | 2017-10-12 | Khalifa University Of Science, Technology And Research | Methods and systems for data processing |
CN107273362A (en) * | 2017-07-04 | 2017-10-20 | 联想(北京)有限公司 | Data processing method and its equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||