CN108304387A - Method, device, server group, and storage medium for recognizing noise words in text - Google Patents

Publication number
CN108304387A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810195233.XA
Other languages
Chinese (zh)
Other versions
CN108304387B (en)
Inventor
金宝宝
杨帆
张成松
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Application filed by Lenovo Beijing Ltd
Priority to CN201810195233.XA
Publication of CN108304387A
Application granted
Publication of CN108304387B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines

Abstract

This application provides a method, device, server group, and storage medium for recognizing noise words in text. The method includes: obtaining a text to be identified; converting each word in the text to be identified into a word vector in turn, to obtain the word vector set corresponding to the text; and inputting that word vector set into a pre-established noise word identification model, to obtain the model's recognition result for noise words in the text. The noise word identification model is trained using, as training samples, the word vector sets corresponding to training texts labeled with noise words. Because the model is trained on training texts labeled with noise words, it can identify noise words in the text to be identified.

Description

Method, device, server group, and storage medium for recognizing noise words in text
Technical field
The present invention relates to the field of artificial intelligence, and in particular to a method, device, server group, and storage medium for recognizing noise words in text.
Background
Natural language processing is one of the most important subfields of artificial intelligence and is the core technology behind today's popular translation systems, interactive systems, and question answering systems. The non-standard nature of text produced in the real world is one of the main factors limiting natural language processing performance, and the irregularity caused by noise words is especially significant.
A noise word is a word that falls outside the range of stop words but carries no meaning in its current context. Unlike stop words, which are relatively fixed, noise words are not fixed: a noise word in one text may not be a noise word in another. For example, the number "12" in "12 No. 5 Middle School" is a meaningless noise word, but the same "12" in "mid-December" (the twelfth month) is not. This makes noise words difficult to identify.
Summary of the invention
In view of this, the present invention provides a method, device, server group, and storage medium for recognizing noise words in text, to solve the prior-art problem that noise words are difficult to identify. The technical solution is as follows:
A method for recognizing noise words in text, including:
Obtaining a text to be identified;
Converting each word in the text to be identified into a word vector in turn, to obtain the word vector set corresponding to the text to be identified;
Inputting the word vector set corresponding to the text to be identified into a pre-established noise word identification model, to obtain the recognition result, output by the noise word identification model, for noise words in the text to be identified, wherein the noise word identification model is trained using, as training samples, the word vector sets corresponding to training texts labeled with noise words.
Wherein the text to be identified includes a target word;
The training texts include the target word;
The noise word identification model is trained using, as training samples, the word vector sets corresponding to training texts in which the target word is labeled as a noise word and the word vector sets corresponding to training texts in which the target word is labeled as a non-noise word;
The recognition result for noise words in the text to be identified indicates whether the target word is a noise word.
Wherein the process of pre-establishing the noise word identification model includes:
Obtaining multiple texts labeled with noise words to form a training text set;
Converting each word in the training texts of the training text set into a word vector in turn, to obtain the word vector set corresponding to each training text, wherein the distance between different word vectors characterizes the association between their corresponding words;
Taking the word vector sets corresponding to the training texts as input, training a recurrent neural network, and using the trained recurrent neural network as the noise word identification model.
Wherein converting each word in the training texts of the training text set into a word vector in turn includes:
Processing each word in the training texts of the training text set into vector data in turn, and converting the vector data into word vectors, to obtain the word vector set corresponding to each training text.
Wherein the method for recognizing noise words in text further includes:
Obtaining the mapping relationship between the vector data of each word appearing in the training text set and the corresponding word vector;
Converting each word in the text to be identified into a word vector in turn includes:
Converting each word in the text to be identified into vector data in turn as target vector data, and converting the target vector data into word vectors based on the mapping relationship between the vector data of each word appearing in the training text set and the corresponding word vector.
A device for recognizing noise words in text, including: a to-be-identified text acquisition module, a to-be-identified text conversion module, and a noise identification module;
The to-be-identified text acquisition module is configured to obtain a text to be identified;
The to-be-identified text conversion module is configured to convert each word in the text to be identified into a word vector in turn, to obtain the word vector set corresponding to the text to be identified;
The noise identification module is configured to input the word vector set corresponding to the text to be identified into a pre-established noise word identification model, to obtain the recognition result, output by the noise word identification model, for noise words in the text to be identified, wherein the noise word identification model is trained using, as training samples, the word vector sets corresponding to training texts labeled with noise words.
Wherein the text to be identified includes a target word;
The training texts include the target word;
The noise word identification model is trained using, as training samples, the word vector sets corresponding to training texts in which the target word is labeled as a noise word and the word vector sets corresponding to training texts in which the target word is labeled as a non-noise word;
The recognition result for noise words in the text to be identified indicates whether the target word is a noise word.
The device for recognizing noise words in text further includes: a training text acquisition module, a training text conversion module, and a training module;
The training text acquisition module is configured to obtain multiple texts labeled with noise words to form a training text set;
The training text conversion module is configured to convert each word in the training texts of the training text set into a word vector in turn, to obtain the word vector set corresponding to each training text, wherein the distance between different word vectors characterizes the association between their corresponding words;
The training module is configured to take the word vector sets corresponding to the training texts as input, train a recurrent neural network, and use the trained recurrent neural network as the noise word identification model.
A server group, including: a memory and a processor;
The memory is configured to store a program;
The processor is configured to execute the program to perform the following operations:
Obtaining a text to be identified;
Converting each word in the text to be identified into a word vector in turn, to obtain the word vector set corresponding to the text to be identified;
Inputting the word vector set corresponding to the text to be identified into a pre-established noise word identification model, to obtain the recognition result, output by the noise word identification model, for noise words in the text to be identified, wherein the noise word identification model is trained using, as training samples, the word vector sets corresponding to training texts labeled with noise words.
A readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, each step of the method for recognizing noise words in text described above is implemented.
The above technical solution has the following beneficial effects:
With the method, device, server group, and storage medium for recognizing noise words in text provided by the present invention, a text to be identified is first obtained; each word in the text is then converted into a word vector in turn, yielding the word vector set corresponding to the text; finally, that word vector set is input into the pre-established noise word identification model to obtain the model's recognition result for noise words in the text. Because the noise word identification model is trained using, as training samples, the word vector sets corresponding to training texts labeled with noise words, it can identify noise words in the text to be identified. The method does not require the user to have strong domain knowledge; the user only needs to label the training texts before the model is trained. It is therefore simple to implement, and its recognition accuracy is relatively high.
Description of the drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a flow diagram of the method for recognizing noise words in text provided by an embodiment of the present invention;
Fig. 2 is a flow diagram of an implementation, provided by an embodiment of the present invention, of pre-establishing the noise word identification model;
Fig. 3 is another flow diagram of the method for recognizing noise words in text provided by an embodiment of the present invention;
Fig. 4 is a flow diagram of an implementation, provided by an embodiment of the present invention, of pre-establishing the noise word identification model;
Fig. 5 is a structural diagram of the device for recognizing noise words in text provided by an embodiment of the present invention;
Fig. 6 is a structural diagram of the server group provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
An embodiment of the present invention provides a method for recognizing noise words in text. Referring to Fig. 1, which shows a flow diagram of the method, the method may include:
Step S101: Obtain a text to be identified.
Step S102: Convert each word in the text to be identified into a word vector in turn, to obtain the word vector set corresponding to the text to be identified.
Step S103: Input the word vector set corresponding to the text to be identified into the pre-established noise word identification model, to obtain the recognition result, output by the noise word identification model, for noise words in the text to be identified.
The noise word identification model is trained using, as training samples, the word vector sets corresponding to training texts labeled with noise words.
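Steps S101 to S103 can be sketched as a small pipeline. This is a minimal sketch under stated assumptions, not the patented implementation: the function name `identify_noise_words`, the per-word boolean output, the 2-dimensional toy vectors, and the stand-in `toy_model` are all hypothetical, since the text does not fix the form of the recognition result or of the model.

```python
from typing import Callable, Dict, List, Sequence

Vector = List[float]


def identify_noise_words(
    text: str,
    word_to_vector: Dict[str, Vector],
    model: Callable[[Sequence[Vector]], List[bool]],
) -> List[bool]:
    """Run S102 and S103 on a text obtained in S101.

    Returns one flag per word (assumed output format): True means noise.
    """
    # S102: convert each word, in order, into its word vector.
    vectors = [word_to_vector[word] for word in text]
    # S103: feed the word vector set to the pre-established model.
    flags = model(vectors)
    if len(flags) != len(text):
        raise ValueError("model must return one flag per word")
    return flags


# Hypothetical stand-ins: toy vectors and a toy "model" that flags any
# vector whose first component exceeds 0.5. A real model would be trained.
toy_vectors = {"1": [1.0, 0.0], "2": [1.0, 0.1], "医": [0.0, 1.0], "院": [0.0, 0.9]}
toy_model = lambda vectors: [v[0] > 0.5 for v in vectors]

result = identify_noise_words("12医院", toy_vectors, toy_model)
```

Passing the model in as a callable keeps the pipeline independent of how the model was built, which mirrors the way the text treats the model as pre-established.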
Referring to Fig. 2, which shows a flow diagram of one possible implementation of pre-establishing the noise word identification model in this embodiment, the process may include:
Step S201: Obtain multiple texts labeled with noise words to form a training text set.
Specifically, multiple texts are obtained first. They may be, but are not limited to being, selected from an existing corpus or crawled from the network by a web crawler. The noise words in each text are then labeled, yielding multiple texts labeled with noise words. Each labeled text is one training text, and together these texts form the training text set. Preferably, texts from different fields are obtained, so that the resulting noise word identification model adapts to different application fields.
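One way to represent the labeled training texts of step S201 is sketched below. The `LabeledText` class, the `label` helper, and the per-word flag layout are illustrative assumptions; the text does not prescribe an annotation format. The two sample strings are assumed Chinese renderings of the "12 No. 5 Middle School" and "mid-December" examples from the background section.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class LabeledText:
    """One training text plus a noise flag for each of its words."""

    text: str
    noise_flags: List[bool]

    def __post_init__(self) -> None:
        # Every word needs exactly one label.
        if len(self.text) != len(self.noise_flags):
            raise ValueError("need exactly one flag per word")


def label(text: str, noise_positions: Set[int]) -> LabeledText:
    """Mark the words at the given 0-based positions as noise."""
    return LabeledText(text, [i in noise_positions for i in range(len(text))])


# Hypothetical training text set: the same "12" is noise in one text
# and meaningful in the other.
training_set = [
    label("12第五中学", {0, 1}),  # "12" labeled as noise
    label("12月中旬", set()),      # "12" labeled as non-noise
]
```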
Step S202: Convert each word in the training texts of the training text set into a word vector in turn, to obtain the word vector set corresponding to each training text.
The distance between different word vectors characterizes the association between their corresponding words. For example, suppose the training text set contains a large number of hospital-related training texts such as "First People's Hospital" and "Second Central Hospital". After vector conversion, the word vectors of the two characters that make up "hospital" (医 and 院) are relatively close to each other, meaning the association between them is strong. The characters 人 (from "People's") and 中 (from "Central") do not frequently occur together, so their word vectors are relatively far apart, meaning the association between them is weak.
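The relationship between vector distance and association can be illustrated with a small sketch. Cosine distance is an assumption here, since the text does not name a distance measure, and the 2-dimensional vectors are made-up values chosen only to mimic the hospital example.

```python
import math
from typing import List


def cosine_distance(a: List[float], b: List[float]) -> float:
    """Return 1 - cosine similarity; smaller means more closely associated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical word vectors (not learned from any corpus).
vectors = {
    "医": [0.9, 0.1],
    "院": [0.8, 0.2],
    "人": [0.1, 0.9],
    "中": [-0.7, 0.3],
}

close = cosine_distance(vectors["医"], vectors["院"])  # frequently co-occurring pair
far = cosine_distance(vectors["人"], vectors["中"])    # rarely co-occurring pair
```

With these toy values, `close` is much smaller than `far`, matching the claim that strongly associated words sit near each other.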
Specifically, converting each word in the training texts of the training text set into a word vector in turn may include: processing each word in the training texts into vector data in turn, and converting the vector data into word vectors, to obtain the word vector set corresponding to each training text.
Step S203: Take the word vector sets corresponding to the training texts as input, train a recurrent neural network, and use the trained recurrent neural network as the noise word identification model.
The recurrent neural network may be, but is not limited to, a neural network model with a memory function such as an RNN, LSTM, or GRU.
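To show why a network with memory suits this task, the sketch below runs the forward pass of a plain Elman-style RNN over a sequence of word vectors and emits a noise probability per position. It is a sketch under assumptions: the weights are random rather than trained, the sizes are arbitrary, and the per-position sigmoid output is one possible design, not something the text specifies. Training (for example by backpropagation through time) is omitted.

```python
import math
import random
from typing import List


def rnn_forward(
    inputs: List[List[float]],
    w_xh: List[List[float]],
    w_hh: List[List[float]],
    w_hy: List[float],
) -> List[float]:
    """h_t = tanh(W_xh x_t + W_hh h_{t-1}); p_t = sigmoid(w_hy . h_t)."""
    hidden_size = len(w_hh)
    h = [0.0] * hidden_size  # initial hidden state (the "memory")
    probs = []
    for x in inputs:
        # The new state depends on the current input and the previous state.
        h = [
            math.tanh(
                sum(w_xh[j][i] * x[i] for i in range(len(x)))
                + sum(w_hh[j][k] * h[k] for k in range(hidden_size))
            )
            for j in range(hidden_size)
        ]
        z = sum(w_hy[j] * h[j] for j in range(hidden_size))
        probs.append(1.0 / (1.0 + math.exp(-z)))
    return probs


rng = random.Random(0)  # fixed seed so the sketch is repeatable


def rand_matrix(rows: int, cols: int) -> List[List[float]]:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


w_xh = rand_matrix(3, 2)  # input size 2, hidden size 3 (arbitrary)
w_hh = rand_matrix(3, 3)
w_hy = [rng.uniform(-0.5, 0.5) for _ in range(3)]

# The same vector at position 1, preceded by two different contexts.
probs_a = rnn_forward([[1.0, 0.0], [0.0, 1.0]], w_xh, w_hh, w_hy)
probs_b = rnn_forward([[0.0, 1.0], [0.0, 1.0]], w_xh, w_hh, w_hy)
```

Because the hidden state carries information from earlier words, the score for the same word vector at position 1 differs between the two sequences, which is what lets such a model treat "12" differently in different contexts.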
In the method for recognizing noise words in text provided by this embodiment of the present invention, a text to be identified is first obtained; each word in the text is then converted into a word vector in turn, yielding the word vector set corresponding to the text; finally, that word vector set is input into the pre-established noise word identification model to obtain the model's recognition result for noise words in the text. Because the noise word identification model is trained using, as training samples, the word vector sets corresponding to training texts labeled with noise words, it can identify noise words in the text to be identified. The method can analyze the full text of the text to be identified directly and determine whether it contains noise words. The method does not require the user to have strong domain knowledge; the user only needs to label the training texts before the model is trained, so the method is simple to implement and its recognition accuracy is relatively high. In addition, because the training texts are selected from multiple different fields, the method is applicable to multiple different fields, that is, its scope of application is wide.
Referring to Fig. 3, which shows another flow diagram of the method for recognizing noise words in text provided by an embodiment of the present invention, the method may include:
Step S301: Obtain a text to be identified that includes a target word.
Step S302: Convert each word in the text to be identified into a word vector in turn, to obtain the word vector set corresponding to the text to be identified.
Step S303: Input the word vector set corresponding to the text to be identified into the pre-established noise word identification model, to obtain the recognition result, output by the noise word identification model, that indicates whether the target word in the text to be identified is a noise word.
The noise word identification model is trained on the word vector sets corresponding to training texts that include the target word. Specifically, the model is trained using, as training samples, the word vector sets corresponding to training texts in which the target word is labeled as a noise word and the word vector sets corresponding to training texts in which the target word is labeled as a non-noise word.
Referring to Fig. 4, which shows a flow diagram of one possible implementation of pre-establishing the noise word identification model in this embodiment, the process may include:
Step S401: Obtain texts that include the target word and in which the target word has been labeled, to form a training text set.
Specifically, multiple texts that include the target word are obtained first. They may be, but are not limited to being, selected from an existing corpus or crawled from the network by a web crawler. The target word in each text is then labeled as either a noise word or a non-noise word, yielding multiple texts in which the target word is labeled. These texts form the training text set. Preferably, the multiple texts that include the target word belong to different fields, so that the resulting noise word identification model can adapt to different application fields.
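Assembling the target-word training set of step S401 can be sketched as a filter plus a sanity check that both labels are represented. The `Sample` tuple layout and both helper names are hypothetical, and the corpus strings are assumed Chinese renderings used only for illustration.

```python
from typing import List, Tuple

# (text, whether the target word is noise in this text): assumed layout.
Sample = Tuple[str, bool]


def select_target_samples(target: str, labeled: List[Sample]) -> List[Sample]:
    """Keep only the labeled texts that contain the target word."""
    return [sample for sample in labeled if target in sample[0]]


def covers_both_labels(samples: List[Sample]) -> bool:
    """Check that the target word is labeled noise in some texts and non-noise in others."""
    return {is_noise for _, is_noise in samples} == {True, False}


corpus = [
    ("12第五中学", True),   # target "12" labeled as noise
    ("12月中旬", False),    # target "12" labeled as non-noise
    ("解放一路", False),    # does not contain the target
]
samples = select_target_samples("12", corpus)
```

Checking for both labels reflects the requirement that the model see the target word in noise and non-noise contexts; a set with only one label would give it nothing to discriminate.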
Step S402: Convert each word in the training texts of the training text set into a word vector in turn, to obtain the word vector set corresponding to each training text.
The distance between different word vectors characterizes the association between their corresponding words.
Specifically, converting each word in the training texts of the training text set into a word vector in turn may include: processing each word in the training texts into vector data in turn, and converting the vector data into word vectors, to obtain the word vector set corresponding to each training text.
In one possible implementation, all the words appearing in the training text set are one-hot encoded, which completes the conversion of text data into vector data that a computer can process. One-hot encoding gives each word that appears in the training text set a unique code.
Specifically, if the training text set contains N kinds of words in total, each word can be represented by an (N-1)-dimensional vector: every position of the vector for the first word is 0, the first position of the vector for the second word is 1, the second position of the vector for the third word is 1, and so on.
For example, suppose the training text set contains two sentences: "123第一人民医院" ("123 First People's Hospital") and "解放一路" ("Jiefang 1st Road"). The training text set then contains 12 kinds of words: "1", "2", "3", "第", "一", "人", "民", "医", "院", "解", "放", and "路". The codes corresponding to these 12 words are, in order:
"1" is encoded as: [0,0,0,0,0,0,0,0,0,0,0,0]
"2" is encoded as: [1,0,0,0,0,0,0,0,0,0,0,0]
"3" is encoded as: [0,1,0,0,0,0,0,0,0,0,0,0]
"第" is encoded as: [0,0,1,0,0,0,0,0,0,0,0,0]
"一" is encoded as: [0,0,0,1,0,0,0,0,0,0,0,0]
……
"放" is encoded as: [0,0,0,0,0,0,0,0,0,0,1,0]
"路" is encoded as: [0,0,0,0,0,0,0,0,0,0,0,1]
After each word has been processed into vector data, a method such as word2vec can be used to convert each piece of vector data into a word vector. This process yields the vector data and the word vector for every word that appears in the training text set. On this basis, the word vectors of all the words in each training sample can be determined, and together they form the word vector set corresponding to that training sample.
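The one-hot step described above can be sketched as follows. The function names and the `drop_first` flag are illustrative assumptions: `drop_first=True` gives the (N-1)-dimensional scheme in which the first word is all zeros, while `drop_first=False` gives the conventional N-dimensional one-hot code. The two sentences are written in Chinese characters, which is an assumed rendering of the example. The later word2vec conversion is not covered here.

```python
from typing import Dict, List


def build_vocabulary(texts: List[str]) -> List[str]:
    """Collect the distinct words in order of first appearance."""
    vocab: List[str] = []
    seen = set()
    for text in texts:
        for word in text:
            if word not in seen:
                seen.add(word)
                vocab.append(word)
    return vocab


def one_hot(vocab: List[str], drop_first: bool = True) -> Dict[str, List[int]]:
    """Assign each word a unique 0/1 vector."""
    size = len(vocab) - 1 if drop_first else len(vocab)
    codes: Dict[str, List[int]] = {}
    for index, word in enumerate(vocab):
        vector = [0] * size
        position = index - 1 if drop_first else index
        if position >= 0:  # with drop_first, the first word stays all zeros
            vector[position] = 1
        codes[word] = vector
    return codes


vocab = build_vocabulary(["123第一人民医院", "解放一路"])
codes = one_hot(vocab)                        # (N-1)-dimensional variant
full_codes = one_hot(vocab, drop_first=False)  # conventional N-dimensional variant
```

Either variant gives every word a distinct code; the conventional form avoids an all-zero vector, which some downstream tools may expect.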
In addition, to support the later conversion of words in a text to be identified into word vectors, the mapping relationship between the vector data of every word appearing in the training text set and the corresponding word vector can be stored. Specifically, converting each word in the text to be identified into a word vector in turn, as in step S102, may include: converting each word in the text to be identified into vector data in turn as target vector data, and then converting the target vector data into word vectors based on the stored mapping relationship between vector data and word vectors, that is, looking up in that mapping the word vector corresponding to each piece of target vector data.
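The stored mapping and the lookup can be sketched with a dictionary keyed by the vector data. The names `build_mapping` and `convert_text` and the tiny two-word tables are hypothetical. The text does not say what happens to a word that never appeared in the training text set; in this sketch such a word simply raises `KeyError`.

```python
from typing import Dict, List, Tuple


def build_mapping(
    codes: Dict[str, List[int]], word_vectors: Dict[str, List[float]]
) -> Dict[Tuple[int, ...], List[float]]:
    """Store vector data -> word vector for every word seen in training."""
    return {tuple(code): word_vectors[word] for word, code in codes.items()}


def convert_text(
    text: str,
    codes: Dict[str, List[int]],
    mapping: Dict[Tuple[int, ...], List[float]],
) -> List[List[float]]:
    """Convert each word to target vector data, then look up its word vector."""
    vectors = []
    for word in text:
        target_vector_data = tuple(codes[word])  # KeyError for unseen words
        vectors.append(mapping[target_vector_data])
    return vectors


# Hypothetical two-word tables standing in for the training-time results.
codes = {"医": [1, 0], "院": [0, 1]}
word_vectors = {"医": [0.9, 0.1], "院": [0.8, 0.2]}

mapping = build_mapping(codes, word_vectors)
converted = convert_text("医院", codes, mapping)
```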
Step S403: Take the word vector sets corresponding to the training texts as input, train a recurrent neural network, and use the trained recurrent neural network as the noise word identification model.
The recurrent neural network may be, but is not limited to, a neural network model with a memory function such as an RNN, LSTM, or GRU.
In the method for recognizing noise words in text provided by this embodiment of the present invention, a text to be identified that includes a target word is first obtained; each word in the text is then converted into a word vector in turn, yielding the word vector set corresponding to the text; finally, that word vector set is input into the pre-established noise word identification model to obtain the recognition result output by the model. Because the noise word identification model is trained using, as training samples, the word vector sets corresponding to training texts in which the target word is labeled as a noise word and those in which it is labeled as a non-noise word, the model can identify whether the target word in the text to be identified is a noise word. The method does not require strong domain knowledge; the training texts only need to be labeled before the model is trained, so the method is simple to implement. Moreover, because the noise word identification model judges whether the target word is a noise word from the context in which the target word appears, its recognition accuracy is relatively high. In addition, because the training texts are selected from multiple different fields, the method is applicable to multiple different fields, that is, its scope of application is wide.
An embodiment of the present invention further provides a device for recognizing noise words in text. Referring to Fig. 5, which shows a structural diagram of the device, the device may include: a to-be-identified text acquisition module 501, a to-be-identified text conversion module 502, and a noise identification module 503. Specifically:
The to-be-identified text acquisition module 501 is configured to obtain a text to be identified.
The to-be-identified text conversion module 502 is configured to convert each word in the text to be identified into a word vector in turn, to obtain the word vector set corresponding to the text to be identified.
The noise identification module 503 is configured to input the word vector set corresponding to the text to be identified into the pre-established noise word identification model, to obtain the recognition result, output by the noise word identification model, for noise words in the text to be identified.
The noise word identification model is trained using, as training samples, the word vector sets corresponding to training texts labeled with noise words.
The device for recognizing noise words in text provided by this embodiment of the present invention uses the pre-established noise word identification model to analyze the text to be identified and determine whether it contains noise words. The device does not require the user to have strong domain knowledge; the user only needs to label the training texts before the model is trained, so the device is simple to implement and its recognition accuracy is relatively high. In addition, because the training texts are selected from multiple different fields, the device is applicable to multiple different fields, that is, its scope of application is wide.
In one possible implementation, in above-described embodiment text acquisition module 501 to be identified obtain it is to be identified Text includes target word, also includes correspondingly, in training text target word.Noise identification model is to make an uproar to be labelled with target word The corresponding word vector set of training text of sound word, and be labelled with target word be non-noise word the corresponding word of training text to Quantity set is combined into training sample and is trained to obtain.The identification knot of noise word in the text to be identified that noise identification module 503 exports Fruit is used to indicate whether target word is noise word.
In one possible implementation, the identification device of noise word in the text that above-described embodiment provides, can be with Including:Training text acquisition module, training text conversion module and training module.Wherein:
Training text acquisition module forms training text set for obtaining multiple texts for being labelled with noise word.
Training text conversion module, for being converted to each word in the training text in training text set successively Word vector, obtains word vector set corresponding with training text.
Wherein, the distance between different word vectors characterize the relevance between its corresponding word.
Training module, for being input by the corresponding word vector set cooperation of training text, training Recognition with Recurrent Neural Network will instruct The Recognition with Recurrent Neural Network got is as noise word identification model.
Wherein, training text conversion module is specifically used for each word in the training text in training text set It is processed into vector data successively, and vector data is converted into word vector, obtains word vector set corresponding with training text.
The device for identifying noise words in text provided by the above embodiment may further include a mapping relation acquisition module.
The mapping relation acquisition module is configured to obtain the mapping relations between the vector data of each word that appears in the training text set and the corresponding word vector.
The to-be-identified text conversion module 502 is specifically configured to convert each word in the text to be identified into vector data in turn as target vector data, and to convert the target vector data into word vectors based on the mapping relations between the vector data of each word that appears in the training text set and the corresponding word vector.
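The two-stage conversion and the reuse of the training-time mapping relations can be sketched as follows. The specification does not say what form the "vector data" takes; modelling it as an integer index is an assumption, as are the function names, the `<unk>` placeholder and the zero vector used for words that never appeared in the training text set.

```python
UNKNOWN = "<unk>"  # hypothetical placeholder for words absent from the training text set


def build_mappings(training_texts, learned_vectors, dim):
    """Return (word -> vector data, vector data -> word vector) for the training text set.

    learned_vectors is assumed to hold an already-trained word vector for every training word.
    """
    word_to_data = {UNKNOWN: 0}
    data_to_vector = {0: [0.0] * dim}  # assumption: unknown words map to a zero vector
    for words in training_texts:
        for word in words:
            if word not in word_to_data:
                data = len(word_to_data)
                word_to_data[word] = data
                data_to_vector[data] = learned_vectors[word]
    return word_to_data, data_to_vector


def convert_text(words, word_to_data, data_to_vector):
    """Convert each word to target vector data, then to a word vector via the stored mapping."""
    target_data = [word_to_data.get(word, word_to_data[UNKNOWN]) for word in words]
    return [data_to_vector[data] for data in target_data]
```

Reusing the same mappings at recognition time keeps the word vectors of the text to be identified consistent with those the model saw during training.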
An embodiment of the present invention further provides a server group, which may include a memory 601 and a processor 602.
The memory 601 is configured to store a program;
The processor 602 is configured to execute the program so as to perform the following operations:
obtaining a text to be identified;
converting each word in the text to be identified into a word vector in turn, to obtain a word vector set corresponding to the text to be identified;
inputting the word vector set corresponding to the text to be identified into a pre-established noise word identification model, to obtain a recognition result, output by the noise word identification model, of the noise words in the text to be identified, wherein the noise word identification model is trained with the word vector sets corresponding to training texts labeled with noise words as training samples.
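The three operations performed by the processor can be sketched as a single function. This is a minimal illustration under stated assumptions: whitespace splitting stands in for a real word segmenter, the model is any callable returning one noise probability per word vector, and the function name and the 0.5 threshold are hypothetical.

```python
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def identify_noise_words(text: str,
                         word_vectors: Dict[str, Vector],
                         model: Callable[[List[Vector]], List[float]],
                         threshold: float = 0.5) -> List[str]:
    # Operation 1: obtain the text to be identified (assumed whitespace-segmented here).
    words = text.split()
    # Operation 2: convert each word to its word vector, in order.
    # A word missing from word_vectors raises KeyError in this sketch.
    vector_set = [word_vectors[word] for word in words]
    # Operation 3: feed the word vector set to the pre-established model.
    probabilities = model(vector_set)
    return [word for word, prob in zip(words, probabilities) if prob >= threshold]
```

Passing the model in as a callable keeps the recognition flow independent of how the model was trained or which library implements it.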
An embodiment of the present invention further provides a readable storage medium on which a computer program is stored; when the computer program is executed by a processor, each step of the method for identifying noise words in text provided by any of the above embodiments is implemented.
The embodiments in this specification are described progressively: each embodiment focuses on its differences from the others, and identical or similar parts may be cross-referenced between embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed method, device and equipment may be implemented in other ways. For example, the device embodiments described above are merely illustrative: the division into units is only a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection between devices or units through some communication interfaces, and may be electrical, mechanical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network elements. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, may each exist physically on their own, or two or more units may be integrated into one unit.
If the functions are implemented as software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk or an optical disc.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for identifying noise words in text, characterized by comprising:
obtaining a text to be identified;
converting each word in the text to be identified into a word vector in turn, to obtain a word vector set corresponding to the text to be identified;
inputting the word vector set corresponding to the text to be identified into a pre-established noise word identification model, to obtain a recognition result, output by the noise word identification model, of the noise words in the text to be identified, wherein the noise word identification model is trained with the word vector sets corresponding to training texts labeled with noise words as training samples.
2. The method for identifying noise words in text according to claim 1, characterized in that the text to be identified includes a target word;
the training texts include the target word;
the noise word identification model is trained using, as training samples, the word vector sets corresponding to training texts in which the target word is labeled as a noise word and the word vector sets corresponding to training texts in which the target word is labeled as a non-noise word;
the recognition result for noise words in the text to be identified indicates whether the target word is a noise word.
3. The method for identifying noise words in text according to claim 1, characterized in that the process of pre-establishing the noise word identification model comprises:
obtaining multiple texts labeled with noise words to form a training text set;
converting each word of each training text in the training text set into a word vector in turn, to obtain a word vector set corresponding to the training text, wherein the distance between different word vectors characterizes the relevance between their corresponding words;
training a recurrent neural network with the word vector sets corresponding to the training texts as input, and using the trained recurrent neural network as the noise word identification model.
4. The method for identifying noise words in text according to claim 3, characterized in that converting each word of each training text in the training text set into a word vector in turn comprises:
processing each word of each training text in the training text set into vector data in turn, and converting the vector data into word vectors, to obtain the word vector set corresponding to the training text.
5. The method for identifying noise words in text according to claim 4, characterized in that the method further comprises:
obtaining the mapping relations between the vector data of each word that appears in the training text set and the corresponding word vector;
wherein converting each word in the text to be identified into a word vector in turn comprises:
converting each word in the text to be identified into vector data in turn as target vector data, and converting the target vector data into word vectors based on the mapping relations between the vector data of each word that appears in the training text set and the corresponding word vector.
6. A device for identifying noise words in text, characterized by comprising: a to-be-identified text acquisition module, a to-be-identified text conversion module and a noise identification module;
the to-be-identified text acquisition module is configured to obtain a text to be identified;
the to-be-identified text conversion module is configured to convert each word in the text to be identified into a word vector in turn, to obtain a word vector set corresponding to the text to be identified;
the noise identification module is configured to input the word vector set corresponding to the text to be identified into a pre-established noise word identification model, to obtain a recognition result, output by the noise word identification model, of the noise words in the text to be identified, wherein the noise word identification model is trained with the word vector sets corresponding to training texts labeled with noise words as training samples.
7. The device for identifying noise words in text according to claim 6, characterized in that the text to be identified includes a target word;
the training texts include the target word;
the noise word identification model is trained using, as training samples, the word vector sets corresponding to training texts in which the target word is labeled as a noise word and the word vector sets corresponding to training texts in which the target word is labeled as a non-noise word;
the recognition result for noise words in the text to be identified indicates whether the target word is a noise word.
8. The device for identifying noise words in text according to claim 6, characterized by further comprising: a training text acquisition module, a training text conversion module and a training module;
the training text acquisition module is configured to obtain multiple texts labeled with noise words to form a training text set;
the training text conversion module is configured to convert each word of each training text in the training text set into a word vector in turn, to obtain a word vector set corresponding to the training text, wherein the distance between different word vectors characterizes the relevance between their corresponding words;
the training module is configured to train a recurrent neural network with the word vector sets corresponding to the training texts as input, and to use the trained recurrent neural network as the noise word identification model.
9. A server group, characterized by comprising: a memory and a processor;
the memory is configured to store a program;
the processor is configured to execute the program so as to perform the following operations:
obtaining a text to be identified;
converting each word in the text to be identified into a word vector in turn, to obtain a word vector set corresponding to the text to be identified;
inputting the word vector set corresponding to the text to be identified into a pre-established noise word identification model, to obtain a recognition result, output by the noise word identification model, of the noise words in the text to be identified, wherein the noise word identification model is trained with the word vector sets corresponding to training texts labeled with noise words as training samples.
10. A readable storage medium on which a computer program is stored, characterized in that, when the computer program is executed by a processor, each step of the method for identifying noise words in text according to any one of claims 1 to 5 is implemented.
CN201810195233.XA 2018-03-09 2018-03-09 Method, device, server group and storage medium for recognizing noise words in text Active CN108304387B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810195233.XA CN108304387B (en) 2018-03-09 2018-03-09 Method, device, server group and storage medium for recognizing noise words in text

Publications (2)

Publication Number Publication Date
CN108304387A true CN108304387A (en) 2018-07-20
CN108304387B CN108304387B (en) 2021-06-15

Family

ID=62849431

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810195233.XA Active CN108304387B (en) 2018-03-09 2018-03-09 Method, device, server group and storage medium for recognizing noise words in text

Country Status (1)

Country Link
CN (1) CN108304387B (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109147767A (en) * 2018-08-16 2019-01-04 平安科技(深圳)有限公司 Digit recognition method, device, computer equipment and storage medium in voice
CN109271526A (en) * 2018-08-14 2019-01-25 阿里巴巴集团控股有限公司 Method for text detection, device, electronic equipment and computer readable storage medium
CN110196909A (en) * 2019-05-14 2019-09-03 北京来也网络科技有限公司 Text denoising method and device based on intensified learning
CN111079854A (en) * 2019-12-27 2020-04-28 联想(北京)有限公司 Information identification method, device and storage medium
US11379660B2 (en) * 2019-06-27 2022-07-05 International Business Machines Corporation Deep learning approach to computing spans
CN111079854B (en) * 2019-12-27 2024-04-23 联想(北京)有限公司 Information identification method, equipment and storage medium

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030135356A1 (en) * 2002-01-16 2003-07-17 Zhiwei Ying Method and apparatus for detecting prosodic phrase break in a text to speech (TTS) system
CN104462378A (en) * 2014-12-09 2015-03-25 北京国双科技有限公司 Data processing method and device for text recognition
CN105956179A (en) * 2016-05-30 2016-09-21 上海智臻智能网络科技股份有限公司 Data filtering method and apparatus
CN106202330A (en) * 2016-07-01 2016-12-07 北京小米移动软件有限公司 The determination methods of junk information and device
CN106407971A (en) * 2016-09-14 2017-02-15 北京小米移动软件有限公司 Text recognition method and device
CN106815192A (en) * 2015-11-27 2017-06-09 北京国双科技有限公司 Model training method and device and sentence emotion identification method and device
CN106919554A (en) * 2016-10-27 2017-07-04 阿里巴巴集团控股有限公司 The recognition methods of invalid word and device in document
CN107122416A (en) * 2017-03-31 2017-09-01 北京大学 A kind of Chinese event abstracting method
US20170293597A1 (en) * 2016-04-07 2017-10-12 Khalifa University Of Science, Technology And Research Methods and systems for data processing
CN107273362A (en) * 2017-07-04 2017-10-20 联想(北京)有限公司 Data processing method and its equipment

Also Published As

Publication number Publication date
CN108304387B (en) 2021-06-15

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant