CN110377899A - Method, apparatus and electronic device for determining the part of speech of a word
- Publication number: CN110377899A
- Application number: CN201910464521.5A
- Authority: CN (China)
- Prior art keywords: word, speech, words, target, similarity
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The disclosure relates to a method, an apparatus, an electronic device and a computer-readable storage medium for determining the part of speech of a word. The method includes: obtaining a word set corresponding to a target text, and obtaining from the word set a target word whose part of speech needs to be determined; determining, according to the word vector corresponding to each word in the word set, the similarity between the target word and the other words in the word set; and determining the part of speech of the target word based on the similarity between the target word and the words of known part of speech among the other words. The method relies on the principle that words with the same or similar contexts have higher similarity, and that the higher the similarity between two words, the more likely they are to share a part of speech. The part of speech of the target word is computed automatically from its similarity to words of known part of speech, which is more efficient and more accurate than obtaining the part of speech by manual annotation.
Description
Technical field
The present disclosure relates to the field of network communication, and in particular to a method, an apparatus, an electronic device and a computer-readable storage medium for determining the part of speech of a word.
Background
The rapid development of new social media platforms has further lowered the threshold for communication between people. As a result, large numbers of new Internet words keep emerging in user comments, and correctly identifying these new words and their parts of speech is of great significance for processing user comments.
In the related art, the part of speech of a new word is still determined by manual screening and annotation; there is no method for obtaining the part of speech of a new word automatically. Because new Internet words are numerous and highly complex, manual annotation is inefficient and its results are not accurate enough.
Summary of the invention
To overcome the problems in the related art, the present disclosure provides a method, an apparatus, an electronic device and a computer-readable storage medium for determining the part of speech of a word.
According to a first aspect of the embodiments of the present disclosure, a method for determining the part of speech of a word is provided, including:
obtaining a word set corresponding to a target text, and obtaining from the word set a target word whose part of speech needs to be determined;
determining, according to the word vector corresponding to each word in the word set, the similarity between the target word and the other words in the word set, where the other words include words of known part of speech;
determining the part of speech of the target word based on the similarity between the target word and the words of known part of speech among the other words.
Optionally, determining the part of speech of the target word based on the similarity between the target word and the words of known part of speech among the other words includes:
selecting, from the other words, the words whose similarity to the target word is greater than a first similarity threshold, as similar words of the target word;
detecting whether the part of speech of each of the similar words is recorded in a part-of-speech table;
if the part-of-speech table records the part of speech of only some of the similar words, obtaining, from the similar words recorded in the part-of-speech table, the similar word of known part of speech that has the highest similarity to the target word;
determining the part of speech of that similar word as the part of speech of the target word.
Optionally, after detecting whether the part of speech of each of the similar words is recorded in the part-of-speech table, the method further includes:
if the part-of-speech table records the part of speech of all of the similar words, counting which part of speech occurs most frequently among the parts of speech of the similar words;
determining the most frequent part of speech as the part of speech of the target word.
Optionally, obtaining, from the similar words recorded in the part-of-speech table, the similar word of known part of speech that has the highest similarity to the target word includes:
detecting, in descending order of similarity to the target word, whether the part of speech of each similar word is recorded in the part-of-speech table, until a similar word recorded in the part-of-speech table is found, and taking that similar word as the similar word of known part of speech with the highest similarity to the target word.
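The selection rules above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name determine_pos, the dictionary representation of the part-of-speech table, and the behaviour when no similar word has a known part of speech (returning None) are all assumptions that the text does not specify. Ties between equally frequent parts of speech are resolved in favour of the one seen first in descending similarity order, which is also an assumption.

```python
from collections import Counter


def determine_pos(similarities, pos_table, first_threshold):
    """Pick a part of speech for a target word (hypothetical helper).

    similarities: dict mapping each other word to its similarity with the target.
    pos_table: dict mapping words of known part of speech to that part of speech.
    first_threshold: the first similarity threshold.
    """
    # Similar words: similarity strictly greater than the threshold,
    # arranged from most to least similar.
    similar = sorted(
        ((w, s) for w, s in similarities.items() if s > first_threshold),
        key=lambda pair: pair[1],
        reverse=True,
    )
    known = [w for w, _ in similar if w in pos_table]
    if not known:
        return None  # assumption: the text does not cover this case
    if len(known) == len(similar):
        # Every similar word is in the table: take the most frequent part of speech.
        return Counter(pos_table[w] for w in known).most_common(1)[0][0]
    # Only some are in the table: take the most similar word that is recorded.
    return pos_table[known[0]]
```

Because the similar words are already sorted, the first recorded word in the list is exactly the one found by scanning in descending order of similarity.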
Optionally, after determining the part of speech of the target word based on the similarity between the target word and the words of known part of speech among the other words, the method further includes:
adding the target word and its part of speech to the part-of-speech table to obtain an updated part-of-speech table, the updated part-of-speech table being used to determine the part of speech of the next target word.
Optionally, determining, according to the word vector corresponding to each word in the word set, the similarity between the target word and the other words in the word set includes:
obtaining the word vector corresponding to each word in the word set from a preset word-vector data table;
calculating the inner product of the word vector of the target word and the word vector of each other word in the word set;
calculating the product of the 2-norm of the word vector of the target word and the 2-norm of the word vector of each other word;
determining the similarity between the target word and each other word according to the ratio of the inner product to the product.
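The ratio of the inner product to the product of the 2-norms is the cosine similarity of the two word vectors. A minimal Python sketch follows; the function name cosine_similarity and the choice to return 0.0 for a zero vector are assumptions made for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Ratio of the inner product of u and v to the product of their 2-norms."""
    inner = sum(a * b for a, b in zip(u, v))
    norm_product = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm_product == 0.0:
        return 0.0  # assumption: the text does not cover zero vectors
    return inner / norm_product
```

Vectors pointing in the same direction give a value close to 1, and orthogonal vectors give 0, so a larger value means a more similar pair of words.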
Optionally, the preset word-vector data table is obtained as follows:
obtaining a training word set corresponding to each training text;
establishing a target vector for each word according to the number of occurrences of each word in the training word set;
determining, in the training word set, the window words corresponding to each word according to a window parameter chosen in advance;
combining each word in the training word set with each of its corresponding window words;
for each combination, training a target model with the target vector of the word in the combination as the input of the target model and the target vector of the window word in the combination as the desired output of the target model, and taking the vector output by the hidden layer of the target model as the word vector;
adding the words obtained by training on each training text, together with their word vectors, to the preset word-vector data table.
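The window-word and combination steps above can be sketched as follows. This covers only the construction of the (word, window word) training pairs, not the training of the target model itself. The function name window_pairs is hypothetical, and treating the window parameter as the number of positions taken on each side of a word is an assumption, since the text does not define it.

```python
def window_pairs(words, window):
    """Pair every word with each of its window words (hypothetical helper).

    words: the training word set as an ordered list of tokens.
    window: assumed to be the number of neighbours taken on each side.
    """
    pairs = []
    for i, center in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                # (input word, desired-output window word)
                pairs.append((center, words[j]))
    return pairs
```

For example, window_pairs(["a", "b", "c"], 1) pairs "b" with both neighbours and each end word with its single neighbour.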
Optionally, establishing the target vector corresponding to each word according to the number of occurrences of each word in the training word set includes:
counting the number of occurrences of each word in the training word set, and assigning an index number to each word in the training word set in descending order of the number of occurrences;
establishing the target vector corresponding to each word according to its index number.
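A minimal sketch of the indexing step is given below. The text only says that the target vector is established from the index number; representing it as a one-hot vector is an assumption, as are the function names build_index and one_hot and the rule that words with equal counts keep their order of first appearance.

```python
from collections import Counter


def build_index(words):
    """Assign index numbers in descending order of occurrence count."""
    counts = Counter(words)
    # sorted() is stable, so equal counts keep first-appearance order (assumption).
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {w: i for i, w in enumerate(ordered)}


def one_hot(word, index):
    """Build a one-hot target vector from the word's index number (assumption)."""
    vec = [0] * len(index)
    vec[index[word]] = 1
    return vec
```

With this convention the most frequent word receives index 0, so its target vector has a 1 in the first position.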
Optionally, after determining the part of speech of the target word based on the similarity between the target word and the words of known part of speech among the other words, the method further includes:
arranging the similar words of the target word in descending order of similarity to the target word, to obtain a similar-word sequence of the target word;
obtaining the word at a first set ordinal position in the similar-word sequence as a first reference word;
detecting whether the part of speech of the first reference word is recorded in the part-of-speech table;
if so, judging whether the part of speech of the target word is consistent with that of the first reference word;
if so, determining that the part of speech of the target word is accurate.
Optionally, after arranging the similar words of the target word in descending order of similarity to the target word to obtain the similar-word sequence of the target word, the method further includes:
obtaining the words ranked before a second set ordinal position in the similar-word sequence as second reference words;
counting the number of second reference words whose part of speech is the same as that of the target word;
if the proportion of that number in the total number of second reference words is greater than a preset proportion threshold, determining that the part of speech of the target word is accurate.
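The two accuracy checks described above (agreement with a single reference word, and the share of agreeing words among the top-ranked similar words) can be sketched as follows. The function names, the 0-based positions, and returning False whenever a check cannot confirm the part of speech are assumptions; the text only states the conditions under which the part of speech is deemed accurate.

```python
def check_by_reference(target_pos, similar_sequence, pos_table, position):
    """First check: compare with the word at a set position (0-based, assumption)."""
    if position >= len(similar_sequence):
        return False
    ref = similar_sequence[position]
    # The reference word must be recorded in the table and agree with the target.
    return ref in pos_table and pos_table[ref] == target_pos


def check_by_share(target_pos, similar_sequence, pos_table, top_k, share_threshold):
    """Second check: share of the words ranked before top_k that agree with the target."""
    refs = similar_sequence[:top_k]
    if not refs:
        return False
    same = sum(1 for w in refs if pos_table.get(w) == target_pos)
    return same / len(refs) > share_threshold
```

Here similar_sequence is the similar-word sequence already sorted in descending order of similarity, and pos_table is a dictionary standing in for the part-of-speech table.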
Optionally, after determining that the part of speech of the target word is accurate, the method further includes:
taking the target words whose part of speech is accurate as a first accurate word set;
obtaining the word at a third set ordinal position in the similar-word sequence as a third reference word;
if the third reference word belongs to the first accurate word set, the part of speech of the target word is the same as that of the third reference word, and the similarity between the target word and the third reference word is greater than a second similarity threshold, finally determining that the part of speech of the target word is accurate, where the second similarity threshold is greater than the first similarity threshold.
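The final confirmation is a conjunction of three conditions, which the following sketch makes explicit. The function name finally_accurate and the data representations (a set for the first accurate word set, a dictionary for the part-of-speech table) are assumptions for illustration.

```python
def finally_accurate(target_pos, ref_word, ref_similarity,
                     first_accurate, pos_table, second_threshold):
    """Return True only if all three conditions on the third reference word hold."""
    return (
        ref_word in first_accurate                  # reference word already confirmed
        and pos_table.get(ref_word) == target_pos   # same part of speech as the target
        and ref_similarity > second_threshold       # stricter, second similarity threshold
    )
```

Because the second similarity threshold is greater than the first, this check is stricter than the one used to select similar words in the first place.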
Optionally, after finally determining that the part of speech of the target word is accurate, the method further includes:
taking the target words whose part of speech is finally determined to be accurate as a second accurate word set;
obtaining the target words whose part of speech has been determined to be suspected inaccurate, to obtain a suspected-error word set;
obtaining the similar-word set corresponding to each suspected-error word in the suspected-error word set, to obtain a suspected-error similar-word set;
if the suspected-error similar-word set and the second accurate word set contain an identical word, and the part of speech of the identical word differs from that of the target word, determining that the part of speech of the target word is wrong.
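One possible reading of the error-detection step is sketched below: a suspected target word is marked wrong as soon as one of its similar words belongs to the second accurate word set and carries a different part of speech. Applying the rule per target word, the function name find_pos_errors and the data representations are all assumptions, since the text does not fix them.

```python
def find_pos_errors(suspected, similar_words, second_accurate, pos_table):
    """Return the suspected target words whose part of speech is judged wrong.

    suspected: target words with a suspected inaccurate part of speech.
    similar_words: dict mapping each target word to its similar words.
    second_accurate: set of words whose part of speech was finally confirmed.
    pos_table: dict holding the part of speech assigned to every word involved.
    """
    wrong = set()
    for target in suspected:
        for w in similar_words.get(target, []):
            # A confirmed similar word that disagrees with the target marks it wrong.
            if w in second_accurate and pos_table.get(w) != pos_table.get(target):
                wrong.add(target)
                break
    return wrong
```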
Optionally, after determining that the part of speech of the target word is wrong, the method further includes:
obtaining the similar words corresponding to the target word whose part of speech is wrong as an error-correction word set, and counting the number of occurrences of each part of speech in the error-correction word set;
if the error-correction word set contains a part of speech that occurs two or more times, replacing the part of speech of the target word whose part of speech is wrong with the most frequent part of speech;
if the error-correction word set contains no part of speech that occurs two or more times, replacing the part of speech of the target word whose part of speech is wrong with the part of speech of the word in the error-correction word set that has the highest similarity to that target word.
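The correction rule can be sketched as follows. The function name corrected_pos, the list-of-pairs representation of the error-correction word set, skipping words that are not in the part-of-speech table, and returning None when no usable word remains are assumptions; ties between equally frequent parts of speech fall to the one encountered first, which is also an assumption.

```python
from collections import Counter


def corrected_pos(correction_set, pos_table):
    """Choose a replacement part of speech for a wrongly tagged target word.

    correction_set: list of (similar word, similarity to the target word) pairs.
    pos_table: dict mapping words to their part of speech.
    """
    tagged = [(w, s) for w, s in correction_set if w in pos_table]
    if not tagged:
        return None  # assumption: the text does not cover this case
    pos, count = Counter(pos_table[w] for w, _ in tagged).most_common(1)[0]
    if count >= 2:
        # Some part of speech occurs at least twice: use the most frequent one.
        return pos
    # Otherwise fall back to the part of speech of the most similar word.
    best_word = max(tagged, key=lambda pair: pair[1])[0]
    return pos_table[best_word]
```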
According to a second aspect of the embodiments of the present disclosure, an apparatus for determining the part of speech of a word is provided, including:
a target word obtaining module, configured to obtain a word set corresponding to a target text, and to obtain from the word set a target word whose part of speech needs to be determined;
a similarity determining module, configured to determine, according to the word vector corresponding to each word in the word set, the similarity between the target word and the other words in the word set, where the other words include words of known part of speech;
a part-of-speech determining module, configured to determine the part of speech of the target word based on the similarity between the target word and the words of known part of speech among the other words.
Optionally, the part-of-speech determining module includes:
a similar word selecting module, configured to select, from the other words, the words whose similarity to the target word is greater than a first similarity threshold, as similar words of the target word;
a part-of-speech detecting module, configured to detect whether the part of speech of each of the similar words is recorded in a part-of-speech table;
a first similar word obtaining module, configured to, if the part-of-speech table records the part of speech of only some of the similar words, obtain, from the similar words recorded in the part-of-speech table, the similar word of known part of speech that has the highest similarity to the target word;
a first part-of-speech determining module, configured to determine the part of speech of the similar word of known part of speech that has the highest similarity to the target word as the part of speech of the target word.
Optionally, the part-of-speech determining module further includes:
a part-of-speech counting module, configured to, if the part-of-speech table records the part of speech of all of the similar words, count which part of speech occurs most frequently among the parts of speech of the similar words;
a second part-of-speech determining module, configured to determine the most frequent part of speech as the part of speech of the target word.
Optionally, the first similar word obtaining module includes:
a first similar word obtaining submodule, configured to detect, in descending order of similarity to the target word, whether the part of speech of each similar word is recorded in the part-of-speech table, until a similar word recorded in the part-of-speech table is found, and to take that similar word as the similar word of known part of speech with the highest similarity to the target word.
Optionally, the apparatus further includes:
a part-of-speech table updating module, configured to add the target word and its part of speech to the part-of-speech table to obtain an updated part-of-speech table, the updated part-of-speech table being used to determine the part of speech of the next target word.
Optionally, the similarity determining module includes:
a word vector obtaining submodule, configured to obtain the word vector corresponding to each word in the word set from a preset word-vector data table;
an inner product calculating submodule, configured to calculate the inner product of the word vector of the target word and the word vector of each other word in the word set;
a product calculating submodule, configured to calculate the product of the 2-norm of the word vector of the target word and the 2-norm of the word vector of each other word;
a similarity determining submodule, configured to determine the similarity between the target word and each other word according to the ratio of the inner product to the product.
Optionally, the word vector obtaining submodule includes:
a training word obtaining unit, configured to obtain a training word set corresponding to each training text;
a target vector establishing unit, configured to establish a target vector for each word according to the number of occurrences of each word in the training word set;
a window word determining unit, configured to determine, in the training word set, the window words corresponding to each word according to a window parameter chosen in advance;
a combining unit, configured to combine each word in the training word set with each of its corresponding window words;
a training unit, configured to, for each combination, train a target model with the target vector of the word in the combination as the input of the target model and the target vector of the window word in the combination as the desired output of the target model, and to take the vector output by the hidden layer of the target model as the word vector;
an adding unit, configured to add the words obtained by training on each training text, together with their word vectors, to the preset word-vector data table.
Optionally, the target vector establishing unit includes:
an index number setting subunit, configured to count the number of occurrences of each word in the training word set, and to assign an index number to each word in the training word set in descending order of the number of occurrences;
a target vector establishing subunit, configured to establish the target vector corresponding to each word according to its index number.
Optionally, the apparatus further includes:
a similar-word sequence obtaining module, configured to arrange the similar words of the target word in descending order of similarity to the target word, to obtain a similar-word sequence of the target word;
a first reference word obtaining module, configured to obtain the word at a first set ordinal position in the similar-word sequence as a first reference word;
a detecting module, configured to detect whether the part of speech of the first reference word is recorded in the part-of-speech table;
a judging module, configured to, if so, judge whether the part of speech of the target word is consistent with that of the first reference word;
a first accuracy determining module, configured to, if so, determine that the part of speech of the target word is accurate.
Optionally, the apparatus further includes:
a second reference word obtaining module, configured to obtain the words ranked before a second set ordinal position in the similar-word sequence as second reference words;
a same-part-of-speech counting module, configured to count the number of second reference words whose part of speech is the same as that of the target word;
a second accuracy determining module, configured to determine that the part of speech of the target word is accurate if the proportion of that number in the total number of second reference words is greater than a preset proportion threshold.
Optionally, the apparatus further includes:
a first accurate word set determining module, configured to take the target words whose part of speech is accurate as a first accurate word set;
a third reference word obtaining module, configured to obtain the word at a third set ordinal position in the similar-word sequence as a third reference word;
a final accuracy determining module, configured to finally determine that the part of speech of the target word is accurate if the third reference word belongs to the first accurate word set, the part of speech of the target word is the same as that of the third reference word, and the similarity between the target word and the third reference word is greater than a second similarity threshold, where the second similarity threshold is greater than the first similarity threshold.
Optionally, the apparatus further includes:
a second accurate word set determining module, configured to take the target words whose part of speech is finally determined to be accurate as a second accurate word set;
a suspected-error word set obtaining module, configured to obtain the target words whose part of speech has been determined to be suspected inaccurate, to obtain a suspected-error word set;
a suspected-error similar-word set obtaining module, configured to obtain the similar-word set corresponding to each suspected-error word in the suspected-error word set, to obtain a suspected-error similar-word set;
a part-of-speech error determining module, configured to determine that the part of speech of the target word is wrong if the suspected-error similar-word set and the second accurate word set contain an identical word and the part of speech of the identical word differs from that of the target word.
Optionally, the apparatus further includes:
an error-correction word set obtaining module, configured to obtain the similar words corresponding to the target word whose part of speech is wrong as an error-correction word set, and to count the number of occurrences of each part of speech in the error-correction word set;
a first replacing module, configured to, if the error-correction word set contains a part of speech that occurs two or more times, replace the part of speech of the target word whose part of speech is wrong with the most frequent part of speech;
a second replacing module, configured to, if the error-correction word set contains no part of speech that occurs two or more times, replace the part of speech of the target word whose part of speech is wrong with the part of speech of the word in the error-correction word set that has the highest similarity to that target word.
According to a third aspect of the embodiments of the present disclosure, an electronic device is provided, including: a processor; and a memory for storing instructions executable by the processor; where the processor is configured to perform the method for determining the part of speech of a word according to the first aspect.
According to a fourth aspect of the embodiments of the present disclosure, an application program/computer program product is provided; when instructions in the storage medium are executed by a processor of a mobile terminal, the mobile terminal is enabled to perform the method for determining the part of speech of a word according to the first aspect.
The technical solutions provided by the embodiments of the present disclosure can have the following beneficial effects:
In the embodiments of the present invention, a word set corresponding to a target text is obtained, and a target word whose part of speech needs to be determined is obtained from the word set; the similarity between the target word and the other words in the word set is determined according to the word vector corresponding to each word in the word set; and the part of speech of the target word is finally determined based on the similarity between the target word and the words of known part of speech among the other words. The method relies on the principle that words with the same or similar contexts have higher similarity, and that the higher the similarity between two words, the more likely they are to share a part of speech. The part of speech of the target word is computed automatically from its similarity to words of known part of speech, which is more efficient and more accurate than obtaining the part of speech by manual annotation.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory, and do not limit the present disclosure.
Brief description of the drawings
The drawings herein are incorporated into and form part of this specification, show embodiments consistent with the present invention, and together with the specification serve to explain the principles of the present invention.
Fig. 1 is a flow chart of a first method for determining the part of speech of a word, according to an exemplary embodiment;
Fig. 2 is a flow chart of a second method for determining the part of speech of a word, according to an exemplary embodiment;
Fig. 3 is a schematic diagram of term vector learning using the skip-gram term vector model, according to an exemplary embodiment;
Fig. 4 is a flow chart of a third method for determining the part of speech of a word, according to an exemplary embodiment;
Fig. 5 is a block diagram of a first apparatus for determining the part of speech of a word, according to an exemplary embodiment;
Fig. 6 is a block diagram of a second apparatus for determining the part of speech of a word, according to an exemplary embodiment;
Fig. 7 is a block diagram of a third apparatus for determining the part of speech of a word, according to an exemplary embodiment;
Fig. 8 is a block diagram of an optional electronic device, according to an exemplary embodiment;
Fig. 9 is a block diagram of another optional electronic device, according to an exemplary embodiment.
Detailed description of the embodiments
Exemplary embodiments are described in detail here, and examples of them are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the present invention as detailed in the appended claims.
Fig. 1 is a flow chart of a first method for determining the part of speech of a word, according to an exemplary embodiment. As shown in Fig. 1, the method is used in a terminal and includes the following steps.
In step S101, a set of words corresponding to a target text is obtained, and a target word whose part of speech needs to be determined is obtained from the set of words.
In the embodiments of the present invention, the target text may be text information published by a user on a network platform, such as a user comment, a short status post, a friend-circle post, or an article. Such text may contain internet neologisms, specialized vocabulary, and the like. Internet neologisms are informal, popular online expressions created through homophones, deliberate misspellings and similar means, such as "refreshing horse" and "beating call"; specialized vocabulary consists of new words that arise with new things and new ideas, such as "clapping fast hand" and "people sets". Because these words are newly emerged, their parts of speech are unknown; the embodiments of the present invention take such internet neologisms and specialized vocabulary as target words and determine their parts of speech. The part of speech is the grammatical characteristic of a word that serves as the basis for classifying words. The words of Modern Chinese can be divided into 14 parts of speech: noun, verb, adjective, distinguishing word, pronoun, numeral, measure word, adverb, preposition, conjunction, auxiliary word, modal particle, onomatopoeia, and interjection.
First, word segmentation is performed on the target text to obtain all the words it contains, including the target words whose part of speech is to be determined. Chinese words are composed of individual characters, and many characters cannot stand alone as a word or serve a grammatical function; the process of cutting a continuous run of characters into meaningful words is called word segmentation. To obtain the target words in the target text, a new-word dictionary can be established in advance that contains as many of the new words that have appeared so far as possible. The word segmentation software is updated with this new-word dictionary and then used to segment the target text, producing a set of words that includes the target words whose part of speech needs to be determined.
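The segmentation step can be illustrated with a minimal sketch. The source does not say which segmentation algorithm the software uses; forward maximum matching is assumed here purely for illustration, and the sample sentence, the dictionary entries, and the names `segment`, `base_dictionary` and `new_words` are all hypothetical.

```python
def segment(text, dictionary, max_len=4):
    """Forward maximum matching (an assumed algorithm, not the patent's):
    at each position take the longest dictionary word, falling back to a
    single character when nothing matches."""
    words = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words

# Hypothetical dictionaries: ordinary words plus a new-word dictionary.
base_dictionary = {"这个", "很", "稳"}
new_words = {"人设"}
dictionary = base_dictionary | new_words  # "update" the segmenter with new words

word_set = segment("这个人设很稳", dictionary)
# Target words are the segmented words that come from the new-word dictionary.
target_words = [w for w in word_set if w in new_words]
```

Without the new-word entry, the same sentence would be cut into single characters at that position, which is why the dictionary is updated before segmentation.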
It should be noted that, besides internet neologisms and specialized vocabulary, target words may also include other types of words, for example known words whose meaning and part of speech have changed. The embodiments of the present invention do not specifically limit the types of words that target words may include.
In step S102, the similarity between the target word and the other words in the set of words is determined according to the term vector of each word in the set, where the other words include words of known part of speech.
In the embodiments of the present invention, "term vector" (word embedding) is the collective name for a group of language modeling and feature learning techniques in natural language processing (NLP) in which words or phrases from a vocabulary are mapped to vectors of real numbers. Obtaining the term vector of each word in the set of words makes it possible to realize both the embedded representation of the vocabulary and the sparse representation of its high-dimensional vectors.
The term vector of a word may be obtained as follows: the words of a training text are input into a target model for training, yielding the term vector corresponding to each word. These term vectors are added to a preset term vector data table, and the term vector of each word in the set of words can then be looked up in that table.
In general, words with the same or similar context have term vectors that lie close together, and such words are also more similar in meaning and part of speech. Based on this principle, the similarity between the target word and the other words in the set of words can be determined from their term vectors. Specifically, the similarity between words can be calculated from a distance between the target word and each other word, such as cosine distance, Euclidean distance, or edit distance.
In addition, in order to determine the part of speech of the target word, the words in the set of words other than the target word must include words of known part of speech.
In step S103, the part of speech of the target word is determined based on the similarity between the target word and the words of known part of speech among the other words.
In the embodiments of the present invention, because two words with higher similarity are more likely to have the same part of speech, the part of speech of the target word can be determined by analyzing and processing the parts of speech of the other words in the set of words that are similar to the target word.
In summary, in the embodiments of the present invention, a set of words corresponding to a target text is obtained, and a target word whose part of speech needs to be determined is obtained from the set of words; the similarity between the target word and the other words in the set of words is determined according to the term vector of each word in the set; and the part of speech of the target word is finally determined based on the similarity between the target word and those other words whose part of speech is known. The method relies on the principle that words with the same or similar context are more similar, and that words with higher similarity are more likely to share a part of speech. Because the part of speech of the target word is computed automatically from its similarity to words of known part of speech, the method is more efficient and more accurate than obtaining parts of speech by manual annotation.
Fig. 2 is a flow chart of a second method for determining the part of speech of a word, according to an exemplary embodiment. This method is an optional embodiment of the method in Fig. 1. As shown in Fig. 2, the method is used in a terminal and includes the following steps.
In step S201, a set of training word sequences corresponding to a training text is obtained.
In the embodiments of the present invention, the training text may be multiple pieces of text information obtained from the network. Each piece of text is segmented based on the new-word dictionary to obtain a corresponding training word sequence, and the training word sequences of all the pieces are gathered into a set of training word sequences, which can be written as {training word sequence 1, training word sequence 2, ...}. Naturally, this set also contains the new words in the new-word dictionary.
In step S202, a target vector is established for each word according to the number of occurrences of each word in the training word set.
In the embodiments of the present invention, the number of occurrences of each word in the set of training word sequences is counted, and the target vector of each word is determined from those counts. A word vector is a way of expressing human language mathematically; the simplest form is the one-hot vector, so the target vector may be a one-hot vector.
Optionally, step S202 includes the following steps S2021 to S2022:
In step S2021, the number of occurrences of each word in the training word set is counted, and an index number is assigned to each word in order of occurrence count from high to low.
In the embodiments of the present invention, the number of occurrences of each word in the set of training word sequences is counted, the words are sorted by occurrence count from high to low, and the index number of each word is set according to its rank. For example, if the word "people sets" ranks first by occurrence count in the set of training word sequences, its index number is set to 1; if the word "plot" ranks second, its index number is set to 2.
In step S2022, the target vector of each word is established according to its index number.
In the embodiments of the present invention, each word in the sorted training word set can be represented, according to its index number, by an M-dimensional sparse vector, i.e., a one-hot vector, in which only the element at the position given by the word's index number is 1 and all other elements are 0. For example, if the dimension M of the one-hot vector is 4, the one-hot vector of the word "people sets" with index number 1 is [1,0,0,0], and the one-hot vector of the word "plot" with index number 2 is [0,1,0,0].
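Steps S2021 and S2022 can be sketched as follows. This is only an illustration: the function names `build_index` and `one_hot`, the sample sequences, and the alphabetical tie-break for equal counts are assumptions that do not appear in the source.

```python
from collections import Counter

def build_index(sequences):
    """Assign index numbers 1, 2, ... in order of descending occurrence count."""
    counts = Counter(word for seq in sequences for word in seq)
    # Ties are broken alphabetically; the source does not specify a tie rule.
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def one_hot(index_number, dimension):
    """Build a vector with a 1 at the position given by the index number."""
    vector = [0] * dimension
    vector[index_number - 1] = 1
    return vector

# Hypothetical set of training word sequences.
sequences = [
    ["people sets", "plot", "people sets"],
    ["people sets", "plot", "actor"],
    ["people sets", "script"],
]
index = build_index(sequences)
vectors = {word: one_hot(i, len(index)) for word, i in index.items()}
```

With four distinct words the dimension is 4, so "people sets" and "plot" receive the same vectors as in the example above.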
In step S203, the window words of each word in the training word set are determined according to a preselected window parameter.
In the embodiments of the present invention, the core idea of term vectors based on canonical correlation analysis is that the words within a window of specified length to the left and right of a given word in a passage should be related to it. In other words, the few words to the left of a word form its preceding context, the few words to its right form its following context, and the relationship between the word and its context should be made as close as possible. This leads to the concept of the window words of a word. Suppose the window parameter is 2; then each of the left and right windows of a word contains 2 window words. For example, if one training word sequence in the set of training word sequences is {W1, W2, W3, W4, W5, W6, ..., WN}, the window words of W4 are {W2, W3} and {W5, W6}.
In step S204, each word in the training word set is combined with its corresponding window words.
In the embodiments of the present invention, window words are selected for each word in a training word sequence according to the window parameter, and each word is combined with its window words. For example, if the selected word C is W4 and its window words W are {W2, W3} and {W5, W6}, the resulting combination {C, W} is {W4, {W2, W3}, {W5, W6}}.
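Steps S203 and S204 can be sketched as below. The function name `window_combinations` is hypothetical, and truncating the window at the ends of a sequence is an assumption, since the source does not say how boundary words are handled.

```python
def window_combinations(sequence, window=2):
    """Return one (word, left window words, right window words) per word."""
    combinations = []
    for i, word in enumerate(sequence):
        left = sequence[max(0, i - window):i]   # up to `window` words before
        right = sequence[i + 1:i + 1 + window]  # up to `window` words after
        combinations.append((word, left, right))
    return combinations

sequence = ["W1", "W2", "W3", "W4", "W5", "W6"]
combinations = window_combinations(sequence, window=2)
w4_combination = combinations[3]
```

Here `w4_combination` corresponds to the combination {W4, {W2, W3}, {W5, W6}} from the text.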
In step S205, for each combination, the target vector of the word in the combination is used as the input of the target model and the target vectors of the window words in the combination are used as the desired output of the target model; the target model is trained in this way, and the hidden-layer vector output by the target model is taken as the term vector.
In the embodiments of the present invention, the one-hot vector of the word C in each combination {C, W} is used as the input of the target model, the one-hot vectors of the window words W are used as its desired output, and the output of the target model is the probability of each window word of C; the target model is trained accordingly. The target model is a model for computing term vectors, for example a skip-gram model trained with the word2vec term vector learning method. Skip-gram is a basic three-layer neural network consisting of an input layer, one hidden layer, and an output layer; the hidden layer has no activation function, and the activation function of the output layer is softmax regression. Its input is the one-hot representation of the current word C, its output is the probabilities of the window words, and its target output is the one-hot vectors of the window words W. If the training word set is {W1, W2, ..., WN} and the dimension of the embedded representation is M, then the input layer and the output layer of the network each have N nodes, the hidden layer has M nodes, the input-to-hidden weight matrix is WN*M, and the i-th row Vi of WN*M is the term vector of word Wi.
Fig. 3 is a schematic diagram of term vector learning using the skip-gram term vector model, according to an exemplary embodiment. In Fig. 3, the skip-gram term vector model includes an input layer, a hidden layer, and an output layer; the one-hot representation of the current input word C is {0,0,0,0,1,0, ..., 0}, and the outputs are the probabilities P1, P2, P3, ..., Pn of the window words of C.
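The three-layer structure described above can be sketched in plain Python. This is a toy illustration under stated assumptions, not the patented training procedure: it uses a full softmax over all N output nodes and ordinary gradient descent rather than negative sampling, and the names `forward` and `train_step`, the initial weights, and the learning rate are all hypothetical.

```python
import math

def forward(center_index, w_in, w_out):
    """Return (hidden vector, output probabilities) for one input word."""
    # A one-hot input simply selects one row of the input-to-hidden matrix;
    # the hidden layer has no activation function.
    hidden = list(w_in[center_index])
    scores = [sum(h * o for h, o in zip(hidden, row)) for row in w_out]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return hidden, [e / total for e in exps]

def train_step(center_index, window_index, w_in, w_out, learning_rate=0.1):
    """One gradient step that raises the probability of one window word."""
    hidden, probs = forward(center_index, w_in, w_out)
    # Gradient of -log(probs[window_index]) with respect to the output scores.
    errors = [p - (1.0 if j == window_index else 0.0) for j, p in enumerate(probs)]
    hidden_grad = [sum(errors[j] * w_out[j][k] for j in range(len(w_out)))
                   for k in range(len(hidden))]
    for j, row in enumerate(w_out):
        for k in range(len(row)):
            row[k] -= learning_rate * errors[j] * hidden[k]
    for k in range(len(hidden)):
        w_in[center_index][k] -= learning_rate * hidden_grad[k]

# N = 3 words and M = 2 hidden nodes; the initial weights are arbitrary.
w_in = [[0.1, 0.2], [0.0, -0.1], [0.3, 0.1]]
w_out = [[0.2, 0.1], [-0.1, 0.3], [0.0, 0.0]]
_, before = forward(0, w_in, w_out)
train_step(0, 2, w_in, w_out)  # word 0 is the input, word 2 a window word
_, after = forward(0, w_in, w_out)
```

After training on all combinations, row i of `w_in` plays the role of the term vector of word i. A full softmax costs time proportional to N per step, which is why practical word2vec implementations use approximations such as negative sampling.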
During model training, let the combination of the current word and its window words be {C, W}, and let θ denote the network parameters. The training objective f is then:
$$f = \arg\max_{\theta} \prod_{(c,w) \in D} p(c \mid w; \theta)$$
where D is the set of all combinations {C, W} and p(c | w; θ) is the conditional probability of word C given window word W.
Let the hidden-to-output weight matrix be W'N*M, let Vc be the vector of word C in the input-to-hidden weight matrix WN*M, and let Vw be the vector of window word W in the hidden-to-output weight matrix W'N*M. The output conditional probability p(c | w; θ) is then:
$$p(c \mid w; \theta) = \frac{\exp(V_c \cdot V_w)}{\sum_{c'} \exp(V_{c'} \cdot V_w)}$$
where c' ranges over all words in the training word set.
Taking the logarithm gives the training objective:
$$\arg\max_{\theta} \sum_{(c,w) \in D} \log p(c \mid w; \theta) = \arg\max_{\theta} \sum_{(c,w) \in D} \Big( V_c \cdot V_w - \log \sum_{c'} \exp(V_{c'} \cdot V_w) \Big)$$
Differentiating the log objective with respect to θ gives the update equation for the network parameters, and θ can be optimized using negative sampling.
The combinations {C, W} of all words and their window words are input to train the skip-gram term vector model, which finally yields the term vector representation of each word.
In step S206, the words obtained from the training text and their corresponding term vectors are added to the preset term vector data table.
In the embodiments of the present invention, the preset term vector data table is obtained from the words of the training text and the term vector representations finally obtained for them by training.
Steps S201 to S206 above are the process of obtaining the preset term vector data table. In the embodiments of the present invention, this process may be carried out before the set of words corresponding to the target text is obtained, but it does not need to be carried out before every such acquisition; the term vector data table can instead be updated periodically, with steps S201 to S206 performed at each update.
In step S207, a set of words corresponding to a target text is obtained, and a target word whose part of speech needs to be determined is obtained from the set of words.
In the embodiments of the present invention, for step S207 reference may be made to step S101; details are not repeated here.
In step S208, the term vector of each word in the set of words is obtained from the preset term vector data table.
In the embodiments of the present invention, the term vector of each word can be obtained from the preset term vector data table.
In step S209, the inner product between the term vector of the target word and the term vector of each other word in the set of words is calculated.
In the embodiments of the present invention, let the term vector of the target word be vi = (vi1, vi2, ..., vim). Take any other word in the set of words besides the target word and let its term vector be vj = (vj1, vj2, ..., vjm); the inner product of the two term vectors is then (vi, vj).
In step S210, the product of the 2-norm of the term vector of the target word and the 2-norm of the term vector of each other word is calculated.
In the embodiments of the present invention, the product of the 2-norms of the two term vectors is ||vi|| ||vj||.
In step S211, the similarity between the target word and each other word is determined from the ratio of the inner product to the product.
In the embodiments of the present invention, the cosine distance method can be used to calculate the similarity between term vectors; the similarity s between the target word and another word is then calculated as:
$$s = \frac{(v_i, v_j)}{\|v_i\| \, \|v_j\|}$$
where (vi, vj) is the inner product of vi and vj, and ||·|| denotes the 2-norm.
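Steps S209 to S211 amount to the following computation. The function name `cosine_similarity` and the decision to return 0.0 for a zero vector are assumptions made for this sketch; the source does not discuss zero vectors.

```python
import math

def cosine_similarity(vi, vj):
    """Ratio of the inner product to the product of the 2-norms."""
    inner = sum(a * b for a, b in zip(vi, vj))                # step S209
    norm_product = (math.sqrt(sum(a * a for a in vi))
                    * math.sqrt(sum(b * b for b in vj)))      # step S210
    if norm_product == 0.0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return inner / norm_product                               # step S211

s_parallel = cosine_similarity([1.0, 2.0], [2.0, 4.0])
s_orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
s_close = cosine_similarity([3.0, 4.0], [4.0, 3.0])
```

Vectors pointing in the same direction score 1, orthogonal vectors score 0, and the third pair scores 24/25.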
In step S212, words whose similarity to the target word is greater than a first similarity threshold are selected from the other words as the similar words of the target word.
In the embodiments of the present invention, the words whose similarity to the target word is greater than the first similarity threshold are selected as similar words from the words in the set of words other than the target word. The purpose of the first similarity threshold is to ensure that the similar words are sufficiently similar to the target word; if the selected similar words were too dissimilar from the target word, they would have no reference value for determining its part of speech. The first similarity threshold can be preset according to the actual situation; the embodiments of the present invention do not limit its specific value.
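The selection in step S212 is a simple filter. In this sketch the function name `select_similar_words`, the similarity values, and the threshold 0.6 are made up for illustration; the source leaves the threshold value open.

```python
def select_similar_words(similarities, first_threshold):
    """Keep the words whose similarity to the target word exceeds the threshold.

    `similarities` maps each other word to its similarity with the target word.
    """
    return {word: s for word, s in similarities.items() if s > first_threshold}

# Hypothetical similarities between a target word and the other words.
similarities = {"plot": 0.82, "actor": 0.64, "the": 0.12, "script": 0.60}
similar_words = select_similar_words(similarities, first_threshold=0.6)
```

The comparison is strict, following the wording "greater than", so a word whose similarity equals the threshold is not selected.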
In step S213, it is detected whether the part of speech of each of the similar words is recorded in the part of speech table.
In the embodiments of the present invention, a preset part of speech table is queried to check whether the part of speech of each similar word is known; any word of unknown part of speech is marked, for example by adding the marker symbol U to it.
If the part of speech table records the parts of speech of only some of the similar words, step S214 is executed; if it records the parts of speech of all of the similar words, step S216 is executed.
In step S214, if the part of speech table records the parts of speech of only some of the similar words, the similar word of known part of speech with the highest similarity to the target word is obtained from the similar words recorded in the part of speech table.
That is, if the parts of speech of not all similar words are known, the word with the highest similarity to the target word is obtained from among the similar words of known part of speech.
Optionally, obtaining, from the similar words recorded in the part of speech table, the similar word of known part of speech with the highest similarity to the target word includes:
In descending order of similarity to the target word, detecting one by one whether the part of speech of each similar word is recorded in the part of speech table until a similar word recorded in the table is found, and taking that similar word as the similar word of known part of speech with the highest similarity to the target word.
Specifically, in the embodiments of the present invention, the similar words can be sorted in descending order of similarity to the target word to obtain a similar word sequence. Using the preset part of speech table, it is judged whether the first-ranked word in the similar word sequence is recorded in the table. If it is, the first-ranked similar word is taken as the similar word of known part of speech with the highest similarity to the target word. If it is not, it is judged whether the second-ranked word in the sequence is recorded; if so, the second-ranked similar word is taken as the similar word of known part of speech with the highest similarity to the target word. If not, the third, fourth, ..., K-th ranked words are checked in turn until a similar word recorded in the part of speech table is found, and that similar word is taken as the similar word of known part of speech with the highest similarity to the target word.
In step S215, the part of speech of the similar word of known part of speech with the highest similarity to the target word is determined as the part of speech of the target word.
In the embodiments of the present invention, based on the fact that highly similar words are more likely to have the same part of speech, the part of speech of the most similar word of known part of speech is determined as the part of speech of the target word.
The method of determining the part of speech of the target word in steps S212 to S215 makes full use of the principle that highly similar words are more likely to share a part of speech, and uses the most similar of the other words to determine the part of speech of the target word, so the part of speech is determined quickly and with high accuracy.
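Steps S214 and S215 can be sketched as a scan over the similar words in descending order of similarity. The function name `pos_from_most_similar`, the sample words, and returning `None` when no similar word has a known part of speech are assumptions; the source does not cover that last case.

```python
def pos_from_most_similar(similar_words, pos_table):
    """Return the part of speech of the most similar word found in the table.

    `similar_words` is a list of (word, similarity) pairs and `pos_table`
    maps words of known part of speech to that part of speech.
    """
    ranked = sorted(similar_words, key=lambda pair: pair[1], reverse=True)
    for word, _similarity in ranked:
        if word in pos_table:  # first hit is the most similar known word
            return pos_table[word]
    return None  # assumption: no similar word has a known part of speech

# Hypothetical data: "buzzword" is itself of unknown part of speech.
similar_words = [("run", 0.70), ("buzzword", 0.90), ("plot", 0.80)]
pos_table = {"plot": "noun", "run": "verb"}
target_pos = pos_from_most_similar(similar_words, pos_table)
```

The most similar word, "buzzword", is skipped because it is not in the table, so the target word takes the part of speech of "plot".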
In step S216, if the part of speech table records the parts of speech of all of the similar words, the part of speech that occurs most frequently among the parts of speech of the similar words is counted.
In the embodiments of the present invention, a preset part of speech table is queried to check whether the part of speech of each word in the similar word sequence is known; any word of unknown part of speech is marked, for example by adding the marker symbol U to it.
In step S217, the most frequent part of speech is determined as the part of speech of the target word.
In the embodiments of the present invention, on the principle that the target word is likely to have the same part of speech as its similar words, the part of speech that occurs most frequently among the similar words is determined as the part of speech of the target word.
For example, let the words of known part of speech in the set of words {W1, W2, ..., WN} be {U1, U2, ..., UL}, and let the sequence of similar words of the target word W be {W1, W2, ..., WK}, where K < N. If {W1, W2, ..., WK} all belong to {U1, U2, ..., UL}, the part of speech of W can be taken as the most frequent part of speech in {W1, W2, ..., WK}.
The method of determining the part of speech of the target word in steps S216 to S217 applies when the parts of speech of all similar words are known: the most frequent part of speech is chosen as the part of speech of the target word. This method does not need to search for the most similar word and is therefore simpler.
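Steps S216 and S217 are a majority vote over the similar words. In this sketch the function name `pos_by_majority` and the sample words are hypothetical, and the source does not say how a tie between two parts of speech should be resolved.

```python
from collections import Counter

def pos_by_majority(similar_words, pos_table):
    """Return the part of speech that occurs most often among the similar words.

    Assumes every similar word is recorded in `pos_table` (the S216 case).
    """
    tags = [pos_table[word] for word in similar_words]
    return Counter(tags).most_common(1)[0][0]

# Hypothetical similar words, all of known part of speech.
similar_words = ["plot", "role", "act"]
pos_table = {"plot": "noun", "role": "noun", "act": "verb"}
target_pos = pos_by_majority(similar_words, pos_table)
```

Two of the three similar words are nouns, so the target word is tagged as a noun.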
In step S218, the target word and its part of speech are added to the part of speech table to obtain an updated part of speech table, which is used to determine the part of speech of the next target word.
In the embodiments of the present invention, after the part of speech of a target word has been determined, the target word can be added to the part of speech table to enlarge the number of words in it. After the table is updated, if the target word whose part of speech was just determined is a similar word of the next target word, it can serve as a basis for determining the part of speech of that next target word.
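The effect of the table update in step S218 can be seen in a short sketch. The function name `tag_targets` and the sample data are hypothetical, and for brevity the sketch uses only the "most similar known word" rule to tag each target word.

```python
def tag_targets(targets, similar_words_of, pos_table):
    """Tag target words one by one, adding each result to the table.

    `similar_words_of[target]` lists the similar words of a target word in
    descending order of similarity.
    """
    table = dict(pos_table)  # work on a copy of the original table
    for target in targets:
        known = [w for w in similar_words_of[target] if w in table]
        if known:
            table[target] = table[known[0]]  # step S218: update the table
    return table

# Hypothetical case: the only similar word of "B" is the earlier target "A".
similar_words_of = {"A": ["plot"], "B": ["A"]}
updated_table = tag_targets(["A", "B"], similar_words_of, {"plot": "noun"})
```

"B" can only be tagged because "A" was added to the table first, which is the benefit the update step provides.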
In summary, in addition to the beneficial effects of the method shown in embodiment one, the method for determining the part of speech of a word provided by the embodiments of the present invention uses the word2vec term vector learning method to determine the term vector of each word in the set of words, calculates the similarity between the target word and the other words in the set from the term vectors, obtains the similar words of the target word, and then uses different methods to determine the part of speech of the target word depending on whether the parts of speech of the similar words are known or unknown. The principle is simple, the method is highly operable, and the part of speech of the target word is determined quickly. Moreover, the method introduces a first similarity threshold, which guarantees that the words relied on when determining the part of speech of the target word are highly similar to it, and thus ensures the accuracy of the result.
Fig. 4 is a flow chart of a third method for determining the part of speech of a word, according to an exemplary embodiment. As shown in Fig. 4, the method for determining the accuracy of a word's part of speech is used in a terminal. It may be executed after step S217 of the embodiments of the present invention, or after the parts of speech of all target words have been determined, and specifically includes the following steps.
In step S301, the similar words of the target word are arranged in descending order of similarity to the target word to obtain the similar word sequence of the target word.
In the embodiments of the present invention, the words in the similar word sequence are arranged in descending order of similarity to the target word: the word most similar to the target word is first in the sequence with serial number 1, the second most similar word is second with serial number 2, and so on, giving the similar word sequence of the target word.
In step S302, the accuracy of the part of speech of the target word is determined, and the target words whose part of speech is accurate are obtained.
Optionally, the accuracy of the part of speech of the target word may be determined using steps S303 to S306, or using the scheme provided by steps S307 to S309.
After step S301 is completed, an instruction may be input to select execution of step S303 or step S307; alternatively, steps S303 and S307 may be executed simultaneously.
In step S303, the word with the first set serial number in the similar word sequence is obtained as the first reference word.
Specifically, the first set serial number can be set according to the actual situation; a word with high similarity to the target word is generally chosen as the first reference word. Preferably, the first set serial number is 1, in which case the first reference word is the first-ranked word in the similar word sequence.
In step S304, it is detected whether the part of speech of the first reference word is recorded in the part of speech table.
Optionally, the part of speech table in this embodiment is the updated part of speech table obtained by adding the target words and their parts of speech to the part of speech table.
After the parts of speech of all target words in the target text have been determined, the target words and their parts of speech can be added to the part of speech table to obtain an updated part of speech table. Because many target words have been added to the updated table, if one of them happens to be a similar word of the target word whose part-of-speech accuracy is to be determined, it can provide a basis for that determination.
This step detects whether the part of speech of the first reference word is known. If the updated part of speech table does not record the part of speech of the first reference word, that part of speech is unknown, and another word may be chosen as the first reference word, for example the word with serial number 2 in the similar word sequence. If the part of speech of the first reference word is still unknown, step S307 is executed. If the updated part of speech table records the part of speech of the first reference word, step S305 is executed.
In step S305, if so, it is judged whether the part of speech of the target word is consistent with the part of speech of the first reference word.
That is, if the part of speech of the first reference word is known, it is judged whether the part of speech of the target word is consistent with it. If the two are consistent, step S306 is executed; if they are not, the part of speech of the target word is determined to be suspected inaccurate.
It should be noted that, in the embodiments of the present disclosure, if the part of speech of the target word is determined to be suspected inaccurate, the part of speech of the target word can be re-determined.
In an optional embodiment, if the part of speech of the target word is determined to be suspected inaccurate, step S314 can be executed.
In step S306, if so, the part of speech of the target word is determined to be accurate.
In the embodiment of the present invention, if the part of speech of the target word is consistent with that of the first reference word, the target word has the same part of speech as the word most similar to it, so the part of speech of the target word can be preliminarily determined to be accurate.
For example, the first reference word is the word with serial number 1 in the sequence; if the part of speech of the target word is noun and the part of speech of the word with serial number 1 is also noun, the part of speech of the target word is preliminarily determined to be accurate.
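The first check in steps S304 to S306 can be sketched as follows. This is only an illustrative sketch, not the claimed implementation: the function name `first_pass_check`, the `max_candidates` parameter, and the use of a plain dictionary as the part-of-speech table are assumptions introduced here.

```python
from typing import Dict, List, Optional


def first_pass_check(target_pos: str,
                     similar_sequence: List[str],
                     pos_table: Dict[str, str],
                     max_candidates: int = 2) -> Optional[bool]:
    """Compare the target word with the first reference word of known part of speech.

    similar_sequence is assumed to be sorted by descending similarity.
    Returns True if the parts of speech agree (S306), False if they differ
    (suspected inaccurate), and None if no candidate has a known part of
    speech, in which case the caller falls through to step S307.
    """
    for word in similar_sequence[:max_candidates]:
        ref_pos = pos_table.get(word)
        if ref_pos is None:
            continue  # part of speech unknown: try the next candidate
        return ref_pos == target_pos
    return None
```

Here `max_candidates=2` mirrors the example in the text, where the word with serial number 2 is tried when the part of speech of the word with serial number 1 is unknown.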
In step S307, the words ranked before a second set serial number in the similar word sequence are obtained as second reference words.
Specifically, the second set serial number may be set according to the actual situation. Preferably, if the similar word sequence contains K words in total, the second set serial number may be set to K/2 or K. If the second set serial number is K/2, the second reference words are the words in the first half of the similar word sequence.
In step S308, the number of words among the second reference words whose part of speech is the same as that of the target word is counted.
In the embodiment of the present invention, the number of second reference words whose part of speech is consistent with that of the target word is counted. If the proportion of this number in the total number of second reference words is greater than a preset proportion threshold, step S309 is executed; if the proportion is not greater than the preset proportion threshold, the part of speech of the target word is determined to be suspected inaccurate.
It should be noted that, in the embodiments of the present disclosure, if the part of speech of the target word is determined to be suspected inaccurate, the part of speech of the target word may be redetermined.
In an optional embodiment, if the part of speech of the target word is determined to be suspected inaccurate, step S314 may be executed.
In step S309, if the proportion of the number in the total number of second reference words is greater than the preset proportion threshold, the part of speech of the target word is determined to be accurate.
In the embodiment of the present invention, the preset proportion threshold may be set according to the actual situation, for example to 0.5; that is, if more than half of the second reference words have the same part of speech as the target word, the part of speech of the target word is determined to be accurate.
For example, if the part of speech of the target word is noun and words accounting for a proportion of 0.7 of the second reference words are all nouns, then, on the principle that a target word is likely to share its part of speech with the words most similar to it, the part of speech of the target word is determined to be accurate.
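The proportion check of steps S307 to S309 can be sketched as below. The sketch assumes that "ranked before the second set serial number" means the first `second_serial` words of the sequence, and that words with an unknown part of speech still count toward the total; both are assumptions, as are the names `ratio_check` and `ratio_threshold`.

```python
from typing import Dict, List


def ratio_check(target_pos: str,
                similar_sequence: List[str],
                pos_table: Dict[str, str],
                second_serial: int,
                ratio_threshold: float = 0.5) -> bool:
    """Return True if the share of second reference words that have the same
    part of speech as the target word exceeds the preset proportion threshold."""
    second_refs = similar_sequence[:second_serial]
    if not second_refs:
        return False  # nothing to compare against
    same = sum(1 for w in second_refs if pos_table.get(w) == target_pos)
    return same / len(second_refs) > ratio_threshold
```

With K = 8 similar words, calling `ratio_check(..., second_serial=4)` corresponds to the K/2 setting mentioned above.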
To further ensure the accuracy of the determination result, steps S310-S312 may additionally be executed:
In step S310, the target words whose part of speech is accurate are taken as a first accurate word set;
In the embodiment of the present invention, after the parts of speech of the target words have been judged, the words whose part of speech was determined to be accurate are collected to obtain the first accurate word set. The part-of-speech accuracy of the words in the first accurate word set can subsequently be judged again.
In step S311, the word with a third set serial number in the similar word sequence is obtained as a third reference word.
In the embodiment of the present invention, the third set serial number may also be preset according to the actual situation. Optionally, the third set serial number may be 1 or 2; correspondingly, the third reference word is the word ranked 1st or 2nd in the similar word sequence.
In step S312, if the third reference word belongs to the first accurate word set, the part of speech of the target word is the same as that of the third reference word, and the similarity between the target word and the third reference word is greater than a second similarity threshold, the part of speech of the target word is finally determined to be accurate; wherein the second similarity threshold is greater than the first similarity threshold.
In the embodiment of the present invention, if the third reference word belongs to the first accurate word set, the third reference word is a more reliable reference for judging part-of-speech accuracy. When the part of speech of the target word is the same as that of the third reference word, the similarity between the two can be examined to further ensure the accuracy of the judgment: if the similarity is greater than the second similarity threshold th2, the part of speech of the target word is again determined to be accurate. Likewise, to ensure the accuracy of this second determination, the second similarity threshold th2 may be set greater than the first similarity threshold th1.
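The second determination in steps S310 to S312 amounts to three conditions that must hold together. A minimal sketch follows; the function name `final_check` and its argument layout are assumptions, and the default `th2=0.7` is simply the value quoted in the implementation example of this document.

```python
from typing import Set


def final_check(target_pos: str,
                third_ref: str,
                third_ref_pos: str,
                similarity: float,
                first_accurate_set: Set[str],
                th2: float = 0.7) -> bool:
    """Return True only if the third reference word is itself in the first
    accurate word set, shares the target word's part of speech, and is more
    similar to the target word than the second similarity threshold th2."""
    return (third_ref in first_accurate_set
            and third_ref_pos == target_pos
            and similarity > th2)
```

Target words that pass this check would form the second accurate word set used in the later steps.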
In the embodiment of the present invention, the two determinations above screen out the target words whose part of speech is judged to be accurate. Target words whose part of speech was determined wrongly can subsequently be screened out so that their part of speech can be corrected.
Optionally, after the part of speech of the target word is finally determined to be accurate, the target words with a wrong part of speech may also be determined among the target words, through the following steps S313-S316:
In step S313, the target words whose part of speech is finally determined to be accurate are taken as a second accurate word set.
In the embodiment of the present invention, the target words judged to have an accurate part of speech in both determinations are collected to obtain the second accurate word set.
In step S314, the target words determined to be suspected inaccurate in part of speech are obtained, yielding a suspected-error word set.
Correspondingly, after the accuracy of the parts of speech of the target words has been judged in the steps above, some target words will have been judged suspected inaccurate; these words are collected to obtain the suspected-error word set.
In step S315, the similar word set corresponding to each suspected-error word in the suspected-error word set is obtained, yielding a suspected-error similar word set.
Specifically, for each word in the suspected-error word set, its corresponding similar word set is obtained; the similar words of all suspected-error words are combined and named the suspected-error similar word set.
If the suspected-error similar word set and the second accurate word set contain no identical word, or the part of speech of the identical word is the same as the part of speech of the target word, the part of speech of the target word is not determined to be wrong.
In step S316, if the suspected-error similar word set and the second accurate word set contain an identical word, and the part of speech of the identical word differs from the part of speech of the target word, the part of speech of the target word is determined to be wrong.
In the embodiment of the present invention, if the suspected-error similar word set contains a word that also belongs to the second accurate word set, that word may be called an intersection word. Because the intersection word belongs to the second accurate word set, its part of speech is accurate. If the similarity between the target word and the intersection word is high, i.e. higher than the first similarity threshold th1, the target word is very likely to have the same part of speech as the intersection word. In that case, if the part of speech of the target word is inconsistent with that of the intersection word, the part of speech of the target word is wrong.
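The intersection test of steps S313 to S316 can be sketched per target word as follows. The sketch assumes that the second accurate word set is stored as a mapping from word to part of speech and that the similar words passed in have already been filtered by the first similarity threshold th1; the name `has_wrong_pos` is hypothetical.

```python
from typing import Dict, Iterable


def has_wrong_pos(target_pos: str,
                  similar_words: Iterable[str],
                  second_accurate: Dict[str, str]) -> bool:
    """Return True if some similar word of a suspected-error target word is an
    intersection word (it also belongs to the second accurate word set) whose
    part of speech differs from the target word's part of speech."""
    for word in similar_words:
        accurate_pos = second_accurate.get(word)
        if accurate_pos is not None and accurate_pos != target_pos:
            return True
    return False  # no disagreeing intersection word was found
```

Only words for which this returns True would be passed on to the correction steps that follow.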
Optionally, after the part of speech of a target word is determined to be wrong, the part of speech of that target word is corrected. The correction method may comprise the following steps S317-S319.
In step S317, the similar words corresponding to the target word with the wrong part of speech are obtained as an error-correction word set, and the number of occurrences of each part of speech in the error-correction word set is counted.
In the embodiment of the present invention, the similar words corresponding to the word with the wrong part of speech are obtained as the error-correction word set, the part of speech of each word in the set is collected, and the number of times each part of speech occurs is calculated.
In step S318, if the error-correction word set contains a part of speech that occurs 2 or more times, the part of speech of the target word with the wrong part of speech is replaced with the most frequent part of speech.
In the embodiment of the present invention, if the error-correction word set contains a part of speech that occurs 2 or more times, parts of speech cluster within the set; the more often a part of speech occurs, the more likely the target word is to have that part of speech. The most frequent part of speech in the error-correction word set is therefore used to replace the part of speech of the target word with the wrong part of speech.
In step S319, if the error-correction word set contains no part of speech that occurs 2 or more times, the part of speech of the target word with the wrong part of speech is replaced with the part of speech of the word in the error-correction word set that has the highest similarity to that target word.
In the embodiment of the present invention, if the error-correction word set contains no part of speech that occurs 2 or more times, the words in the set all have different parts of speech, so the part of speech of the target word cannot be determined from the frequencies of the parts of speech in the set. In this case, the word with the highest similarity to the target word may be chosen from the error-correction word set, and its part of speech used to replace that of the target word, completing the correction of the target word's part of speech.
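The correction rule of steps S317 to S319 can be sketched as below. This is a sketch under stated assumptions: the name `corrected_pos` is hypothetical, similar words without a recorded part of speech are skipped, and ties between equally frequent parts of speech are broken in favour of the one encountered first, a choice the text does not specify.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def corrected_pos(similar: List[Tuple[str, float]],
                  pos_table: Dict[str, str]) -> Optional[str]:
    """Pick a replacement part of speech from the error-correction word set.

    similar holds (word, similarity) pairs for the target word whose part of
    speech was found to be wrong.
    """
    known = [(w, s) for w, s in similar if w in pos_table]
    if not known:
        return None  # nothing to correct with
    counts = Counter(pos_table[w] for w, _ in known)
    pos, freq = counts.most_common(1)[0]
    if freq >= 2:
        return pos  # S318: a part of speech occurs at least twice
    # S319: every part of speech occurs once, so use the most similar word
    best_word, _ = max(known, key=lambda pair: pair[1])
    return pos_table[best_word]
```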
In conclusion in embodiments of the present invention, determined twice by the part of speech accuracy to the target word,
Finally determine the target word of the doubtful inaccuracy of part of speech in the target word, and when the part of speech for determining target word is doubtful
After inaccuracy, the part of speech of the target word is corrected.The above method can accurately distinguish out the mesh that part of speech is determined mistake
Mark word, and can the target word words and phrases accurately to part of speech mistake correct so that implementing through the invention
The part of speech that the method that part of speech is determined in example determines has obtained effective inspection, and accurately corrects for the part of speech of mistake.
Table 1 is a schematic table, shown according to an exemplary embodiment, of the parts of speech determined for six target words using the scheme of the present invention.
Table 1
In a specific implementation, a new-word discovery method based on improved mutual information and adjacency entropy was applied to 80 million video comments from a target website, finding 1.5 million new words in total. The 1.5 million new words were added to a new-word dictionary and used as the target words whose parts of speech need to be determined. The new-word dictionary was added to the original 350,000-word dictionary of the jieba word segmentation tool, and the 80 million video comments were segmented with jieba using the new dictionary to obtain a word set. The word vector of each word in the word set was then queried from the preset word vector data table, which is obtained by training a skip-gram word vector model. For word vector model training, the one-hot vector dimension was set to 200, the skip-gram window parameter to 2, and the initial learning rate to 0.025, and the model was trained with negative sampling to obtain the word vectors of the training words. Taking the parts of speech of the 350,000 words in the original jieba dictionary as reference, the parts of speech of the target words were calculated with the number of words K in the similar word sequence set to 8 and the first similarity threshold th1 set to 0.5, yielding the parts of speech of the 1.5 million target words. Suspected-error part-of-speech screening and correction were then carried out with the second similarity threshold th2 set to 0.7, yielding corrected parts of speech for the suspected-error target words and hence the final parts of speech of the 1.5 million target words. From these 1.5 million target words, 20,000 words were randomly selected and their parts of speech judged manually; the results show that the accuracy of the parts of speech determined by the method of the embodiment of the present invention is higher than 90%.
Table 1 lists the similar word sequences and corresponding similarities of six target words, "people sets", "beating call", "net is red", "old iron", "quick worker", and "not having defect", together with the part of speech finally determined for each target word. The results show that the method of the embodiment of the present invention achieves accurate part-of-speech calculation.
In Table 1, "target word" is the target word whose part of speech is to be determined, "similar word" is a similar word corresponding to the target word, "reference" is the first reference word used to judge part-of-speech accuracy, "similarity" is the similarity between the target word and the first reference word, and "part of speech" is the finally determined part of speech. The finally determined parts of speech show that the method of the embodiment of the present invention determines the parts of speech of the six target words with high accuracy.
Fig. 5 is a block diagram of a first apparatus for determining the part of speech of a word, shown according to an exemplary embodiment. Referring to Fig. 5, the apparatus includes a target word obtaining module 501, a similarity determining module 502, and a part-of-speech determining module 503.
The target word obtaining module 501 is configured to obtain the word set corresponding to a target text and to obtain, from the word set, the target word whose part of speech needs to be determined;
The similarity determining module 502 is configured to determine, according to the word vector corresponding to each word in the word set, the similarity between the target word and the other words in the word set, wherein the other words include words of known part of speech;
The part-of-speech determining module 503 is configured to determine the part of speech of the target word based on the similarity between the target word and the words of known part of speech among the other words.
In the apparatus for determining the part of speech of a word provided by the embodiment of the present invention, the word set corresponding to the target text is obtained, the target word whose part of speech needs to be determined is obtained from the word set, the similarity between the target word and the other words in the word set is determined according to the word vector corresponding to each word in the word set, and the part of speech of the target word is finally determined based on the similarity between the target word and the words of known part of speech among the other words. The method relies on the principle that words with the same or similar context have higher similarity, and that the higher the similarity between words, the more likely they are to share a part of speech. The part of speech of the target word is thus calculated automatically from its similarity to words of known part of speech, which is more efficient and more accurate than obtaining parts of speech by manual annotation.
Fig. 6 is a block diagram of a second apparatus for determining the part of speech of a word, shown according to an exemplary embodiment. Referring to Fig. 6, the apparatus 600 includes a target word obtaining module 601, a similarity determining module 602, and a part-of-speech determining module 603.
The target word obtaining module 601 is configured to obtain the word set corresponding to a target text and to obtain, from the word set, the target word whose part of speech needs to be determined;
The similarity determining module 602 is configured to determine, according to the word vector corresponding to each word in the word set, the similarity between the target word and the other words in the word set, wherein the other words include words of known part of speech;
The part-of-speech determining module 603 is configured to determine the part of speech of the target word based on the similarity between the target word and the words of known part of speech among the other words.
Optionally, the part-of-speech determining module 603 includes:
A similar word selecting submodule 6031, configured to select, from the other words, the words whose similarity to the target word is greater than the first similarity threshold as the similar words of the target word;
A part-of-speech detecting submodule 6032, configured to detect whether the part-of-speech table records the part of speech of each of the similar words;
A first similar word obtaining submodule 6033, configured to, if the part-of-speech table records the parts of speech of only some of the similar words, obtain, from the similar words recorded in the part-of-speech table, the similar word of known part of speech with the highest similarity to the target word;
A first part-of-speech determining submodule 6034, configured to determine the part of speech of the similar word of known part of speech with the highest similarity to the target word as the part of speech of the target word.
Optionally, the part-of-speech determining module 603 further includes:
A part-of-speech counting submodule 6035, configured to, if the part-of-speech table records the parts of speech of all of the similar words, determine the most frequent part of speech among the parts of speech of the similar words;
A second part-of-speech determining submodule 6036, configured to determine the most frequent part of speech as the part of speech of the target word.
Optionally, the first similar word obtaining submodule 6033 includes:
A first similar word obtaining unit, configured to detect, in descending order of similarity to the target word, whether the part of speech of each similar word is recorded in the part-of-speech table, until a similar word recorded in the part-of-speech table is found, and to take that similar word as the similar word of known part of speech with the highest similarity to the target word.
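The behaviour of submodules 6031 to 6036 can be sketched in a few lines. This is an illustrative sketch only: the function name `determine_pos`, the dictionary inputs, and the tie-breaking rule for equally frequent parts of speech (the one seen first among the more similar words wins) are assumptions, and the default `th1=0.5` is the value quoted in the implementation example of this document.

```python
from collections import Counter
from typing import Dict, Optional


def determine_pos(similarities: Dict[str, float],
                  pos_table: Dict[str, str],
                  th1: float = 0.5) -> Optional[str]:
    """Determine a target word's part of speech from its similar words."""
    # 6031: keep words above the first similarity threshold, most similar first
    similar = sorted(((w, s) for w, s in similarities.items() if s > th1),
                     key=lambda pair: pair[1], reverse=True)
    if not similar:
        return None
    # 6035/6036: all similar words are in the table, so take the most frequent
    if all(w in pos_table for w, _ in similar):
        return Counter(pos_table[w] for w, _ in similar).most_common(1)[0][0]
    # 6033/6034: only some are known, so take the most similar known word
    for word, _ in similar:
        if word in pos_table:
            return pos_table[word]
    return None  # no similar word has a known part of speech
```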
Optionally, the apparatus 600 further includes:
A part-of-speech table updating module 604, configured to add the target word and the part of speech of the target word to the part-of-speech table to obtain an updated part-of-speech table, the updated part-of-speech table being used to determine the part of speech of the next target word.
Optionally, the similarity determining module 602 includes:
A word vector obtaining submodule 6021, configured to obtain, from the preset word vector data table, the word vector corresponding to each word in the word set;
An inner product calculating submodule 6022, configured to calculate the inner product between the word vector of the target word and the word vector of each of the other words in the word set;
A product calculating submodule 6023, configured to calculate the product of the 2-norm of the word vector of the target word and the 2-norm of the word vector of each of the other words;
A similarity determining submodule 6024, configured to determine the similarity between the target word and the other words according to the ratio of the inner product to the product.
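The ratio computed by submodules 6022 to 6024 is the cosine similarity of the two word vectors. A minimal standard-library sketch follows; the function name and the handling of zero vectors (returning 0.0) are assumptions not stated in the text.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product of u and v divided by the product of their 2-norms."""
    inner = sum(a * b for a, b in zip(u, v))              # submodule 6022
    norm_product = (math.sqrt(sum(a * a for a in u))
                    * math.sqrt(sum(b * b for b in v)))   # submodule 6023
    if norm_product == 0:
        return 0.0  # assumed convention for an all-zero vector
    return inner / norm_product                           # submodule 6024
```

In practice the word vectors would come from the preset word vector data table, and a vectorized library would normally replace the explicit loops.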
Optionally, the word vector obtaining submodule 6021 includes:
A training word obtaining unit, configured to obtain the training word set corresponding to each training text;
A target vector establishing unit, configured to establish the target vector corresponding to each word according to the number of occurrences of each word in the training word set;
A window word determining unit, configured to determine, in the training word set, the window words corresponding to each word according to a pre-selected window parameter;
A combining unit, configured to combine each word in the training word set with the window words corresponding to that word;
A training unit, configured to train a target model with the target vector of the word in each combination as the input of the target model and the target vector of the window word in the combination as the desired output of the target model, and to take the vector output by the hidden layer of the target model as the word vector;
An adding unit, configured to add the words obtained by training on each training text, together with their corresponding word vectors, to the preset word vector data table.
Optionally, the target vector establishing unit includes:
An index number setting subunit, configured to count the number of occurrences of each word in the training word set and to set the index number of each word in the training word set in descending order of the number of occurrences;
A target vector establishing subunit, configured to establish the target vector corresponding to each word according to the index number.
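The data preparation performed by these units can be sketched as follows. The sketch assumes that the target vector is a one-hot vector indexed by frequency rank (the implementation example mentions one-hot vectors) and that ties in frequency are broken alphabetically; the function names are hypothetical, and model training itself is omitted.

```python
from collections import Counter
from typing import Dict, List, Tuple


def build_index(tokens: List[str]) -> Dict[str, int]:
    """Assign index numbers in descending order of occurrence count."""
    counts = Counter(tokens)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))  # ties: alphabetical
    return {word: i for i, word in enumerate(ordered)}


def one_hot(index: int, size: int) -> List[int]:
    """Build the target vector for a word from its index number."""
    vec = [0] * size
    vec[index] = 1
    return vec


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Combine each word with the window words within `window` positions of it."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```

Each `(word, window word)` pair would then be fed to the model as an input vector and a desired output vector, with the hidden-layer vector kept as the word vector.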
In summary, the apparatus 600 for determining the part of speech of a word provided by the embodiment of the present invention, in addition to having the beneficial effects of the apparatus 500 for determining the part of speech of a word shown in Fig. 5, uses the word2vec word vector learning method to determine the word vector of each word in the word set, calculates the similarity between the target word and the other words in the word set from the word vectors, and thereby obtains the similar words of the target word. Depending on whether the parts of speech of the similar words are known or unknown, different methods are used to determine the part of speech of the target word. The principle is simple, the method is highly operable, and the part of speech of the target word is determined quickly. In addition, the first similarity threshold introduced in the method ensures that the words relied on when determining the part of speech of the target word are highly similar to the target word, which ensures the accuracy of the result.
Fig. 7 is a block diagram of a third apparatus for determining the part of speech of a word, shown according to an exemplary embodiment. Referring to Fig. 7, the apparatus 700 includes a similar word sequence obtaining module 701, a first reference word obtaining module 702, a detecting module 703, a judging module 704, and a first part-of-speech accuracy determining module 705.
The similar word sequence obtaining module 701 is configured to arrange the similar words of the target word in descending order of similarity to the target word to obtain the similar word sequence of the target word;
The first reference word obtaining module 702 is configured to obtain the word with the first set serial number in the similar word sequence as the first reference word;
The detecting module 703 is configured to detect whether the part of speech of the first reference word is recorded in the part-of-speech table;
The judging module 704 is configured to, if so, judge whether the part of speech of the target word is consistent with the part of speech of the first reference word;
The first part-of-speech accuracy determining module 705 is configured to, if so, determine that the part of speech of the target word is accurate.
Optionally, the apparatus 700 further includes:
A second reference word obtaining module 706, configured to obtain the words ranked before the second set serial number in the similar word sequence as second reference words;
A same-part-of-speech word counting module 707, configured to count the number of words among the second reference words whose part of speech is the same as that of the target word;
A second part-of-speech accuracy determining module 708, configured to, if the proportion of the number in the total number of second reference words is greater than the preset proportion threshold, determine that the part of speech of the target word is accurate.
Optionally, the apparatus 700 further includes:
A first accurate word set determining module 709, configured to take the target words whose part of speech is accurate as the first accurate word set;
A third reference word obtaining module 710, configured to obtain the word with the third set serial number in the similar word sequence as the third reference word;
A final part-of-speech accuracy determining module 711, configured to, if the third reference word belongs to the first accurate word set, the part of speech of the target word is the same as that of the third reference word, and the similarity between the target word and the third reference word is greater than the second similarity threshold, finally determine that the part of speech of the target word is accurate; wherein the second similarity threshold is greater than the first similarity threshold.
Optionally, the apparatus 700 further includes:
A second accurate word set determining module 712, configured to take the target words whose part of speech is finally determined to be accurate as the second accurate word set;
A suspected-error word set obtaining module 713, configured to obtain the target words determined to be suspected inaccurate in part of speech, yielding the suspected-error word set;
A suspected-error similar word set obtaining module 714, configured to obtain the similar word set corresponding to each suspected-error word in the suspected-error word set, yielding the suspected-error similar word set;
A part-of-speech error determining module 715, configured to, if the suspected-error similar word set and the second accurate word set contain an identical word and the part of speech of the identical word differs from the part of speech of the target word, determine that the part of speech of the target word is wrong.
Optionally, the apparatus 700 further includes:
An error-correction word set obtaining module 716, configured to obtain the similar words corresponding to the target word with the wrong part of speech as the error-correction word set, and to count the number of occurrences of each part of speech in the error-correction word set;
A first replacing module 717, configured to, if the error-correction word set contains a part of speech that occurs 2 or more times, replace the part of speech of the target word with the wrong part of speech with the most frequent part of speech;
A second replacing module 718, configured to, if the error-correction word set contains no part of speech that occurs 2 or more times, replace the part of speech of the target word with the wrong part of speech with the part of speech of the word in the error-correction word set that has the highest similarity to that target word.
In summary, the apparatus 700 for determining the part of speech of a word provided by the embodiment of the present invention judges the part-of-speech accuracy of the target words twice to finally determine which target words are suspected inaccurate in part of speech, and once a target word is judged suspected inaccurate, corrects its part of speech. The apparatus can accurately identify the target words whose part of speech was determined wrongly and accurately correct them, so that the parts of speech determined by the method of the embodiment of the present invention are effectively checked and wrong parts of speech are accurately corrected.
With regard to the apparatuses in the above embodiments, the specific manner in which each module performs its operations has been described in detail in the embodiments of the related method and will not be elaborated here.
Fig. 8 is a block diagram of an electronic device 800 for determining the part of speech of a word, shown according to an exemplary embodiment. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or the like.
Referring to Fig. 8, the electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power supply component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 typically controls the overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communication, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions so as to perform all or part of the steps of the methods described above. In addition, the processing component 802 may include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operation of the equipment 800. Examples of such data include instructions for any application or method operated on the electronic equipment 800, contact data, phonebook data, messages, pictures, videos, and so on. The memory 804 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The power supply component 806 provides power to the various components of the electronic equipment 800. The power supply component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic equipment 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic equipment 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or slide action but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the equipment 800 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or may have focusing and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC), which is configured to receive external audio signals when the electronic equipment 800 is in an operating mode such as a call mode, a recording mode, or a voice recognition mode. The received audio signal may be further stored in the memory 804 or sent via the communication component 816. In some embodiments, the audio component 810 further includes a loudspeaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to, a home button, a volume button, a start button, and a lock button.
The sensor module 814 includes one or more sensors for providing status assessments of various aspects of the electronic equipment 800. For example, the sensor module 814 can detect the open/closed state of the equipment 800 and the relative positioning of components, for example the display and keypad of the electronic equipment 800. The sensor module 814 can also detect a change in position of the electronic equipment 800 or of one of its components, the presence or absence of user contact with the electronic equipment 800, the orientation or acceleration/deceleration of the electronic equipment 800, and a change in the temperature of the electronic equipment 800. The sensor module 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor module 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor module 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic equipment 800 and other devices. The electronic equipment 800 can access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In one exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic equipment 800 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above method.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 804 including instructions, where the instructions can be executed by the processor 820 of the electronic equipment 800 to complete the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
Fig. 9 is a block diagram of electronic equipment 900 for determining the part of speech of a word, according to an exemplary embodiment. For example, the electronic equipment 900 may be provided as a server. Referring to Fig. 9, the electronic equipment 900 includes a processing component 922, which further includes one or more processors, and memory resources represented by a memory 932 for storing instructions executable by the processing component 922, such as application programs. The application programs stored in the memory 932 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 922 is configured to execute the instructions so as to perform the above method.
The electronic equipment 900 may also include a power supply component 928 configured to perform power management of the electronic equipment 900, a wired or wireless network interface 950 configured to connect the electronic equipment 900 to a network, and an input/output (I/O) interface 959. The electronic equipment 900 may operate based on an operating system stored in the memory 932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the invention will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed here. This application is intended to cover any variations, uses, or adaptations of the invention that follow the general principles of the invention and include common knowledge or conventional techniques in the art that are not disclosed herein. The description and examples are to be regarded as illustrative only, and the true scope and spirit of the invention are indicated by the following claims.
It should be understood that the present invention is not limited to the precise structure described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.
Claims (10)
1. A method for determining the part of speech of a word, characterized by comprising:
obtaining a set of words corresponding to a target text, and obtaining from the set of words a target word whose part of speech needs to be determined;
determining, according to the term vector corresponding to each word in the set of words, the similarity between the target word and the other words in the set of words, wherein the other words include words of known part of speech;
determining the part of speech of the target word based on the similarity between the target word and the words of known part of speech among the other words.
2. The method according to claim 1, characterized in that determining the part of speech of the target word based on the similarity between the target word and the words of known part of speech among the other words comprises:
selecting, from the other words, the words whose similarity to the target word is greater than a first similarity threshold, as similar words of the target word;
detecting whether the part of speech of each of the similar words is recorded in a part-of-speech table;
if the part-of-speech table records the parts of speech of only some of the similar words, obtaining, from the similar words recorded in the part-of-speech table, the similar word of known part of speech that has the highest similarity to the target word;
determining the part of speech of that similar word of known part of speech with the highest similarity to the target word as the part of speech of the target word.
3. The method according to claim 2, characterized in that, after detecting whether the part of speech of each of the similar words is recorded in the part-of-speech table, the method further comprises:
if the part-of-speech table records the parts of speech of all of the similar words, counting which part of speech occurs most frequently among the parts of speech of the similar words;
determining the most frequently occurring part of speech as the part of speech of the target word.
4. The method according to claim 2, characterized in that obtaining, from the similar words recorded in the part-of-speech table, the similar word of known part of speech that has the highest similarity to the target word comprises:
detecting, in order of similarity to the target word from high to low, whether the part of speech of each similar word is recorded in the part-of-speech table, until a similar word recorded in the part-of-speech table is found, and then taking that similar word as the similar word of known part of speech with the highest similarity to the target word.
5. The method according to claim 2 or 3, characterized in that, after determining the part of speech of the target word based on the similarity between the target word and the words of known part of speech among the other words, the method further comprises:
adding the target word and the part of speech of the target word to the part-of-speech table to obtain an updated part-of-speech table, the updated part-of-speech table being used to determine the part of speech of the next target word.
6. The method according to claim 1, characterized in that determining, according to the term vector corresponding to each word in the set of words, the similarity between the target word and the other words in the set of words comprises:
obtaining the term vector corresponding to each word in the set of words from a preset term vector data table;
calculating the inner product between the term vector of the target word and the term vector of each of the other words in the set of words;
calculating the product of the 2-norm of the term vector of the target word and the 2-norm of the term vector of each of the other words;
determining the similarity between the target word and the other words according to the ratio of the inner product to the product.
7. The method according to claim 6, characterized in that the preset term vector data table is obtained as follows:
obtaining a training set of words corresponding to each training text;
establishing a target vector corresponding to each word according to the number of occurrences of each word in the training set of words;
determining, in the training set of words, the window words corresponding to each word according to a window parameter chosen in advance;
combining each word in the training set of words with the window words corresponding to that word;
training a target model by using, for each combination, the target vector of the word in the combination as the input of the target model and the target vector of the window word in the combination as the desired output of the target model, and taking the vector output by the hidden layer of the target model as the term vector;
adding the words obtained by training on each training text, together with their corresponding term vectors, to the preset term vector data table.
8. An apparatus for determining the part of speech of a word, characterized by comprising:
a target word acquisition module, configured to obtain a set of words corresponding to a target text and to obtain from the set of words a target word whose part of speech needs to be determined;
a similarity determining module, configured to determine, according to the term vector corresponding to each word in the set of words, the similarity between the target word and the other words in the set of words, wherein the other words include words of known part of speech;
a part-of-speech determining module, configured to determine the part of speech of the target word based on the similarity between the target word and the words of known part of speech among the other words.
9. Electronic equipment, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to perform the method for determining the part of speech of a word according to any one of claims 1-7.
10. An application program/computer program product, wherein, when instructions in the storage medium are executed by a processor of a mobile terminal, the mobile terminal is enabled to perform the method for determining the part of speech of a word according to any one of claims 1-7.
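As an illustration of the steps recited in claims 1 to 6, the following is a minimal Python sketch and not the patented implementation. The function names `cosine_similarity` and `determine_pos`, the dictionary-based `vectors` and `pos_table`, the default threshold of 0.5, the handling of a zero vector, and the `None` return when no similar word has a recorded part of speech are all assumptions made for illustration; the claims do not specify them.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Ratio of the inner product to the product of the 2-norms (claim 6)."""
    inner = sum(a * b for a, b in zip(u, v))
    norm_product = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Assumption: a zero vector is treated as having similarity 0.
    return inner / norm_product if norm_product else 0.0


def determine_pos(target, vectors, pos_table, threshold=0.5):
    """Pick a part of speech for `target` from its similar words.

    vectors:   hypothetical dict mapping each word to its term vector
    pos_table: hypothetical dict mapping known words to a part of speech
    threshold: stands in for the "first similarity threshold" of claim 2
    """
    target_vec = vectors[target]
    scored = [
        (cosine_similarity(target_vec, vec), word)
        for word, vec in vectors.items()
        if word != target
    ]
    # Similar words: similarity above the threshold, most similar first.
    similar = sorted(
        (pair for pair in scored if pair[0] > threshold),
        key=lambda pair: pair[0],
        reverse=True,
    )
    known = [(score, word) for score, word in similar if word in pos_table]
    if not known:
        # Assumption: the claims do not cover this case, so report "unknown".
        return None
    if len(known) == len(similar):
        # Claim 3: every similar word is recorded, take the most frequent tag.
        counts = Counter(pos_table[word] for _, word in known)
        pos = counts.most_common(1)[0][0]
    else:
        # Claims 2 and 4: only some are recorded, take the most similar one.
        pos = pos_table[known[0][1]]
    # Claim 5: record the result so later target words can use it.
    pos_table[target] = pos
    return pos
```

Calling `determine_pos` repeatedly with the same `pos_table` lets each newly tagged word contribute to the next lookup, which mirrors the table update described in claim 5.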
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910464521.5A CN110377899A (en) | 2019-05-30 | 2019-05-30 | A kind of method, apparatus and electronic equipment of determining word part of speech |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910464521.5A CN110377899A (en) | 2019-05-30 | 2019-05-30 | A kind of method, apparatus and electronic equipment of determining word part of speech |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110377899A true CN110377899A (en) | 2019-10-25 |
Family
ID=68248850
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910464521.5A Pending CN110377899A (en) | 2019-05-30 | 2019-05-30 | A kind of method, apparatus and electronic equipment of determining word part of speech |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110377899A (en) |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101295295A (en) * | 2008-06-13 | 2008-10-29 | 中国科学院计算技术研究所 | Chinese language lexical analysis method based on linear model |
JP2010250814A (en) * | 2009-04-14 | 2010-11-04 | Nec (China) Co Ltd | Part-of-speech tagging system, training device and method of part-of-speech tagging model |
CN106095754A (en) * | 2016-06-08 | 2016-11-09 | 广州同构医疗科技有限公司 | A kind of medical terminology dictionary part-of-speech tagging method |
CN107291693A (en) * | 2017-06-15 | 2017-10-24 | 广州赫炎大数据科技有限公司 | A kind of semantic computation method for improving term vector model |
CN108170674A (en) * | 2017-12-27 | 2018-06-15 | 东软集团股份有限公司 | Part-of-speech tagging method and apparatus, program product and storage medium |
CN109344406A (en) * | 2018-09-30 | 2019-02-15 | 阿里巴巴集团控股有限公司 | Part-of-speech tagging method, apparatus and electronic equipment |
CN109388801A (en) * | 2018-09-30 | 2019-02-26 | 阿里巴巴集团控股有限公司 | The determination method, apparatus and electronic equipment of similar set of words |
CN109472008A (en) * | 2018-11-20 | 2019-03-15 | 武汉斗鱼网络科技有限公司 | A kind of Text similarity computing method, apparatus and electronic equipment |
CN109710921A (en) * | 2018-12-06 | 2019-05-03 | 深圳市中农易讯信息技术有限公司 | Calculation method, device, computer equipment and the storage medium of Words similarity |
Non-Patent Citations (1)
Title |
---|
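孟禹光 (Meng Yuguang) et al.: "引入词性标记的基于语境相似度的词义消歧" (Word sense disambiguation based on contextual similarity with part-of-speech tags), 《中文信息学报》 (Journal of Chinese Information Processing) * |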
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110852077A (en) * | 2019-11-13 | 2020-02-28 | 泰康保险集团股份有限公司 | Method, device, medium and electronic equipment for dynamically adjusting Word2Vec model dictionary |
CN110852077B (en) * | 2019-11-13 | 2023-03-31 | 泰康保险集团股份有限公司 | Method, device, medium and electronic equipment for dynamically adjusting Word2Vec model dictionary |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20191025 |