CN110377899A

CN110377899A - Method, apparatus and electronic device for determining the part of speech of a word

Info

Publication number: CN110377899A
Application number: CN201910464521.5A
Authority: CN
Inventors: 刘春�
Original assignee: Beijing Dajia Internet Information Technology Co Ltd
Current assignee: Beijing Dajia Internet Information Technology Co Ltd
Priority date: 2019-05-30
Filing date: 2019-05-30
Publication date: 2019-10-25

Abstract

The disclosure relates to a method, an apparatus, an electronic device and a computer-readable storage medium for determining the part of speech of a word. The method includes: obtaining a word set corresponding to a target text, and obtaining from the word set a target word whose part of speech needs to be determined; determining, according to the word vector corresponding to each word in the word set, the similarity between the target word and the other words in the word set; and determining the part of speech of the target word based on the similarity between the target word and those of the other words whose part of speech is known. The method relies on the principle that words with the same or similar contexts have higher similarity, and that the higher the similarity between two words, the more likely they are to share a part of speech. The part of speech of the target word is therefore computed automatically from its similarity to words of known part of speech, which is more efficient and more accurate than obtaining parts of speech by manual annotation.

Description

Method, apparatus and electronic device for determining the part of speech of a word

Technical field

The present disclosure relates to the field of network communication, and in particular to a method, an apparatus, an electronic device and a computer-readable storage medium for determining the part of speech of a word.

Background

The rapid development of new social media platforms has further lowered the threshold for communication between people. As a result, large numbers of new internet words keep emerging in user comments, and correctly identifying these new words and their parts of speech is of great significance for processing user comments.

In the related art, the part of speech of a new word is still determined by manual screening and annotation; there is no method for obtaining the part of speech of a new word automatically. New internet words are numerous and highly complex, so manual annotation is inefficient and its results are not accurate enough.

Summary of the invention

To overcome the problems in the related art, the present disclosure provides a method, an apparatus, an electronic device and a computer-readable storage medium for determining the part of speech of a word.

According to a first aspect of the embodiments of the present disclosure, a method for determining the part of speech of a word is provided, comprising:

obtaining a word set corresponding to a target text, and obtaining from the word set a target word whose part of speech needs to be determined;

determining, according to the word vector corresponding to each word in the word set, the similarity between the target word and the other words in the word set, wherein the other words include words of known part of speech;

determining the part of speech of the target word based on the similarity between the target word and the words of known part of speech among the other words.

Optionally, determining the part of speech of the target word based on the similarity between the target word and the words of known part of speech among the other words comprises:

selecting, from the other words, the words whose similarity to the target word is greater than a first similarity threshold, as the similar words of the target word;

detecting whether the part of speech of each of the similar words is recorded in a part-of-speech table;

if the part-of-speech table records the part of speech of only some of the similar words, obtaining, from the similar words recorded in the part-of-speech table, the similar word of known part of speech that has the highest similarity to the target word;

determining the part of speech of that similar word as the part of speech of the target word.

Optionally, after detecting whether the part of speech of each of the similar words is recorded in the part-of-speech table, the method further comprises:

if the part-of-speech table records the part of speech of all of the similar words, counting which part of speech occurs most frequently among the parts of speech of the similar words;

determining the most frequent part of speech as the part of speech of the target word.
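The two optional branches above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `determine_pos`, the dictionary-based part-of-speech table, and the handling of the case where no similar word is recorded (returning `None`) are all hypothetical and not specified in the text.

```python
from collections import Counter

def determine_pos(similarities, pos_table, first_threshold):
    """Pick a part of speech for a target word (hypothetical helper).

    similarities: dict mapping each other word to its similarity with the target.
    pos_table:    dict mapping words of known part of speech to that part of speech.
    """
    # Similar words: similarity greater than the first similarity threshold.
    similar = {w: s for w, s in similarities.items() if s > first_threshold}
    known = {w: s for w, s in similar.items() if w in pos_table}
    if not known:
        return None  # assumption: the text does not cover this case
    if len(known) == len(similar):
        # All similar words are recorded: take the most frequent part of speech.
        counts = Counter(pos_table[w] for w in known)
        return counts.most_common(1)[0][0]
    # Only some are recorded: take the recorded word with the highest similarity.
    best = max(known, key=known.get)
    return pos_table[best]
```

Ties between equally frequent parts of speech are resolved here by insertion order, which is an arbitrary choice made for the sketch.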

Optionally, obtaining, from the similar words recorded in the part-of-speech table, the similar word of known part of speech that has the highest similarity to the target word comprises:

checking, in descending order of similarity to the target word, whether the part of speech of each similar word is recorded in the part-of-speech table, until a similar word recorded in the part-of-speech table is found, and taking that similar word as the similar word of known part of speech with the highest similarity to the target word.

Optionally, after determining the part of speech of the target word based on the similarity between the target word and the words of known part of speech among the other words, the method further comprises:

adding the target word and its part of speech to the part-of-speech table to obtain an updated part-of-speech table, the updated part-of-speech table being used to determine the part of speech of the next target word.

Optionally, determining, according to the word vector corresponding to each word in the word set, the similarity between the target word and the other words in the word set comprises:

obtaining the word vector corresponding to each word in the word set from a preset word vector data table;

calculating the inner product of the word vector of the target word and the word vector of each other word in the word set;

calculating the product of the 2-norm of the word vector of the target word and the 2-norm of the word vector of each other word;

determining the similarity between the target word and the other word according to the ratio of the inner product to the product.
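The ratio of the inner product to the product of the 2-norms is the cosine similarity of the two word vectors. A minimal sketch in plain Python follows; the function name `cosine_similarity` and the choice to return 0.0 for a zero vector are assumptions made for illustration.

```python
import math

def cosine_similarity(u, v):
    """Ratio of the inner product of u and v to the product of their 2-norms."""
    inner = sum(a * b for a, b in zip(u, v))
    norm_product = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    if norm_product == 0.0:
        return 0.0  # assumption: a zero vector is treated as dissimilar to everything
    return inner / norm_product
```

Vectors pointing in the same direction give a value of 1, and orthogonal vectors give 0.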

Optionally, the preset word vector data table is obtained as follows:

obtaining a training word set corresponding to each training text;

establishing a target vector for each word according to the number of occurrences of each word in the training word set;

determining, in the training word set, the window words corresponding to each word according to a window parameter chosen in advance;

combining each word in the training word set with each of its corresponding window words;

for each combination, using the target vector of the word in the combination as the input of a target model and the target vector of the window word in the combination as the desired output of the target model, training the target model, and taking the vector output by the hidden layer of the target model as the word vector;

adding the words obtained by training on each training text, together with their corresponding word vectors, to the preset word vector data table.
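The step of pairing each word with its window words can be sketched as follows. The sketch assumes a symmetric window of `window` positions on each side of the word, which is the usual skip-gram convention; the text only says that a window parameter is chosen in advance, so this reading and the function name `window_pairs` are assumptions.

```python
def window_pairs(words, window):
    """Return (word, window_word) combinations for a segmented training text."""
    pairs = []
    for i, center in enumerate(words):
        lo = max(0, i - window)
        hi = min(len(words), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a word is not its own window word
                pairs.append((center, words[j]))
    return pairs
```

Each pair then supplies one training example: the first element is the model input and the second is the desired output.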

Optionally, establishing a target vector for each word according to the number of occurrences of each word in the training word set comprises:

counting the number of occurrences of each word in the training word set, and assigning an index number to each word in the training word set in descending order of the number of occurrences;

establishing the target vector corresponding to each word according to its index number.
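A possible reading of these two steps is sketched below. It assumes that the target vector is a one-hot vector whose only non-zero entry sits at the word's index number; the text does not state the vector's form, so this is an assumption, as are the function names and the alphabetical tie-break between equally frequent words.

```python
from collections import Counter

def build_index(words):
    """Assign index numbers in descending order of occurrence count."""
    counts = Counter(words)
    # Assumption: ties between equally frequent words are broken alphabetically.
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    return {w: i for i, w in enumerate(ordered)}

def one_hot(word, index):
    """Build a one-hot target vector from the word's index number (assumed form)."""
    vec = [0] * len(index)
    vec[index[word]] = 1
    return vec
```

With this layout the most frequent word receives index 0 and a vector whose first component is 1.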

Optionally, after determining the part of speech of the target word based on the similarity between the target word and the words of known part of speech among the other words, the method further comprises:

arranging the similar words of the target word in descending order of similarity to the target word, to obtain a similar-word sequence of the target word;

obtaining the word at a first set position in the similar-word sequence, as a first reference word;

detecting whether the part of speech of the first reference word is recorded in the part-of-speech table;

if so, judging whether the part of speech of the target word is consistent with the part of speech of the first reference word;

if so, determining that the part of speech of the target word is accurate.

Optionally, after arranging the similar words of the target word in descending order of similarity to the target word to obtain the similar-word sequence of the target word, the method further comprises:

obtaining the words ranked before a second set position in the similar-word sequence, as second reference words;

counting the number of second reference words whose part of speech is the same as that of the target word;

if the share of that number in the total number of second reference words is greater than a preset share threshold, determining that the part of speech of the target word is accurate.
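Both accuracy checks can be illustrated with a short sketch. The function names, the zero-based positions, and the treatment of reference words missing from the part-of-speech table (counted as not matching) are assumptions chosen for the example.

```python
def ranked_similar_words(similarities):
    """Similar words in descending order of similarity to the target word."""
    return sorted(similarities, key=similarities.get, reverse=True)

def check_by_reference_word(target_pos, ranked, pos_table, position=0):
    """First check: the word at a set position must be recorded and agree."""
    if position >= len(ranked):
        return False
    ref = ranked[position]
    return ref in pos_table and pos_table[ref] == target_pos

def check_by_share(target_pos, ranked, pos_table, top_k, min_share):
    """Second check: share of the top_k words that have the target's part of speech."""
    refs = ranked[:top_k]
    if not refs:
        return False
    same = sum(1 for w in refs if pos_table.get(w) == target_pos)
    return same / len(refs) > min_share
```

A target word tagged as a noun passes the second check with `top_k=3` and `min_share=0.5` when at least two of its three most similar words are recorded as nouns.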

Optionally, after determining that the part of speech of the target word is accurate, the method further comprises:

taking the target words whose part of speech is accurate as a first accurate word set;

obtaining the word at a third set position in the similar-word sequence, as a third reference word;

if the third reference word belongs to the first accurate word set, the part of speech of the target word is the same as that of the third reference word, and the similarity between the target word and the third reference word is greater than a second similarity threshold, finally determining that the part of speech of the target word is accurate; wherein the second similarity threshold is greater than the first similarity threshold.

Optionally, after finally determining that the part of speech of the target word is accurate, the method further comprises:

taking the target words whose part of speech is finally determined to be accurate as a second accurate word set;

obtaining the target words whose part of speech has been identified as suspected to be inaccurate, to obtain a suspected-error word set;

obtaining the similar-word set corresponding to each suspected-error word in the suspected-error word set, to obtain a set of similar words of the suspected-error words;

if the set of similar words of the suspected-error words and the second accurate word set contain an identical word, and the part of speech of that identical word differs from the part of speech of the target word, determining that the part of speech of the target word is wrong.

Optionally, after determining that the part of speech of the target word is wrong, the method further comprises:

obtaining the similar words corresponding to the target word whose part of speech is wrong as an error-correction word set, and counting the number of occurrences of each part of speech in the error-correction word set;

if the error-correction word set contains a part of speech that occurs two or more times, replacing the part of speech of the target word whose part of speech is wrong with the most frequent part of speech;

if the error-correction word set contains no part of speech that occurs two or more times, replacing the part of speech of the target word whose part of speech is wrong with the part of speech of the word in the error-correction word set that has the highest similarity to that target word.
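The replacement rule in the last three steps can be sketched as below. The function name `corrected_pos`, the dictionary inputs, and the decision to ignore similar words that have no recorded part of speech are assumptions made for the illustration.

```python
from collections import Counter

def corrected_pos(similarities, pos_table):
    """Choose a replacement part of speech from the error-correction word set.

    similarities: dict mapping each similar word to its similarity with the
                  target word whose part of speech was found to be wrong.
    """
    # Assumption: only similar words with a recorded part of speech are counted.
    known = {w: s for w, s in similarities.items() if w in pos_table}
    if not known:
        return None  # assumption: nothing to correct with
    counts = Counter(pos_table[w] for w in known)
    pos, freq = counts.most_common(1)[0]
    if freq >= 2:
        return pos  # a part of speech occurring at least twice wins
    # Otherwise fall back to the most similar word's part of speech.
    best = max(known, key=known.get)
    return pos_table[best]
```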

According to a second aspect of the embodiments of the present disclosure, an apparatus for determining the part of speech of a word is provided, comprising:

a target word obtaining module, configured to obtain a word set corresponding to a target text, and to obtain from the word set a target word whose part of speech needs to be determined;

a similarity determining module, configured to determine, according to the word vector corresponding to each word in the word set, the similarity between the target word and the other words in the word set, wherein the other words include words of known part of speech;

a part-of-speech determining module, configured to determine the part of speech of the target word based on the similarity between the target word and the words of known part of speech among the other words.

Optionally, the part-of-speech determining module comprises:

a similar word selecting module, configured to select, from the other words, the words whose similarity to the target word is greater than a first similarity threshold, as the similar words of the target word;

a part-of-speech detection module, configured to detect whether the part of speech of each of the similar words is recorded in a part-of-speech table;

a first similar word obtaining module, configured to obtain, if the part-of-speech table records the part of speech of only some of the similar words, from the similar words recorded in the part-of-speech table, the similar word of known part of speech that has the highest similarity to the target word;

a first part-of-speech determining module, configured to determine the part of speech of the similar word of known part of speech with the highest similarity to the target word as the part of speech of the target word.

Optionally, the part-of-speech determining module further comprises:

a part-of-speech statistics module, configured to count, if the part-of-speech table records the part of speech of all of the similar words, which part of speech occurs most frequently among the parts of speech of the similar words;

a second part-of-speech determining module, configured to determine the most frequent part of speech as the part of speech of the target word.

Optionally, the first similar word obtaining module comprises:

a first similar word obtaining submodule, configured to check, in descending order of similarity to the target word, whether the part of speech of each similar word is recorded in the part-of-speech table, until a similar word recorded in the part-of-speech table is found, and to take that similar word as the similar word of known part of speech with the highest similarity to the target word.

Optionally, the apparatus further comprises:

a part-of-speech table update module, configured to add the target word and its part of speech to the part-of-speech table to obtain an updated part-of-speech table, the updated part-of-speech table being used to determine the part of speech of the next target word.

Optionally, the similarity determining module comprises:

a word vector obtaining submodule, configured to obtain the word vector corresponding to each word in the word set from a preset word vector data table;

an inner product calculation submodule, configured to calculate the inner product of the word vector of the target word and the word vector of each other word in the word set;

a product calculation submodule, configured to calculate the product of the 2-norm of the word vector of the target word and the 2-norm of the word vector of each other word;

a similarity determining submodule, configured to determine the similarity between the target word and the other word according to the ratio of the inner product to the product.

Optionally, the word vector obtaining submodule comprises:

a training word obtaining unit, configured to obtain a training word set corresponding to each training text;

a target vector establishing unit, configured to establish a target vector for each word according to the number of occurrences of each word in the training word set;

a window word determining unit, configured to determine, in the training word set, the window words corresponding to each word according to a window parameter chosen in advance;

a combining unit, configured to combine each word in the training word set with each of its corresponding window words;

a training unit, configured to use, for each combination, the target vector of the word in the combination as the input of a target model and the target vector of the window word in the combination as the desired output of the target model, to train the target model, and to take the vector output by the hidden layer of the target model as the word vector;

an adding unit, configured to add the words obtained by training on each training text, together with their corresponding word vectors, to the preset word vector data table.

Optionally, the target vector establishing unit comprises:

an index number setting subunit, configured to count the number of occurrences of each word in the training word set, and to assign an index number to each word in the training word set in descending order of the number of occurrences;

a target vector establishing subunit, configured to establish the target vector corresponding to each word according to its index number.

Optionally, the apparatus further comprises:

a similar-word sequence obtaining module, configured to arrange the similar words of the target word in descending order of similarity to the target word, to obtain a similar-word sequence of the target word;

a first reference word obtaining module, configured to obtain the word at a first set position in the similar-word sequence, as a first reference word;

a detection module, configured to detect whether the part of speech of the first reference word is recorded in the part-of-speech table;

a judgment module, configured to judge, if so, whether the part of speech of the target word is consistent with the part of speech of the first reference word;

a first accuracy determining module, configured to determine, if so, that the part of speech of the target word is accurate.

Optionally, the apparatus further comprises:

a second reference word obtaining module, configured to obtain the words ranked before a second set position in the similar-word sequence, as second reference words;

a same-part-of-speech word statistics module, configured to count the number of second reference words whose part of speech is the same as that of the target word;

a second accuracy determining module, configured to determine that the part of speech of the target word is accurate if the share of that number in the total number of second reference words is greater than a preset share threshold.

Optionally, the apparatus further comprises:

a first accurate word set determining module, configured to take the target words whose part of speech is accurate as a first accurate word set;

a third reference word obtaining module, configured to obtain the word at a third set position in the similar-word sequence, as a third reference word;

a final accuracy determining module, configured to finally determine that the part of speech of the target word is accurate if the third reference word belongs to the first accurate word set, the part of speech of the target word is the same as that of the third reference word, and the similarity between the target word and the third reference word is greater than a second similarity threshold; wherein the second similarity threshold is greater than the first similarity threshold.

Optionally, the apparatus further comprises:

a second accurate word set determining module, configured to take the target words whose part of speech is finally determined to be accurate as a second accurate word set;

a suspected-error word set obtaining module, configured to obtain the target words whose part of speech has been identified as suspected to be inaccurate, to obtain a suspected-error word set;

a suspected-error similar-word set obtaining module, configured to obtain the similar-word set corresponding to each suspected-error word in the suspected-error word set, to obtain a set of similar words of the suspected-error words;

a part-of-speech error determining module, configured to determine that the part of speech of the target word is wrong if the set of similar words of the suspected-error words and the second accurate word set contain an identical word and the part of speech of that identical word differs from the part of speech of the target word.

Optionally, the apparatus further comprises:

an error-correction word set obtaining module, configured to obtain the similar words corresponding to the target word whose part of speech is wrong as an error-correction word set, and to count the number of occurrences of each part of speech in the error-correction word set;

a first replacement module, configured to replace the part of speech of the target word whose part of speech is wrong with the most frequent part of speech, if the error-correction word set contains a part of speech that occurs two or more times;

a second replacement module, configured to replace the part of speech of the target word whose part of speech is wrong with the part of speech of the word in the error-correction word set that has the highest similarity to that target word, if the error-correction word set contains no part of speech that occurs two or more times.

According to a third aspect of the embodiments of the present disclosure, an electronic device is provided, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to perform the method for determining the part of speech of a word according to the first aspect.

According to a fourth aspect of the embodiments of the present disclosure, an application program / computer program product is provided; when instructions in the storage medium are executed by a processor of a mobile terminal, the mobile terminal is enabled to perform the method for determining the part of speech of a word according to the first aspect.

The technical solutions provided by the embodiments of the present disclosure can have the following beneficial effects:

In the embodiments of the present invention, a word set corresponding to a target text is obtained, and a target word whose part of speech needs to be determined is obtained from the word set; the similarity between the target word and the other words in the word set is determined according to the word vector corresponding to each word in the word set; and the part of speech of the target word is finally determined based on the similarity between the target word and the words of known part of speech among the other words. The method relies on the principle that words with the same or similar contexts have higher similarity, and that the higher the similarity between two words, the more likely they are to share a part of speech. The part of speech of the target word is computed automatically from its similarity to words of known part of speech, which is more efficient and more accurate than obtaining parts of speech by manual annotation.

It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the present disclosure.

Brief description of the drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the invention and, together with the specification, serve to explain the principles of the invention.

Fig. 1 is a flowchart of a first method for determining the part of speech of a word, according to an exemplary embodiment;

Fig. 2 is a flowchart of a second method for determining the part of speech of a word, according to an exemplary embodiment;

Fig. 3 is a schematic diagram of word vector learning using a skip-gram word vector model, according to an exemplary embodiment;

Fig. 4 is a flowchart of a third method for determining the part of speech of a word, according to an exemplary embodiment;

Fig. 5 is a block diagram of a first apparatus for determining the part of speech of a word, according to an exemplary embodiment;

Fig. 6 is a block diagram of a second apparatus for determining the part of speech of a word, according to an exemplary embodiment;

Fig. 7 is a block diagram of a third apparatus for determining the part of speech of a word, according to an exemplary embodiment;

Fig. 8 is a block diagram of an optional electronic device, according to an exemplary embodiment;

Fig. 9 is a block diagram of another optional electronic device, according to an exemplary embodiment.

Detailed description

Exemplary embodiments are described in detail here, with examples illustrated in the accompanying drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the invention. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the invention as detailed in the appended claims.

Fig. 1 is a flowchart of a first method for determining the part of speech of a word, according to an exemplary embodiment. As shown in Fig. 1, the method is used in a terminal and includes the following steps.

In step S101, a word set corresponding to a target text is obtained, and a target word whose part of speech needs to be determined is obtained from the word set.

In the embodiments of the present invention, the target text may be text that a user has published on a network platform, such as a user comment, a status update, a friend-circle post or an article. Such text may contain new internet words, proprietary vocabulary and the like. New internet words are popular, informal internet expressions created through homophones, miswritten characters and similar means, such as "refreshing horse" and "beating call". Proprietary vocabulary consists of new words that arise together with new things and new ideas, such as "clap fast hand" and "people set". Because these new internet words and proprietary terms have only recently emerged, their parts of speech are unknown; the embodiments of the present invention take such words of undetermined part of speech as target words and determine their parts of speech. Part of speech refers to the characteristics of words that serve as the basis for classifying them. Words in modern Chinese can be divided into 14 parts of speech: noun, verb, adjective, distinguishing word, pronoun, numeral, measure word, adverb, preposition, conjunction, auxiliary word, modal particle, onomatopoeia and interjection.

Word segmentation is first performed on the target text to obtain all the words it contains, including the target words whose part of speech is to be determined. Chinese words are composed of individual characters, and many individual characters cannot be used as a word on their own or play a grammatical role by themselves; the process of splitting a continuous run of characters into meaningful words is called word segmentation. To obtain the target words of undetermined part of speech in the target text, a new-word dictionary can be built in advance that contains as many of the new words that have already appeared as possible. The word segmentation software is updated with this new-word dictionary, and the updated software is used to segment the target text. Segmentation yields a word set, which includes the target words whose part of speech needs to be determined.
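The effect of adding a new-word dictionary to a segmenter can be illustrated with a toy forward-maximum-matching segmenter. This is a stand-in technique chosen for brevity, not the segmentation software the text refers to; the function name `segment`, the `max_len` parameter and the sample words are all hypothetical.

```python
def segment(text, dictionary, max_len=4):
    """Toy forward maximum matching: prefer the longest dictionary word.

    Characters that start no dictionary word are emitted on their own.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words
```

Adding a new word to `dictionary` is enough for it to come out as a single token, which is how the target words would end up in the word set.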

It should be noted that target word can also include other kinds of word in addition to network neologisms and proprietary vocabulary, For example, changing semantic and part of speech known word, the embodiment of the present invention, which does not do the type of word that target word includes, to be had Body limits.

In step s 102, according to the corresponding term vector of word each in the set of words, the target word is determined With the similarity in the set of words between other words, wherein include the word of known part of speech in other described words.

In embodiments of the present invention, term vector is one group of Language Modeling in embedded natural language processing (NLP) and spy The general designation of learning art is levied, indicates that the word or expression from vocabulary is mapped to the vector of real number.It obtains in set of words The term vector of each word can be realized the sparse expression of the insertion expression and vocabulary high dimension vector of vocabulary.

The method for obtaining the term vector of word, which may is that, is trained the word input object module of training text, obtains To the corresponding term vector of each word.Default term vector tables of data is added in the corresponding term vector of each word, word can be preset at this The corresponding term vector of each word in the set of words is inquired in vector data table.

Generally, the lexical word vector distance with same or similar context is smaller, and the vocabulary of same or similar context Similarity in semantic and part of speech is higher, is based on the principle, target word and the word collection can be determined according to term vector Similarity in conjunction between other words.Specifically, it can be based on the distance between target word and other words, such as remaining Chordal distance, Euclidean distance, editing distance etc. calculate the similarity between word.

In addition, in order to determine the part of speech of the target word, the other words in the word set besides the target word must include words whose part of speech is known.

In step S103, the part of speech of the target word is determined based on the similarity between the target word and the words of known part of speech among the other words.

In the embodiment of the present invention, because two words with higher similarity are more likely to have the same part of speech, the part of speech of the target word can be determined by analyzing and processing the parts of speech of the other words in the word set that are similar to the target word.

In summary, in the embodiment of the present invention, the word set corresponding to the target text is obtained, and the target word whose part of speech needs to be determined is obtained from the word set; the similarity between the target word and the other words in the word set is determined according to the word vector of each word in the word set; and the part of speech of the target word is finally determined based on the similarity between the target word and the words of known part of speech among the other words. This method relies on the principle that words with the same or similar contexts are more similar, and that more similar words are more likely to share a part of speech. The part of speech of the target word is calculated automatically from its similarity to words of known part of speech, which is more efficient and more accurate than obtaining parts of speech by manual annotation.

Fig. 2 is a flow chart of a second method for determining the part of speech of a word according to an exemplary embodiment. This method is an alternative embodiment of the method in Fig. 1. As shown in Fig. 2, the method is used in a terminal and includes the following steps.

In step S201, the training word sequence set corresponding to the training text is obtained.

In the embodiment of the present invention, the training text may be multiple pieces of text information obtained from the network. Each piece of text information is segmented based on the new-word dictionary to obtain a corresponding training word sequence, and the training word sequences of all the pieces are gathered into one training word sequence set, which can be expressed as {training word sequence 1, training word sequence 2, ...}. The training word sequence set therefore also contains new words from the new-word dictionary.

In step S202, the target vector corresponding to each word is established according to the number of occurrences of each word in the training word set.

In the embodiment of the present invention, the number of occurrences of each word in the training word sequence set corresponding to the training text is counted, and the target vector of each word is determined according to its number of occurrences. A word's vector is a way of representing human language mathematically. The simplest vector form is the one-hot vector, so the target vector can be a one-hot vector.

Optionally, step S202 includes the following steps S2021 to S2022:

In step S2021, the number of occurrences of each word in the training word set is counted, and an index number is assigned to each word in the training word set in descending order of the number of occurrences.

In the embodiment of the present invention, the number of occurrences of each word in the training word sequence set corresponding to the training text is counted, the words are sorted in descending order of the number of occurrences, and the index number of each word is set according to its rank. For example, if the word "people sets" ranks first by number of occurrences in the training word sequence set, its index number is set to 1; if the word "plot" ranks second, its index number is set to 2.

In step S2022, the target vector corresponding to each word is established according to its index number.

In the embodiment of the present invention, in the sorted training word set, each word can be represented according to its index number by an M-dimensional sparse vector, namely a one-hot vector, in which only the element at the position corresponding to the word is 1 and all other elements are 0. For example, if the dimension M of the one-hot vector is 4, the one-hot vector of the word "people sets" with index number 1 is [1,0,0,0], and the one-hot vector of the word "plot" with index number 2 is [0,1,0,0].
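Steps S2021 and S2022 can be illustrated with a minimal Python sketch. The function names `build_index` and `one_hot`, the toy sequences, and the filler word "story" are assumptions made for illustration and do not come from the embodiment.

```python
from collections import Counter

def build_index(sequences):
    """Assign index numbers 1, 2, ... in descending order of occurrence count."""
    counts = Counter(word for seq in sequences for word in seq)
    # most_common() sorts by count, highest first; ties keep first-seen order.
    return {word: rank
            for rank, (word, _) in enumerate(counts.most_common(), start=1)}

def one_hot(index_number, dim):
    """Return a dim-length vector with a single 1 at the 1-based index_number."""
    vec = [0] * dim
    vec[index_number - 1] = 1
    return vec

# Toy training word sequence set (hypothetical data).
sequences = [["people sets", "plot", "people sets"],
             ["people sets", "plot", "story"]]
index = build_index(sequences)
vector_of_plot = one_hot(index["plot"], dim=4)
```

With these toy sequences, "people sets" receives index number 1 and "plot" index number 2, matching the example in the text.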

In step S203, the window words corresponding to each word in the training word set are determined according to a preselected window parameter.

In the embodiment of the present invention, the core idea of word vectors based on canonical correlation analysis is that the words within a window of specified length on either side of a given word in a passage should be associated with it. That is, the several words to the left of a word form its preceding context and the several words to its right form its following context, and the word should be related as closely as possible to this context. This leads to the concept of the window words of a word. Assuming the window parameter is 2, each of the left and right windows of a word contains 2 window words. For example, if a training word sequence in the training word sequence set is {W₁, W₂, W₃, W₄, W₅, W₆, ..., W_N}, the window words of W₄ are {W₂, W₃} and {W₅, W₆}.

In step S204, each word in the training word set is combined with its corresponding window words.

In the embodiment of the present invention, window words are selected for each word in the training word sequence according to the window parameter, and each word is combined with its window words. For example, if the selected word C is W₄ and its window words W are {W₂, W₃} and {W₅, W₆}, the resulting combination {C, W} is {W₄, {W₂, W₃}, {W₅, W₆}}.
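The selection of window words and the construction of the combinations {C, W} can be sketched as follows. The function name `window_combinations` is hypothetical, and truncating the window at the ends of the sequence is an assumption; the text does not say how boundary words are handled.

```python
def window_combinations(sequence, window=2):
    """Pair each word C with its left and right window words W."""
    combos = []
    for i, center in enumerate(sequence):
        left = sequence[max(0, i - window):i]      # up to `window` words before C
        right = sequence[i + 1:i + 1 + window]     # up to `window` words after C
        combos.append((center, left, right))
    return combos

seq = ["W1", "W2", "W3", "W4", "W5", "W6"]
combos = window_combinations(seq, window=2)
```

For the word W4 this yields the left window [W2, W3] and the right window [W5, W6], as in the example above.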

In step S205, the target vector of the word in each combination is used as the input of the target model and the target vectors of the window words in the combination are used as the desired output of the target model; the target model is trained in this way, and the vector output by the hidden layer of the target model is taken as the word vector.

In the embodiment of the present invention, for each combination {C, W}, the one-hot vector of the word C is used as the input of the target model and the one-hot vectors of the window words W are used as the desired output; the output of the target model is the probability of each window word of C, and the target model is trained accordingly. The target model is a model for computing word vectors, for example a skip-gram model trained with the word2vec word vector learning method. Skip-gram is a basic three-layer neural network model comprising an input layer, one hidden layer, and an output layer; the hidden layer has no activation function, and the activation function of the output layer is softmax regression. Its input is the one-hot representation of the current word C, its output is the probability of each window word, and its target output is the one-hot vector of the window word W. If the training word set is {W₁, W₂, ..., W_N} and the dimension of the embedded representation is M, then the input layer and the output layer of the network each have N nodes, the hidden layer has M nodes, the input-to-hidden weight matrix is W_N×M, and the i-th row V_i of W_N×M is the word vector of the word W_i.
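The forward pass of such a three-layer network can be sketched in plain Python. This is only an illustration under stated assumptions: the sizes N and M, the random initial weights, and the names `W_in`, `W_out`, and `forward` are hypothetical, and no training is performed here.

```python
import math
import random

rng = random.Random(0)
N, M = 6, 3  # toy vocabulary size and embedding dimension (assumed values)
W_in = [[rng.uniform(-1, 1) for _ in range(M)] for _ in range(N)]   # input-to-hidden
W_out = [[rng.uniform(-1, 1) for _ in range(M)] for _ in range(N)]  # hidden-to-output

def forward(word_index):
    """Return the hidden vector and the softmax output for a one-hot input."""
    one_hot = [1.0 if i == word_index else 0.0 for i in range(N)]
    # The hidden layer has no activation, so one_hot x W_in selects one row of W_in.
    hidden = [sum(one_hot[i] * W_in[i][k] for i in range(N)) for k in range(M)]
    scores = [sum(W_out[j][k] * hidden[k] for k in range(M)) for j in range(N)]
    top = max(scores)                                # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return hidden, [e / total for e in exps]

hidden, probs = forward(4)
```

Because the input is one-hot and the hidden layer has no activation, `hidden` equals row 4 of `W_in`, which is why the rows of the input-to-hidden matrix can be read off as word vectors once training has finished.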

Fig. 3 is a schematic diagram of word vector learning using a skip-gram word vector model according to an exemplary embodiment. In Fig. 3, the skip-gram word vector model includes an input layer, a hidden layer, and an output layer. The one-hot representation of the current input word C is {0,0,0,0,1,0,...,0}, and the outputs are the probability values P₁, P₂, P₃, ..., P_n of the window words of C.

During model training, if the combination of the current word and its window words is {C, W} and the network parameters are denoted θ, the training objective function f is:

where p(c|w; θ) is the conditional probability of the word C given the window word W.

If the hidden-to-output weight matrix is W′_N×M, the vector of the word C in the input-to-hidden weight matrix W_N×M is V_c, and the vector of the window word W in the hidden-to-output weight matrix W′_N×M is V_w, then the output conditional probability p(c|w; θ) is:

Taking the logarithm gives the training objective function:

Differentiating the logarithm of the objective function with respect to θ gives the update equation for the network parameters θ, and the optimization of θ can be carried out using the negative sampling method.
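The formulas themselves do not appear in this text. The following is the standard skip-gram formulation with softmax output, written with the symbols defined above (θ, V_c, V_w); it is given as an assumption about the intended form and not as the exact formulas of the original. The products and sums over (c, w) run over all training combinations, and c′ ranges over all N words of the training word set.

```latex
% Objective: maximize the probability of all observed (word, window word) pairs
f = \arg\max_{\theta} \prod_{(c,w)} p(c \mid w; \theta)

% Softmax output probability
p(c \mid w; \theta) = \frac{\exp(V_c \cdot V_w)}{\sum_{c'} \exp(V_{c'} \cdot V_w)}

% Log form of the objective
\arg\max_{\theta} \sum_{(c,w)} \log p(c \mid w; \theta)
  = \arg\max_{\theta} \sum_{(c,w)} \Bigl( V_c \cdot V_w
      - \log \sum_{c'} \exp(V_{c'} \cdot V_w) \Bigr)
```

The inner sum over c′ is the term that makes exact optimization costly for a large vocabulary, which is the usual motivation for the negative sampling method mentioned above.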

The combinations {C, W} of all words with their window words are input to train the skip-gram word vector model, which finally yields the word vector representation of each word.

In step S206, the words obtained from training on each training text and their corresponding word vectors are added to the preset word vector data table.

In the embodiment of the present invention, a preset word vector data table is obtained from the words of each training text and the word vector representations finally obtained for them by training.

Steps S201 to S206 above are the process of obtaining the preset word vector data table. In the embodiment of the present invention, this process can be carried out before the word set corresponding to the target text is obtained, but it need not be repeated before every such acquisition. The word vector data table can be updated periodically, with steps S201 to S206 carried out at each update.

In step S207, the word set corresponding to the target text is obtained, and the target word whose part of speech needs to be determined is obtained from the word set.

In the embodiment of the present invention, step S207 may refer to step S101, and details are not repeated here.

In step S208, the word vector corresponding to each word in the word set is obtained from the preset word vector data table.

In the embodiment of the present invention, the word vector corresponding to each word can be obtained from the preset word vector data table.

In step S209, the inner product between the word vector of the target word and the word vector of each other word in the word set is calculated.

In the embodiment of the present invention, let the word vector of the target word be v_i = (v_i1, v_i2, ..., v_im). Take any other word in the word set besides the target word and let its word vector be v_j = (v_j1, v_j2, ..., v_jm); the inner product of these two word vectors is (v_i, v_j).

In step S210, the product of the 2-norm of the word vector of the target word and the 2-norm of the word vector of each other word is calculated.

In the embodiment of the present invention, the product of the 2-norms of the two word vectors is ||v_i|| ||v_j||.

In step S211, the similarity between the target word and each other word is determined according to the ratio of the inner product to the product.

In the embodiment of the present invention, the similarity between word vectors can be calculated using the cosine distance method. The similarity s between the target word and another word is then the ratio determined in steps S209 to S211:

s = (v_i, v_j) / (||v_i|| ||v_j||)

where (v_i, v_j) is the inner product of v_i and v_j, and ||·|| denotes the 2-norm.
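Steps S209 to S211 amount to the cosine similarity of two vectors, sketched below. The function name `cosine_similarity` is hypothetical, and returning 0.0 when either vector has zero norm is an assumption; the text does not cover that case.

```python
import math

def cosine_similarity(v_i, v_j):
    """Ratio of the inner product to the product of the 2-norms."""
    inner = sum(a * b for a, b in zip(v_i, v_j))            # step S209
    norm_product = (math.sqrt(sum(a * a for a in v_i))
                    * math.sqrt(sum(b * b for b in v_j)))   # step S210
    if norm_product == 0:
        return 0.0  # assumed convention for zero vectors
    return inner / norm_product                             # step S211

s_same = cosine_similarity([3.0, 4.0], [3.0, 4.0])
s_orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```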

In step S212, words whose similarity to the target word is greater than a first similarity threshold are selected from the other words as the similar words of the target word.

In the embodiment of the present invention, words whose similarity to the target word is greater than the first similarity threshold are selected as similar words from the other words in the word set. The purpose of the first similarity threshold is to ensure that the similar words are sufficiently similar to the target word; a similar word whose similarity to the target word is too low has no reference value for determining the part of speech of the target word. The first similarity threshold can be preset according to the actual situation, and its specific value is not limited by the embodiment of the present invention.
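The threshold filter of step S212 can be sketched in a few lines. The function name, the similarity scores, and the threshold value 0.7 are hypothetical; the embodiment leaves the threshold value open.

```python
def select_similar_words(similarities, threshold):
    """Keep the words whose similarity to the target word exceeds the threshold."""
    return {word: s for word, s in similarities.items() if s > threshold}

# Hypothetical similarities between the target word and the other words.
similarities = {"A": 0.91, "B": 0.42, "C": 0.77}
similar = select_similar_words(similarities, threshold=0.7)
```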

In step S213, it is detected whether the part of speech of each of the similar words is recorded in the part-of-speech table.

In the embodiment of the present invention, the preset part-of-speech table is queried to check whether the part of speech of each similar word is known. If there are words of unknown part of speech, they are marked, for example by adding a marker symbol U to each word of unknown part of speech.

If the part-of-speech table records the parts of speech of only some of the similar words, step S214 is executed; if the part-of-speech table records the parts of speech of all of the similar words, step S216 is executed.

In step S214, if the part-of-speech table records the parts of speech of only some of the similar words, the similar word of known part of speech with the highest similarity to the target word is obtained from the similar words recorded in the part-of-speech table.

If the part-of-speech table records the parts of speech of only some of the similar words, that is, the part of speech of not every similar word is known, then the word with the highest similarity to the target word is obtained from among the similar words of known part of speech.

Specifically, in the embodiment of the present invention, the similar words can be sorted in descending order of similarity to the target word to obtain a similar word sequence. Using the preset part-of-speech table, it is judged whether the first-ranked word in the similar word sequence is recorded in the part-of-speech table. If it is recorded, the first-ranked similar word is taken as the similar word of known part of speech with the highest similarity to the target word. If it is not recorded, it is judged whether the part of speech of the second-ranked word is known; if it is recorded, the second-ranked similar word is taken as the similar word of known part of speech with the highest similarity to the target word. If it is not recorded either, the third, fourth, ..., K-th ranked words are checked in turn until a similar word recorded in the part-of-speech table is found, and that similar word is taken as the similar word of known part of speech with the highest similarity to the target word.

In step S215, the part of speech of the similar word of known part of speech with the highest similarity to the target word is determined as the part of speech of the target word.

In the embodiment of the present invention, because words with high similarity are more likely to share a part of speech, the part of speech of the similar word of known part of speech with the highest similarity is determined as the part of speech of the target word.
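The search described in steps S214 and S215 can be sketched as a walk down the sorted similar words. The function name `pos_from_best_known`, the dictionary representation of the part-of-speech table, and the sample data are assumptions for illustration.

```python
def pos_from_best_known(similar, pos_table):
    """Return the part of speech of the most similar word recorded in pos_table."""
    ranked = sorted(similar.items(), key=lambda item: item[1], reverse=True)
    for word, _ in ranked:
        if word in pos_table:          # first recorded word is the most similar one
            return pos_table[word]
    return None                        # no similar word has a recorded part of speech

similar = {"A": 0.91, "B": 0.85, "C": 0.77}   # hypothetical similar words
pos_table = {"B": "noun", "C": "verb"}        # "A" has no recorded part of speech
target_pos = pos_from_best_known(similar, pos_table)
```

Here "A" is skipped because its part of speech is unknown, and the target word takes the part of speech of "B".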

The method of determining the part of speech of the target word in steps S212 to S215 makes full use of the principle that highly similar words are more likely to share a part of speech, and uses the most similar of the other words to determine the part of speech of the target word, so the part of speech is determined quickly and with higher accuracy.

In step S216, if the part-of-speech table records the parts of speech of all of the similar words, the part of speech that occurs most often among the parts of speech of the similar words is counted.

In the embodiment of the present invention, the preset part-of-speech table is queried to check whether the part of speech of each word in the similar word sequence is known. If there are words of unknown part of speech, they are marked, for example by adding a marker symbol U to each word of unknown part of speech.

In step S217, the part of speech with the highest number of occurrences is determined as the part of speech of the target word.

In the embodiment of the present invention, based on the principle that the target word is likely to have the same part of speech as its similar words, the part of speech that occurs most often among the similar words is determined as the part of speech of the target word.

For example, let the set of words of known part of speech in the word set {W₁, W₂, ..., W_N} be {U₁, U₂, ..., U_L}, and let the similar word sequence of the target word W be {W₁, W₂, ..., W_K}, where K < N. If {W₁, W₂, ..., W_K} belongs to {U₁, U₂, ..., U_L}, the part of speech of W can be taken as the part of speech that occurs most often in {W₁, W₂, ..., W_K}.
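The majority choice of steps S216 and S217 can be sketched with a counter. The function name `pos_by_majority` and the sample data are hypothetical; returning None when some similar word is unrecorded, and breaking ties by first occurrence, are assumptions not specified in the text.

```python
from collections import Counter

def pos_by_majority(similar_words, pos_table):
    """Return the most frequent part of speech if every similar word is recorded."""
    if not similar_words or not all(w in pos_table for w in similar_words):
        return None  # this branch applies only when all parts of speech are known
    counts = Counter(pos_table[w] for w in similar_words)
    return counts.most_common(1)[0][0]

pos_table = {"A": "noun", "B": "verb", "C": "noun"}   # hypothetical table
target_pos = pos_by_majority(["A", "B", "C"], pos_table)
```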

In the method of determining the part of speech of the target word in steps S216 to S217, when the parts of speech of all similar words are known, the part of speech that occurs most often is chosen as the part of speech of the target word. This method does not need to search for the similar word with the highest similarity and is therefore simpler.

In step S218, the target word and its part of speech are added to the part-of-speech table to obtain an updated part-of-speech table, which is used to determine the part of speech of the next target word.

In the embodiment of the present invention, after the part of speech of the target word is determined, the target word can be added to the part-of-speech table to increase the number of words in the table. After the part-of-speech table is updated, if the target word whose part of speech was last determined is a similar word of the next target word, it can serve as a basis for determining the part of speech of that next target word.

In summary, in addition to the beneficial effects of the method shown in embodiment one, the method for determining the part of speech of a word provided in this embodiment of the present invention uses the word2vec word vector learning method to determine the word vector of each word in the word set, calculates the similarity between the target word and the other words in the word set from the word vectors, and thereby obtains the similar words of the target word. Depending on whether the parts of speech of the similar words are all known or only partly known, a different method is used to determine the part of speech of the target word. The principle is simple, the method is easy to carry out, and the part of speech of the target word is determined quickly. The method also introduces the first similarity threshold, which guarantees that the words relied on when determining the part of speech of the target word are highly similar to it, thereby supporting the accuracy of the result.

Fig. 4 is a flow chart of a third method for determining the part of speech of a word according to an exemplary embodiment. As shown in Fig. 4, the method for judging the accuracy of a word's part of speech is used in a terminal. It can be executed after step S217 of the embodiment of the present invention, or after the parts of speech of all target words have been determined, and specifically includes the following steps.

In step S301, the similar words of the target word are arranged in descending order of similarity to the target word to obtain the similar word sequence of the target word.

In the embodiment of the present invention, the words in the similar word sequence are arranged in descending order of similarity to the target word. The word with the highest similarity to the target word is first in the sequence, with serial number 1; the word with the second-highest similarity is second, with serial number 2; and so on, giving the similar word sequence of the target word.

In step S302, the accuracy of the part of speech of the target word is judged, and the target words whose part of speech is accurate are obtained. Optionally, the accuracy of the part of speech of the target word can be judged using steps S303 to S306, or using the scheme provided by steps S307 to S309.

After step S301 has been completed, an instruction can be input to select whether to execute step S303 or step S307, or to execute steps S303 and S307 at the same time.

In step S303, the word at a first set serial number in the similar word sequence is obtained as a first reference word.

Specifically, the first set serial number can be set according to the actual situation; typically a word with a high similarity to the target word is chosen as the first reference word. Preferably, the first set serial number is set to 1, in which case the first reference word is the first-ranked word in the similar word sequence.

In step S304, it is detected whether the part of speech of the first reference word is recorded in the part-of-speech table.

Optionally, the part-of-speech table in this embodiment is the updated part-of-speech table obtained by adding the target word and its part of speech to the part-of-speech table.

After the parts of speech of all target words in the target text have been determined, the target words and their parts of speech can be added to the part-of-speech table to obtain the updated part-of-speech table. Because the updated part-of-speech table contains many newly added target words, a newly added target word that happens to be a similar word of the target word whose part-of-speech accuracy is being judged can provide a basis for that judgment.

This step detects whether the part of speech of the first reference word is known. If the updated part-of-speech table does not record the part of speech of the first reference word, its part of speech is unknown, and another word can be chosen as the first reference word, for example the word with serial number 2 in the similar word sequence. If the part of speech of the first reference word is still unknown, step S307 is executed. If the updated part-of-speech table records the part of speech of the first reference word, step S305 is executed.

In step S305, if so, it is judged whether the part of speech of the target word is consistent with the part of speech of the first reference word.

If the part of speech of the first reference word is known, it is judged whether the part of speech of the target word is consistent with it. If they are consistent, step S306 is executed; if they are inconsistent, the part of speech of the target word is determined to be suspected inaccurate.

It should be noted that, in the embodiments of the present disclosure, if the part of speech of a target word is determined to be suspected inaccurate, the part of speech of that target word can be determined again.

In an alternative embodiment, if the part of speech of a target word is determined to be suspected inaccurate, step S314 can be executed.

In step S306, if so, it is determined that the part of speech of the target word is accurate.

In the embodiment of the present invention, if the part of speech of the target word is consistent with that of the first reference word, the target word has the same part of speech as the word most similar to it, so the part of speech of the target word can be preliminarily judged to be accurate.

For example, if the first reference word is the word with serial number 1 in the sequence, the part of speech of the target word is noun, and the part of speech of the word with serial number 1 is also noun, then the part of speech of the target word is preliminarily judged to be accurate.
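The check in steps S303 to S306 can be sketched as follows. The function name `check_with_first_reference`, the three string results, and the sample data are hypothetical; the "unknown" result stands for the case in which another reference word or step S307 would be tried.

```python
def check_with_first_reference(target_pos, similar_sequence, pos_table, serial=1):
    """similar_sequence is sorted by descending similarity; serial is 1-based."""
    reference = similar_sequence[serial - 1]          # step S303
    if reference not in pos_table:                    # step S304
        return "unknown"
    if pos_table[reference] == target_pos:            # step S305
        return "accurate"                             # step S306
    return "suspect"

similar_sequence = ["A", "B"]          # hypothetical similar word sequence
pos_table = {"A": "noun"}              # "B" has no recorded part of speech
result = check_with_first_reference("noun", similar_sequence, pos_table)
```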

In step S307, the words ranked before a second set serial number in the similar word sequence are obtained as second reference words.

Specifically, the second set serial number can be set according to the actual situation. Preferably, if the similar word sequence contains K words in total, the second set serial number can be set to K/2 or K. If the second set serial number is K/2, the second reference words are the words in the first half of the similar word sequence.

In step S308, the number of second reference words whose part of speech is the same as that of the target word is counted.

In the embodiment of the present invention, the number of second reference words whose part of speech is consistent with that of the target word is counted. If the proportion of this number in the total number of second reference words is greater than a preset proportion threshold, step S309 is executed; otherwise, the part of speech of the target word is determined to be suspected inaccurate.

In step S309, if the proportion of this number in the total number of second reference words is greater than the preset proportion threshold, the part of speech of the target word is determined to be accurate.

In the embodiment of the present invention, the preset proportion threshold can be set according to the actual situation, for example to 0.5. In that case, if more than half of the second reference words have the same part of speech as the target word, the part of speech of the target word is determined to be accurate.

For example, if the part of speech of the target word is noun and a proportion of 0.7 of the second reference words are also nouns, then, based on the principle that the target word is likely to share its part of speech with words highly similar to it, the part of speech of the target word is determined to be accurate.
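The proportion check of steps S307 to S309 can be sketched as follows. The function name and sample data are hypothetical. Taking the first `second_serial` words of the sequence as the second reference words is an interpretation of "ranked before the second set serial number", chosen so that a value of K covers all K words.

```python
def check_with_second_references(target_pos, similar_sequence, pos_table,
                                 second_serial, ratio_threshold=0.5):
    """Judge accuracy by the share of reference words with the same part of speech."""
    references = similar_sequence[:second_serial]     # step S307 (assumed inclusive)
    if not references:
        return "suspect"
    same = sum(1 for w in references if pos_table.get(w) == target_pos)  # step S308
    ratio = same / len(references)
    return "accurate" if ratio > ratio_threshold else "suspect"          # step S309

similar_sequence = ["A", "B", "C", "D"]
pos_table = {"A": "noun", "B": "noun", "C": "verb", "D": "noun"}
result = check_with_second_references("noun", similar_sequence, pos_table,
                                      second_serial=4)
```

With these toy values, three of the four reference words are nouns, so the proportion 0.75 exceeds the threshold 0.5.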

To further ensure the accuracy of the judgment result, steps S310 to S312 can additionally be executed:

In step S310, the target words whose part of speech is accurate are taken as a first accurate word set.

In the embodiment of the present invention, after the parts of speech of the target words have been judged, the words judged to have an accurate part of speech are collected to obtain the first accurate word set. The accuracy of the parts of speech of the words in the first accurate word set can subsequently be judged again.

In step S311, the word at a third set serial number in the similar word sequence is obtained as a third reference word.

In the embodiment of the present invention, the third set serial number can also be preset according to the actual situation. Optionally, the third set serial number is 1 or 2, and correspondingly the third reference word is the word ranked 1st or 2nd in the similar word sequence.

In step S312, if the third reference word belongs to the first accurate word set, the part of speech of the target word is the same as the part of speech of the third reference word, and the similarity between the target word and the third reference word is greater than a second similarity threshold, the part of speech of the target word is finally determined to be accurate; wherein the second similarity threshold is greater than the first similarity threshold.

In the embodiment of the present invention, if the third reference word belongs to the first accurate word set, the third reference word is a more reliable reference for judging part-of-speech accuracy. When the part of speech of the target word is the same as that of the third reference word, the similarity between the target word and the third reference word can be used for a further check: if the similarity is greater than the second similarity threshold th2, the part of speech of the target word is again determined to be accurate. To ensure the accuracy of this second judgment, the second similarity threshold th2 can be set greater than the first similarity threshold th1.
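The three conditions of step S312 can be combined into one predicate, sketched below. The function name `finally_accurate`, the sample data, and the threshold value 0.8 are hypothetical.

```python
def finally_accurate(target_pos, third_reference, first_accurate_set,
                     pos_table, similarity, th2):
    """All three conditions of step S312 must hold."""
    return (third_reference in first_accurate_set
            and pos_table.get(third_reference) == target_pos
            and similarity > th2)

first_accurate_set = {"A"}             # hypothetical first accurate word set
pos_table = {"A": "noun"}
confirmed = finally_accurate("noun", "A", first_accurate_set, pos_table,
                             similarity=0.9, th2=0.8)
```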

In the embodiment of the present invention, the two judgments above screen out the target words whose part of speech is judged accurate; the target words whose part of speech was wrongly determined can subsequently be screened out so that their parts of speech can be corrected.

Optionally, after the part of speech of the target word is finally determined to be accurate, the target words whose part of speech is wrong can also be determined among the target words, through the following steps S313 to S316:

In step S313, the target words whose part of speech is finally determined to be accurate are taken as a second accurate word set.

In the embodiment of the present invention, the target words judged to have an accurate part of speech by both judgments are collected to obtain the second accurate word set.

In step S314, the target words determined to have a suspected inaccurate part of speech are obtained, giving a suspected wrong word set.

Correspondingly, after the accuracy of the parts of speech of the target words has been judged in the steps above, some target words will have been judged suspected inaccurate; these words are collected to obtain the suspected wrong word set.

In step S315, the similar word set corresponding to each suspected wrong word in the suspected wrong word set is obtained, giving a suspected-wrong-word similar word set.

Specifically, for each word in the suspected wrong word set, its corresponding similar word set is obtained; the similar words of all suspected wrong words are gathered together and named the suspected-wrong-word similar word set.

If the suspected-wrong-word similar word set and the second accurate word set contain no identical word, or the part of speech of the identical word is the same as the part of speech of the target word, the part of speech of the target word is determined not to be wrong.

In step S316, if the similar word set of the suspected-error words and the second accurate word set contain an identical word, and the part of speech of the identical word differs from that of the target word, the part of speech of the target word is determined to be wrong.

In this embodiment of the present invention, if the similar word set of the suspected-error words contains a word that also belongs to the second accurate word set, that word can be called an intersection word. Because the intersection word belongs to the second accurate word set, its part of speech is accurate. If the similarity between the target word and the intersection word is high, that is, higher than the first similarity threshold th1, the target word is very likely to have the same part of speech as the intersection word. In that case, if the part of speech of the target word is inconsistent with that of the intersection word, the part of speech of the target word is wrong.
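The check in steps S313 to S316 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `find_pos_errors` and the dictionary-based data layout are assumptions made for the example, and the default `th1=0.5` is taken from the experimental setting reported in this description.

```python
def find_pos_errors(suspected, similar_words, accurate_pos, pos_of, th1=0.5):
    """Return the suspected target words whose part of speech is judged wrong.

    suspected:     iterable of suspected-error target words (step S314)
    similar_words: dict mapping a target word to a list of
                   (similar_word, similarity) pairs (step S315)
    accurate_pos:  dict mapping each word of the second accurate word set
                   to its part of speech (step S313)
    pos_of:        dict mapping each target word to its current part of speech
    """
    wrong = []
    for target in suspected:
        for other, sim in similar_words.get(target, []):
            # An intersection word is a similar word that also belongs to
            # the second accurate word set.
            is_intersection = other in accurate_pos
            if is_intersection and sim > th1 and accurate_pos[other] != pos_of[target]:
                wrong.append(target)  # step S316: part of speech is wrong
                break
    return wrong
```

A target word is reported only when at least one sufficiently similar intersection word disagrees with it; words with no intersection word are left undecided.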

Optionally, after the part of speech of a target word is determined to be wrong, the part of speech of that target word is corrected. The correction method can comprise the following steps S317 to S319.

In step S317, the similar words corresponding to the target word with the wrong part of speech are obtained as the error-correction word set, and the number of occurrences of each part of speech in the error-correction word set is counted.

In this embodiment of the present invention, the similar words corresponding to the word with the wrong part of speech are obtained as the error-correction word set, the part of speech of each word in the set is collected, and the number of times each part of speech occurs is calculated.

In step S318, if the error-correction word set contains a part of speech whose number of occurrences is greater than or equal to 2, the part of speech of the target word with the wrong part of speech is replaced with the part of speech that occurs most often.

In this embodiment of the present invention, if the error-correction word set contains a part of speech whose number of occurrences is greater than or equal to 2, the parts of speech in the error-correction word set are clustered. The more often a part of speech occurs, the more likely it is that the target word has that part of speech. Therefore, the part of speech that occurs most often in the error-correction word set replaces the part of speech of the target word with the wrong part of speech.

In step S319, if the error-correction word set contains no part of speech whose number of occurrences is greater than or equal to 2, the part of speech of the target word with the wrong part of speech is replaced with the part of speech of the word in the error-correction word set that has the highest similarity to that target word.

In this embodiment of the present invention, if the error-correction word set contains no part of speech whose number of occurrences is greater than or equal to 2, the words in the error-correction word set all have different parts of speech, so the part of speech of the target word cannot be determined from their distribution. In this case, the word with the highest similarity to the target word can be selected from the error-correction word set, and its part of speech replaces that of the target word with the wrong part of speech, completing the correction.
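Steps S317 to S319 amount to a majority vote with a nearest-neighbour fallback. The sketch below is one possible reading under stated assumptions: the name `correct_pos` and the data layout are hypothetical, words missing from the part-of-speech table are simply skipped, and ties between equally frequent parts of speech are broken by first occurrence, none of which is specified in the text.

```python
from collections import Counter


def correct_pos(error_set, pos_table):
    """Return a corrected part of speech for one wrongly tagged target word.

    error_set: list of (similar_word, similarity) pairs, i.e. the
               error-correction word set of the target word (step S317)
    pos_table: dict mapping a word to its known part of speech
    """
    # Assumption: only similar words with a recorded part of speech can vote.
    known = [(w, s) for w, s in error_set if w in pos_table]
    if not known:
        return None  # nothing to base a correction on

    counts = Counter(pos_table[w] for w, _ in known)
    pos, occurrences = counts.most_common(1)[0]
    if occurrences >= 2:
        # Step S318: some part of speech clusters, so take the most frequent.
        return pos

    # Step S319: every part of speech occurs once, so fall back to the
    # part of speech of the most similar word.
    best_word, _ = max(known, key=lambda pair: pair[1])
    return pos_table[best_word]
```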

In conclusion in embodiments of the present invention, determined twice by the part of speech accuracy to the target word, Finally determine the target word of the doubtful inaccuracy of part of speech in the target word, and when the part of speech for determining target word is doubtful After inaccuracy, the part of speech of the target word is corrected.The above method can accurately distinguish out the mesh that part of speech is determined mistake Mark word, and can the target word words and phrases accurately to part of speech mistake correct so that implementing through the invention The part of speech that the method that part of speech is determined in example determines has obtained effective inspection, and accurately corrects for the part of speech of mistake.

Table 1 is the table that 6 kinds of target words shown according to an exemplary embodiment determine part of speech using the present invention program Schematic diagram.

Table 1

In a specific implementation, a new word discovery method based on improved mutual information and adjacent entropy was applied to 80,000,000 video comments from a target website, finding 1,500,000 new words in total. The 1,500,000 new words were added to a new word dictionary and taken as the target words whose part of speech needs to be determined. The new word dictionary was added to the original 350,000-word dictionary of the jieba word segmenter, and the 80,000,000 video comments were segmented with the jieba segmentation method based on the new dictionary to obtain the word set. The word vector of each word in the word set was looked up in a preset word vector data table, which was obtained by training a skip-gram word vector model. During training, the one-hot vector dimension was set to 200, the skip-gram window parameter was set to 2, the initial learning rate was set to 0.025, and negative sampling was used, yielding the word vectors of the training words. Taking the parts of speech of the original 350,000 words of the jieba dictionary as reference, the parts of speech of the target words were calculated, with the number of words K in the similar word sequence set to 8 and the first similarity threshold th1 set to 0.5, giving the parts of speech of the 1,500,000 target words. Suspected wrong parts of speech were then screened and corrected, with the second similarity threshold th2 set to 0.7, giving the corrected parts of speech of the suspected-error target words and thus the final parts of speech of the 1,500,000 target words. From the 1,500,000 target words with a final part of speech, 20,000 words were randomly selected and their parts of speech were judged manually. The results show that the accuracy of the parts of speech determined by the method of this embodiment of the present invention is higher than 90%.

Table 1 lists the similar word sequences and corresponding similarities of 6 target words, "people sets", "beating call", "net is red", "old iron", "quick worker" and "not having defect", together with the part of speech finally determined for each target word. The results show that the method of this embodiment of the present invention calculates parts of speech accurately.

In Table 1, "target word" is the target word whose part of speech is to be determined, "similar word" is a similar word corresponding to the target word, "reference" is the first reference word used to judge part-of-speech accuracy, "similarity" is the similarity between the target word and the first reference word, and "part of speech" is the part of speech finally determined. The finally determined parts of speech show that the method of this embodiment determines the parts of speech of the 6 target words with high accuracy.

Fig. 5 is a block diagram of a first device for determining the part of speech of a word according to an exemplary embodiment. Referring to Fig. 5, the device includes a target word acquisition module 501, a similarity determination module 502, and a part-of-speech determination module 503.

The target word acquisition module 501 is configured to obtain the word set corresponding to a target text, and to obtain from the word set a target word whose part of speech needs to be determined;

The similarity determination module 502 is configured to determine, according to the word vector corresponding to each word in the word set, the similarity between the target word and the other words in the word set, wherein the other words include words of known part of speech;

The part-of-speech determination module 503 is configured to determine the part of speech of the target word based on the similarity between the target word and the words of known part of speech among the other words.

In the device for determining the part of speech of a word provided by this embodiment of the present invention, the word set corresponding to the target text is obtained, and a target word whose part of speech needs to be determined is obtained from the word set; the similarity between the target word and the other words in the word set is determined according to the word vector corresponding to each word in the word set; and the part of speech of the target word is finally determined based on the similarity between the target word and the words of known part of speech among the other words. This approach relies on the principle that words with the same or similar context have higher similarity, and that the more similar two words are, the more likely they are to share a part of speech. The part of speech of the target word is calculated automatically from its similarity to words of known part of speech, which is more efficient and more accurate than obtaining parts of speech through manual annotation.

Fig. 6 is a block diagram of a second device for determining the part of speech of a word according to an exemplary embodiment. Referring to Fig. 6, the device 600 includes a target word acquisition module 601, a similarity determination module 602, and a part-of-speech determination module 603.

The target word acquisition module 601 is configured to obtain the word set corresponding to a target text, and to obtain from the word set a target word whose part of speech needs to be determined;

The similarity determination module 602 is configured to determine, according to the word vector corresponding to each word in the word set, the similarity between the target word and the other words in the word set, wherein the other words include words of known part of speech;

The part-of-speech determination module 603 is configured to determine the part of speech of the target word based on the similarity between the target word and the words of known part of speech among the other words.

Optionally, the part-of-speech determination module 603 includes:

A similar word selection submodule 6031, configured to select, from the other words, the words whose similarity to the target word is greater than the first similarity threshold as the similar words of the target word;

A part-of-speech detection submodule 6032, configured to detect whether the part of speech of each of the similar words is recorded in the part-of-speech table;

A first similar word acquisition submodule 6033, configured to obtain, if the part-of-speech table records the parts of speech of only some of the similar words, the similar word of known part of speech with the highest similarity to the target word from among the similar words recorded in the part-of-speech table;

A first part-of-speech determination submodule 6034, configured to determine the part of speech of the similar word of known part of speech with the highest similarity to the target word as the part of speech of the target word.

Optionally, the part-of-speech determination module 603 further includes:

A part-of-speech statistics submodule 6035, configured to count, if the part-of-speech table records the parts of speech of all of the similar words, the part of speech that occurs most often among the parts of speech of the similar words;

A second part-of-speech determination submodule 6036, configured to determine the part of speech that occurs most often as the part of speech of the target word.

Optionally, the first similar word acquisition submodule 6033 includes:

A first similar word acquisition unit, configured to detect, in descending order of similarity to the target word, whether the part of speech of each similar word is recorded in the part-of-speech table, until a similar word recorded in the part-of-speech table is found, and then to take that similar word as the similar word of known part of speech with the highest similarity to the target word.

Optionally, the device 600 further includes:

A part-of-speech table update module 604, configured to add the target word and its part of speech to the part-of-speech table to obtain an updated part-of-speech table, the updated part-of-speech table being used to determine the part of speech of the next target word.
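The behaviour of submodules 6031 to 6036 and the update module 604 can be illustrated with a short sketch. The function name `determine_pos`, the dictionary layout, and the choice to return `None` when no similar word has a recorded part of speech are assumptions made for the example; the default `th1=0.5` follows the experimental setting in this description.

```python
from collections import Counter


def determine_pos(target, similarities, pos_table, th1=0.5):
    """Determine the part of speech of `target` and record it in `pos_table`.

    similarities: dict mapping each other word to its similarity to `target`
    pos_table:    dict mapping a word to its known part of speech (mutated)
    """
    # 6031: keep words above the first similarity threshold, most similar first.
    similar = sorted(
        (w for w, s in similarities.items() if s > th1),
        key=lambda w: similarities[w],
        reverse=True,
    )
    # 6032: detect which similar words have a recorded part of speech.
    known = [w for w in similar if w in pos_table]
    if not known:
        return None  # assumption: this case is handled elsewhere

    if len(known) == len(similar):
        # 6035/6036: all parts of speech are known, so take the most frequent.
        pos = Counter(pos_table[w] for w in known).most_common(1)[0][0]
    else:
        # 6033/6034: only some are known, so take the first recorded word
        # when scanning in descending order of similarity.
        pos = pos_table[known[0]]

    # 604: the updated table is used for the next target word.
    pos_table[target] = pos
    return pos
```

Because the table is updated after each call, the order in which target words are processed can influence later results.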

Optionally, the similarity determination module 602 includes:

A word vector acquisition submodule 6021, configured to obtain the word vector corresponding to each word in the word set from a preset word vector data table;

An inner product calculation submodule 6022, configured to calculate the inner product between the word vector of the target word and the word vector of each other word in the word set;

A product calculation submodule 6023, configured to calculate the product of the 2-norm of the word vector of the target word and the 2-norm of the word vector of each other word;

A similarity determination submodule 6024, configured to determine the similarity between the target word and the other words according to the ratio of the inner product to the product.
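The ratio computed by submodules 6022 to 6024 is the cosine similarity of two word vectors. A minimal pure-Python sketch follows; the function name `cosine_similarity` and the convention of returning 0.0 for a zero vector are assumptions for illustration.

```python
import math


def cosine_similarity(u, v):
    """Ratio of the inner product of u and v to the product of their 2-norms."""
    inner = sum(a * b for a, b in zip(u, v))       # submodule 6022
    norm_product = (                               # submodule 6023
        math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    )
    if norm_product == 0.0:
        return 0.0  # assumption: treat a zero vector as dissimilar to everything
    return inner / norm_product                    # submodule 6024
```

In practice the vectors would come from the preset word vector data table, and a vectorised library would typically be used instead of Python loops.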

Optionally, the word vector acquisition submodule 6021 includes:

Optionally, the target vector establishing unit includes:

In conclusion the device 600 of determining word part of speech provided in an embodiment of the present invention, removing has shown in Fig. 5 really Determine outside beneficial effect possessed by the device 500 of word part of speech, word collection is also determined using word2vec term vector learning method It is similar between other words to set of words to calculate target word according to term vector again for the term vector of each word in conjunction Degree, and then the similar word of target word is obtained, and according to known to word part of speech in similar word or unknown, be respectively adopted not Same method determines the part of speech of target word, and principle is simple, strong operability, determines that the speed of target word words and phrases is fast；And The first similarity threshold is introduced in the method, the word of foundation and target word have when ensure that determining target word words and phrases There is high similarity, it is ensured that the accuracy of the definitive result of target word words and phrases.

Fig. 7 is a block diagram of a third device for determining the part of speech of a word according to an exemplary embodiment. Referring to Fig. 7, the device 700 includes: a similar word sequence acquisition module 701, a first reference word acquisition module 702, a detection module 703, a judgement module 704, and a first part-of-speech accuracy determination module 705.

The similar word sequence acquisition module 701 is configured to arrange the similar words of the target word in descending order of their similarity to the target word, obtaining the similar word sequence of the target word;

The first reference word acquisition module 702 is configured to obtain the word at a first set ordinal in the similar word sequence as the first reference word;

The detection module 703 is configured to detect whether the part of speech of the first reference word is recorded in the part-of-speech table;

The judgement module 704 is configured to judge, if so, whether the part of speech of the target word is consistent with the part of speech of the first reference word;

The first part-of-speech accuracy determination module 705 is configured to determine, if so, that the part of speech of the target word is accurate.

Optionally, the device 700 further includes:

A second reference word acquisition module 706, configured to obtain the words ranked before a second set ordinal in the similar word sequence as the second reference words;

A same-part-of-speech word counting module 707, configured to count the number of second reference words whose part of speech is the same as that of the target word;

A second part-of-speech accuracy determination module 708, configured to determine that the part of speech of the target word is accurate if the proportion of that number in the total number of second reference words is greater than a preset proportion threshold.
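The two checks performed by modules 701 to 708 can be sketched as below. Several details are assumptions made for the example and are not stated in this passage: the first set ordinal is taken to be the top of the sequence (`first_idx=0`), the second reference words are the top three (`second_n=3`), the proportion threshold is 0.5, and the second check is applied as a fallback when the first one does not confirm the part of speech. The name `is_pos_accurate` is hypothetical.

```python
def is_pos_accurate(target_pos, sequence, pos_table,
                    first_idx=0, second_n=3, ratio_th=0.5):
    """First-round accuracy judgement for one target word.

    target_pos: part of speech currently assigned to the target word
    sequence:   similar words sorted by descending similarity (module 701)
    pos_table:  dict mapping a word to its known part of speech
    """
    if not sequence:
        return False

    # Modules 702-705: compare with the first reference word if it is recorded.
    if first_idx < len(sequence):
        first_ref = sequence[first_idx]
        if first_ref in pos_table and pos_table[first_ref] == target_pos:
            return True

    # Modules 706-708: proportion of second reference words that share the
    # target word's part of speech.
    second_refs = sequence[:second_n]
    same = sum(1 for w in second_refs if pos_table.get(w) == target_pos)
    return same / len(second_refs) > ratio_th
```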

Optionally, the device 700 further includes:

A first accurate word set determination module 709, configured to take the target words whose part of speech is accurate as the first accurate word set;

A third reference word acquisition module 710, configured to obtain the word at a third set ordinal in the similar word sequence as the third reference word;

A final part-of-speech accuracy determination module 711, configured to finally determine that the part of speech of the target word is accurate if the third reference word belongs to the first accurate word set, the part of speech of the target word is the same as that of the third reference word, and the similarity between the target word and the third reference word is greater than the second similarity threshold; wherein the second similarity threshold is greater than the first similarity threshold.
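The final judgement made by modules 709 to 711 combines three conditions. In the sketch below, the name `is_finally_accurate`, the data layout, and the choice of `third_idx=0` for the third set ordinal are assumptions; the default `th2=0.7` is the value used in the experiment reported in this description.

```python
def is_finally_accurate(target_pos, sequence, first_accurate, pos_table,
                        third_idx=0, th2=0.7):
    """Second-round accuracy judgement for one target word.

    target_pos:     part of speech currently assigned to the target word
    sequence:       list of (similar_word, similarity) pairs sorted by
                    descending similarity
    first_accurate: set of words in the first accurate word set (module 709)
    pos_table:      dict mapping a word to its part of speech
    """
    if third_idx >= len(sequence):
        return False
    third_ref, similarity = sequence[third_idx]  # module 710
    # Module 711: all three conditions must hold.
    return (
        third_ref in first_accurate
        and pos_table.get(third_ref) == target_pos
        and similarity > th2
    )
```

Target words that fail this check would be candidates for the suspected-error word set.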

Optionally, the device 700 further includes:

A second accurate word set determination module 712, configured to take the target words whose part of speech has finally been determined to be accurate as the second accurate word set;

A suspected-error word set acquisition module 713, configured to obtain the target words determined to be suspected inaccurate in part of speech, yielding the suspected-error word set;

A suspected-error similar word set acquisition module 714, configured to obtain the similar word set corresponding to each suspected-error word in the suspected-error word set, yielding the similar word set of the suspected-error words;

A part-of-speech error determination module 715, configured to determine that the part of speech of the target word is wrong if the similar word set of the suspected-error words and the second accurate word set contain an identical word and the part of speech of the identical word differs from that of the target word.

Optionally, the device 700 further includes:

An error-correction word set acquisition module 716, configured to obtain the similar words corresponding to the target word with the wrong part of speech as the error-correction word set, and to count the number of occurrences of each part of speech in the error-correction word set;

A first replacement module 717, configured to replace the part of speech of the target word with the wrong part of speech with the part of speech that occurs most often, if the error-correction word set contains a part of speech whose number of occurrences is greater than or equal to 2;

A second replacement module 718, configured to replace the part of speech of the target word with the wrong part of speech with the part of speech of the word in the error-correction word set that has the highest similarity to that target word, if the error-correction word set contains no part of speech whose number of occurrences is greater than or equal to 2.

In conclusion determining word part of speech device 700 provided in an embodiment of the present invention, passes through the word to the target word Property accuracy determined twice, finally determine the target word of the doubtful inaccuracy of part of speech in the target word, and when sentencing It sets the goal after the doubtful inaccuracy of part of speech of word, the part of speech of the target word is corrected.The above method can accurately distinguish Out part of speech be determined mistake target word, and can the target word words and phrases accurately to part of speech mistake correct, into And the part of speech that the method that part of speech is determined in through the embodiment of the present invention is determined has obtained effective inspection, and accurately corrects The part of speech of mistake.

About the device in above-described embodiment, wherein modules execute the concrete mode of operation in related this method Embodiment in be described in detail, no detailed explanation will be given here.

Fig. 8 is a block diagram of an electronic device 800 for determining the part of speech of a word according to an exemplary embodiment. For example, the electronic device 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, and the like.

Referring to Fig. 8, the electronic device 800 may include one or more of the following components: a processing component 802, a memory 804, a power supply component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.

The processing component 802 generally controls the overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communication, camera operation and recording operation. The processing component 802 may include one or more processors 820 to execute instructions, so as to perform all or part of the steps of the method described above. In addition, the processing component 802 may include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.

The memory 804 is configured to store various types of data to support operation of the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phone book data, messages, pictures, videos, and so on. The memory 804 can be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk or an optical disc.

The power supply component 806 provides power for the various components of the electronic device 800. The power supply component 806 may include a power management system, one or more power supplies, and other components associated with generating, managing and distributing power for the electronic device 800.

The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides and gestures on the touch panel. The touch sensor can sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera can be a fixed optical lens system or have focusing and optical zoom capabilities.

The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a microphone (MIC), which is configured to receive external audio signals when the electronic device 800 is in an operation mode, such as a call mode, a recording mode or a voice recognition mode. The received audio signal can be further stored in the memory 804 or sent via the communication component 816. In some embodiments, the audio component 810 further includes a loudspeaker for outputting audio signals.

The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to: a home button, a volume button, a start button and a lock button.

The sensor component 814 includes one or more sensors for providing state assessments of various aspects of the electronic device 800. For example, the sensor component 814 can detect the open/closed state of the electronic device 800 and the relative positioning of components, such as the display and keypad of the electronic device 800. The sensor component 814 can also detect a change in position of the electronic device 800 or of one of its components, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and a change in its temperature. The sensor component 814 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.

The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 can access a wireless network based on a communication standard, such as WiFi, a carrier network (such as 2G, 3G, 4G or 5G), or a combination thereof. In one exemplary embodiment, the communication component 816 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 816 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module can be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.

In an exemplary embodiment, the electronic device 800 can be implemented by one or more application-specific integrated circuits (ASIC), digital signal processors (DSP), digital signal processing devices (DSPD), programmable logic devices (PLD), field-programmable gate arrays (FPGA), controllers, microcontrollers, microprocessors or other electronic components, for performing the above method.

In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 804 including instructions, and the instructions can be executed by the processor 820 of the electronic device 800 to complete the above method. For example, the non-transitory computer-readable storage medium can be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

Fig. 9 is a block diagram of an electronic device 900 for determining the part of speech of a word according to an exemplary embodiment. For example, the electronic device 900 may be provided as a server. Referring to Fig. 9, the electronic device 900 includes a processing component 922, which further comprises one or more processors, and memory resources represented by a memory 932 for storing instructions executable by the processing component 922, such as application programs. The application programs stored in the memory 932 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 922 is configured to execute the instructions so as to perform the method described above.

The electronic device 900 may also include a power supply component 928 configured to perform power management of the electronic device 900, a wired or wireless network interface 950 configured to connect the electronic device 900 to a network, and an input/output (I/O) interface 959. The electronic device 900 can operate based on an operating system stored in the memory 932, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM or the like.

Those skilled in the art will readily conceive of other embodiments of the present invention after considering the specification and practicing the invention disclosed here. This application is intended to cover any variations, uses or adaptations of the present invention that follow its general principles and include common knowledge or conventional techniques in the art not disclosed in this disclosure. The description and examples are to be regarded as illustrative only, and the true scope and spirit of the present invention are indicated by the following claims.

It should be understood that the present invention is not limited to the precise structure described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present invention is limited only by the appended claims.

Claims

1. a kind of method of determining word part of speech characterized by comprising

The corresponding set of words of target text is obtained, and obtains the target word it needs to be determined that part of speech from the set of words；

According to the corresponding term vector of word each in the set of words, its in the target word and the set of words is determined Similarity between his word, wherein include the word of known part of speech in other described words；

Based on the similarity of the word of known part of speech in the target word and other described words, the target word is determined Part of speech.

2. The method according to claim 1, wherein determining the part of speech of the target word based on the similarity between the target word and the words of known part of speech among the other words comprises:

selecting, from the other words, the words whose similarity to the target word is greater than a first similarity threshold as similar words of the target word;

if the part of speech table records the parts of speech of only some of the similar words, obtaining, from the similar words recorded in the part of speech table, the similar word of known part of speech that has the highest similarity to the target word;

determining the part of speech of that similar word of known part of speech as the part of speech of the target word.
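The selection in claim 2 can be sketched in Python as follows. This is only an illustration under stated assumptions: similarities are held in a dict mapping each other word to its similarity score, the part of speech table is a dict from word to part of speech, and the names `similar_words` and `pos_from_best_known` are hypothetical.

```python
def similar_words(similarities, threshold):
    """Keep the words whose similarity to the target exceeds the threshold."""
    return {w: s for w, s in similarities.items() if s > threshold}


def pos_from_best_known(similarities, pos_table, threshold):
    """Return the part of speech of the most similar word recorded in pos_table.

    Returns None when no similar word has a recorded part of speech
    (a case the claim does not address; returning None is an assumption).
    """
    candidates = similar_words(similarities, threshold)
    known = {w: s for w, s in candidates.items() if w in pos_table}
    if not known:
        return None
    best = max(known, key=known.get)
    return pos_table[best]
```

Here the most similar word overall may be absent from the table; the lookup then falls through to the most similar word that is recorded.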

3. The method according to claim 2, wherein after detecting whether the part of speech table records the part of speech of each of the similar words, the method further comprises:

if the part of speech table records the parts of speech of all of the similar words, counting the most frequently occurring part of speech among the parts of speech of the similar words;
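The counting step in claim 3 can be sketched as below. The claim as given here states only the counting; using the winning part of speech as the result for the target word is an assumption of this sketch, as is the hypothetical function name `most_frequent_pos`.

```python
from collections import Counter


def most_frequent_pos(similar, pos_table):
    """Return the most frequent part of speech among the similar words.

    Returns None unless every similar word is recorded in pos_table,
    since the claim applies only when all of them are recorded.
    """
    if not similar or any(w not in pos_table for w in similar):
        return None
    counts = Counter(pos_table[w] for w in similar)
    return counts.most_common(1)[0][0]
```

With `Counter.most_common`, parts of speech with equal counts are returned in the order first encountered, so a tie is resolved by the order of `similar`; the claim does not say how ties should be handled.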

4. The method according to claim 2, wherein obtaining, from the similar words recorded in the part of speech table, the similar word of known part of speech that has the highest similarity to the target word comprises:

in descending order of similarity to the target word, detecting in turn whether the part of speech of each similar word is recorded in the part of speech table, until a similar word recorded in the part of speech table is detected, and then taking that similar word as the similar word of known part of speech that has the highest similarity to the target word.
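The ordered scan of claim 4 can be sketched as a sort followed by an early-exit loop. The dict layout and the function name `first_recorded_similar_word` are assumptions made for illustration.

```python
def first_recorded_similar_word(similar, pos_table):
    """Scan similar words from most to least similar.

    `similar` maps each similar word to its similarity score. The first
    word found in pos_table is returned; None means none is recorded.
    """
    ranked = sorted(similar.items(), key=lambda item: item[1], reverse=True)
    for word, _score in ranked:
        if word in pos_table:
            return word  # stop at the first recorded word
    return None
```

Stopping at the first hit avoids checking the remaining, less similar words against the table.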

5. The method according to claim 2 or 3, wherein after determining the part of speech of the target word based on the similarity between the target word and the words of known part of speech among the other words, the method further comprises:

adding the target word and the part of speech of the target word to the part of speech table to obtain an updated part of speech table, the updated part of speech table being used to determine the part of speech of the next target word.
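The table update in claim 5 means each newly determined word can help with the next one. A minimal sketch, assuming a caller-supplied `infer_pos` function that stands in for the similarity-based lookup; `tag_and_update`, `NEAREST`, `infer_from_nearest` and the placeholder words are all hypothetical.

```python
def tag_and_update(targets, pos_table, infer_pos):
    """Tag each target in turn, recording each result before the next lookup."""
    for target in targets:
        pos = infer_pos(target, pos_table)
        if pos is not None:
            pos_table[target] = pos  # the updated table serves the next target
    return pos_table


# Toy stand-in for the similarity lookup: one made-up nearest neighbour per word.
NEAREST = {"new_word_a": "awesome", "new_word_b": "new_word_a"}


def infer_from_nearest(word, pos_table):
    return pos_table.get(NEAREST.get(word))


table = tag_and_update(
    ["new_word_a", "new_word_b"], {"awesome": "adjective"}, infer_from_nearest
)
```

In this toy run `new_word_b` can only be resolved because `new_word_a` was added to the table first; processing the two in the opposite order would leave `new_word_b` untagged.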

6. The method according to claim 1, wherein determining, according to the term vector corresponding to each word in the set of words, the similarity between the target word and the other words in the set of words comprises:

calculating the inner product between the term vector of the target word and the term vector of each other word in the set of words;

calculating the product of the 2-norm of the term vector of the target word and the 2-norm of the term vector of each other word;

determining the similarity between the target word and the other words according to the ratio of the inner product to the product.
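Taken together, the three steps of claim 6 compute the cosine similarity of two vectors. A minimal sketch, assuming term vectors are plain lists of floats; the function name `cosine_similarity` and the handling of a zero vector are illustrative choices, not part of the claim.

```python
import math


def cosine_similarity(u, v):
    """Ratio of the inner product of u and v to the product of their 2-norms."""
    if len(u) != len(v):
        raise ValueError("term vectors must have the same dimension")
    inner = sum(a * b for a, b in zip(u, v))
    norm_product = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    # Assumption: a zero vector is treated as having similarity 0.0.
    return inner / norm_product if norm_product else 0.0
```

The result lies in the range [-1, 1]: vectors pointing in the same direction score close to 1 regardless of their lengths, and orthogonal vectors score 0.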

7. The method according to claim 6, wherein the preset term vector data table is obtained as follows:

obtaining a training set of words corresponding to each training text;

establishing a target vector corresponding to each word according to the number of occurrences of each word in the training set of words;

in the training set of words, determining the window word corresponding to each word according to a window parameter chosen in advance;

for each combination, using the target vector of the word in the combination as the input of a target model and the target vector of the window word in the combination as the desired output of the target model, training the target model, and taking the vector output by the hidden layer of the target model as the term vector;

adding the words obtained by training on each training text, together with their corresponding term vectors, to the preset term vector data table.
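The data preparation steps of claim 7 can be sketched as below; the model training itself is omitted. The claim does not specify how the target vector is derived from occurrence counts, so the one-hot encoding indexed by frequency rank used here is an assumption, as are the function names `build_target_vectors` and `window_pairs`.

```python
from collections import Counter


def build_target_vectors(tokens):
    """Assumed encoding: one-hot vectors indexed by descending frequency rank.

    Words with equal counts are ordered alphabetically to keep ranks stable.
    """
    counts = Counter(tokens)
    vocab = [w for w, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]
    size = len(vocab)
    return {
        w: [1 if i == rank else 0 for i in range(size)]
        for rank, w in enumerate(vocab)
    }


def window_pairs(tokens, window):
    """Pair each word with every other word at most `window` positions away."""
    pairs = []
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        pairs.extend((word, tokens[j]) for j in range(lo, hi) if j != i)
    return pairs
```

Each `(word, window word)` pair would then supply one training example: the target vector of the first element as model input and that of the second as the desired output.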

8. A device for determining the part of speech of a word, characterized by comprising:

a target word obtaining module, configured to obtain a set of words corresponding to a target text, and to obtain from the set of words a target word whose part of speech needs to be determined;

a similarity determining module, configured to determine, according to the term vector corresponding to each word in the set of words, the similarity between the target word and the other words in the set of words, wherein the other words include words of known part of speech;

a part of speech determining module, configured to determine the part of speech of the target word based on the similarity between the target word and the words of known part of speech among the other words.

9. An electronic device, comprising: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the method for determining the part of speech of a word according to any one of claims 1-7.

10. An application program/computer program product, wherein when the instructions in the storage medium are executed by a processor of a mobile terminal, the mobile terminal is enabled to perform the method for determining the part of speech of a word according to any one of claims 1-7.