CN107832301B

CN107832301B - Word segmentation processing method and device, mobile terminal and computer readable storage medium

Info

Publication number: CN107832301B
Application number: CN201711175299.4A
Authority: CN
Inventors: 肖求根; 郑利群; 詹金波; 邓卓彬; 何径舟
Original assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Current assignee: Beijing Baidu Netcom Science and Technology Co Ltd
Priority date: 2017-11-22
Filing date: 2017-11-22
Publication date: 2021-09-17
Anticipated expiration: 2037-11-22
Also published as: CN107832301A

Abstract

The invention provides a word segmentation processing method, a word segmentation processing device, a mobile terminal and a computer-readable storage medium. The method comprises the following steps: when a sentence to be segmented is obtained, determining a target language type corresponding to the sentence to be segmented; according to the target language type, respectively acquiring a first feature vector corresponding to each single character in the sentence to be segmented, a second feature vector corresponding to each two-character phrase, and a third feature vector corresponding to each proper noun in the sentence to be segmented; determining a current fourth feature vector of each single character according to the first feature vector, the second feature vector and the third feature vector; and performing word segmentation on the sentence to be segmented according to a preset Chinese character label transition matrix and the current fourth feature vector of each single character. Word segmentation is thus performed according to the target language type of the sentence to be segmented, which improves segmentation accuracy for sentences of various language types, allows appropriate resources to be loaded on demand, saves storage space on the mobile terminal, and improves user experience.

Description

Word segmentation processing method and device, mobile terminal and computer readable storage medium

Technical Field

The present invention relates to the field of word segmentation processing technologies, and in particular, to a word segmentation processing method and apparatus, a mobile terminal, and a computer-readable storage medium.

Background

With the continuous development of computer technology, word segmentation technology has been widely applied in fields such as search engines, machine translation, speech synthesis and automatic summarization. Word segmentation technology refers to the technology of segmenting a sentence or a passage of text into words.

In the prior art, word segmentation is generally performed on a sentence to be segmented by using a statistics-based word segmentation model or a dictionary-based word segmentation model. However, current word segmentation models are usually obtained by training on a corpus of one specific language, so that accuracy is low and user experience is poor when they are used to segment sentences in other languages.

Disclosure of Invention

The present invention is directed to solving, at least to some extent, one of the technical problems in the related art.

Therefore, the invention provides a word segmentation processing method that performs word segmentation according to the target language type of the sentence to be segmented, improves segmentation accuracy for sentences of various language types, allows appropriate resources to be loaded on demand, saves storage space on the mobile terminal, and improves user experience.

The invention also provides a word segmentation processing device.

The invention also provides a mobile terminal.

The invention also provides a computer readable storage medium.

An embodiment of a first aspect of the present invention provides a word segmentation processing method, including: when a sentence to be segmented is obtained, determining a target language type corresponding to the sentence to be segmented; according to the target language type, respectively acquiring a first feature vector corresponding to each single character in the sentence to be segmented, a second feature vector corresponding to each two-character phrase, and a third feature vector corresponding to each proper noun in the sentence to be segmented; determining a current fourth feature vector of each single character according to the first feature vector, the second feature vector and the third feature vector; and performing word segmentation on the sentence to be segmented according to a preset Chinese character label transition matrix and the current fourth feature vector of each single character.

In the word segmentation processing method of the embodiment of the present invention, when a sentence to be segmented is obtained, the target language type corresponding to the sentence is determined first. Then, according to the target language type, a first feature vector corresponding to each single character, a second feature vector corresponding to each two-character phrase and a third feature vector corresponding to each proper noun in the sentence are respectively obtained. The current fourth feature vector of each single character is then determined according to the first, second and third feature vectors, and the sentence is segmented according to a preset Chinese character label transition matrix and the current fourth feature vector of each single character. Word segmentation is thus performed according to the target language type of the sentence to be segmented, which improves segmentation accuracy for sentences of various language types, allows appropriate resources to be loaded on demand, saves storage space on the mobile terminal, and improves user experience.

An embodiment of a second aspect of the present invention provides a word segmentation processing apparatus, including: a first determining module, configured to determine a target language type corresponding to a sentence to be segmented when the sentence to be segmented is obtained; a first obtaining module, configured to respectively obtain, according to the target language type, a first feature vector corresponding to each single character in the sentence to be segmented, a second feature vector corresponding to each two-character phrase, and a third feature vector corresponding to each proper noun in the sentence to be segmented; a second determining module, configured to determine the current fourth feature vector of each single character according to the first feature vector, the second feature vector and the third feature vector; and a first processing module, configured to perform word segmentation on the sentence to be segmented according to a preset Chinese character label transition matrix and the current fourth feature vector of each single character.

When the word segmentation processing device of the embodiment of the present invention obtains a sentence to be segmented, it first determines the target language type corresponding to the sentence. Then, according to the target language type, it respectively obtains a first feature vector corresponding to each single character, a second feature vector corresponding to each two-character phrase and a third feature vector corresponding to each proper noun in the sentence. It then determines the current fourth feature vector of each single character according to the first, second and third feature vectors, and segments the sentence according to the preset Chinese character label transition matrix and the current fourth feature vector of each single character. Word segmentation is thus performed according to the target language type of the sentence to be segmented, which improves segmentation accuracy for sentences of various language types, allows appropriate resources to be loaded on demand, saves storage space on the mobile terminal, and improves user experience.

An embodiment of a third aspect of the present invention provides a mobile terminal, including:

a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor implements the word segmentation processing method according to the first aspect when executing the program.

A fourth aspect of the present invention provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the word segmentation processing method according to the first aspect.

Drawings

The foregoing and/or additional aspects and advantages of the present invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:

FIG. 1 is a flow diagram of a method of word segmentation processing in accordance with one embodiment of the present invention;

FIG. 2 is a flow diagram of a method of word segmentation processing in accordance with another embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a word segmentation processing device according to an embodiment of the present invention;

FIG. 4 is a schematic structural diagram of a word segmentation processing device according to another embodiment of the present invention;

FIG. 5 is a schematic structural diagram of a mobile terminal according to an embodiment of the present invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein the same or similar reference numerals refer throughout to the same or similar elements or to elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary; they are intended to explain the invention and are not to be construed as limiting it.

Specifically, the embodiments of the present invention provide a word segmentation processing method that addresses the following problem in the prior art: word segmentation is usually performed with a statistics-based or dictionary-based word segmentation model, but such a model is usually obtained by training on a corpus of one specific language, so that accuracy is low and user experience is poor when it is used to segment sentences in other languages.

In the word segmentation processing method provided by the embodiment of the present invention, when a sentence to be segmented is obtained, the target language type corresponding to the sentence is determined first. Then, according to the target language type, a first feature vector corresponding to each single character, a second feature vector corresponding to each two-character phrase and a third feature vector corresponding to each proper noun in the sentence are respectively obtained. The current fourth feature vector of each single character is then determined according to the first, second and third feature vectors, and the sentence is segmented according to a preset Chinese character label transition matrix and the current fourth feature vector of each single character. Word segmentation is thus performed according to the target language type of the sentence to be segmented, which improves segmentation accuracy for sentences of various language types, allows appropriate resources to be loaded on demand, saves storage space on the mobile terminal, and improves user experience.

The word segmentation processing method provided by the embodiment of the invention is described in detail below with reference to the accompanying drawings.

FIG. 1 is a flowchart of a word segmentation processing method according to an embodiment of the present invention.

As shown in FIG. 1, the word segmentation processing method includes:

Step 101, when a sentence to be segmented is obtained, determining a target language type corresponding to the sentence to be segmented.

The word segmentation processing method provided by the embodiment of the present invention is executed by the word segmentation processing device provided by the embodiment of the present invention, and the device can be configured in any mobile terminal to perform word segmentation on a sentence to be segmented.

Specifically, the target language type corresponding to the sentence to be segmented can be determined in the following manner.

Step 101a, determining a feature vector of the sentence to be segmented.

The feature vector is used for representing the acquired features of the sentence to be segmented.

In a specific implementation, after the sentence to be segmented is obtained, its feature vector can be determined by any of several methods, such as Mel-frequency cepstral coefficients, linear prediction cepstral coefficients, or the Multimedia Content Description Interface.

Step 101b, determining the target language type to which the sentence to be segmented belongs according to the matching degree between the feature vector and each preset language type model.

Specifically, each language type model can be trained in advance on a large amount of historical corpora of the corresponding language. After the feature vector of the sentence to be segmented is determined, it can be input into each language type model and scored, and the language type corresponding to the model with the highest score, that is, the model that best matches the feature vector, is determined as the target language type to which the sentence to be segmented belongs.
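The scoring step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: each language type model is represented as a hypothetical callable that maps a feature vector to a score, and the toy `centroid_model` scorer and the language keys `"zh"` and `"en"` are invented for the example.

```python
from typing import Callable, Dict, Sequence

# Assumption: a language type model is any callable returning a match score.
LanguageModel = Callable[[Sequence[float]], float]


def detect_language(features: Sequence[float],
                    models: Dict[str, LanguageModel]) -> str:
    """Return the language type whose model scores the feature vector highest."""
    if not models:
        raise ValueError("no language type models configured")
    return max(models, key=lambda language: models[language](features))


def centroid_model(centroid: Sequence[float]) -> LanguageModel:
    """Toy scorer (hypothetical): negative squared distance to a centroid."""
    def score(features: Sequence[float]) -> float:
        return -sum((f - c) ** 2 for f, c in zip(features, centroid))
    return score


# Made-up models and feature vectors, for illustration only.
models = {"zh": centroid_model([1.0, 0.0]), "en": centroid_model([0.0, 1.0])}
target_language = detect_language([0.9, 0.2], models)
```

Any trained classifier could stand in for the toy scorer; the only property the sketch relies on is that a higher score means a better match.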

Step 102, according to the target language type, respectively obtaining a first feature vector corresponding to each single character in the sentence to be segmented, a second feature vector corresponding to each two-character phrase, and a third feature vector corresponding to each proper noun in the sentence to be segmented.

A single character is the minimum unit of division when the sentence to be segmented is segmented. For example, when the target language type is Chinese, a single character may be one Chinese character; when the target language type is English, a single character may be one word.

A proper noun is a term that denotes a unique entity and should not be split, such as the name of a person, a place, an object or an organization, for example "Himalayas" or "Zhuge Liang". It should be noted that idioms, encyclopedia entry names and the like can also be treated as proper nouns.

Specifically, different proper noun dictionaries corresponding to different language types can be preset, each proper noun dictionary comprises a plurality of proper nouns, so that after the target language type corresponding to the sentence to be segmented is determined, the sentence to be segmented can be matched with the proper noun dictionary corresponding to the target language type, and a word matched with any proper noun in the proper noun dictionary corresponding to the target language type in the sentence to be segmented is determined as the proper noun.
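As a rough sketch of the dictionary matching described above, the following Python function scans the sentence for entries of a proper noun dictionary. The source does not say how overlapping matches are resolved; greedy longest match from left to right is an assumption made here, as are the function name, the `max_len` parameter, and the pinyin units used as sample data.

```python
from typing import List, Set, Tuple


def find_proper_nouns(units: List[str],
                      lexicon: Set[Tuple[str, ...]],
                      max_len: int = 8) -> List[Tuple[int, int]]:
    """Return (start, end) spans of units that match a lexicon entry.

    Assumption: greedy longest match, scanning left to right.
    """
    spans = []
    i = 0
    while i < len(units):
        matched = 0
        # Try the longest candidate first, then shorter ones.
        for n in range(min(max_len, len(units) - i), 0, -1):
            if tuple(units[i:i + n]) in lexicon:
                matched = n
                break
        if matched:
            spans.append((i, i + matched))
            i += matched
        else:
            i += 1
    return spans


# Hypothetical Chinese proper noun dictionary, written in pinyin units.
zh_lexicon = {("xi", "ma", "la", "ya", "shan"), ("tian", "an", "men")}
sentence = ["wo", "deng", "xi", "ma", "la", "ya", "shan"]
proper_spans = find_proper_nouns(sentence, zh_lexicon)
```

Here `proper_spans` holds half-open index ranges, so `sentence[2:7]` is the matched proper noun.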

In the embodiment of the present invention, the first feature vector corresponding to a single character represents the weights of the label of that single character being a beginning character, a middle character, an ending character and a single-character word, respectively, and may be a 4-dimensional feature vector. The second feature vector corresponding to a two-character phrase represents, for each single character in the phrase when combined with the other single character, the weights of its label being a beginning character, a middle character, an ending character and a single-character word, respectively, and may be an 8-dimensional feature vector. The third feature vector corresponding to a proper noun represents, for each single character in the proper noun when combined with the other single characters in the proper noun, the weights of its label being a beginning character, a middle character, an ending character and a single-character word, respectively, and its dimension depends on the number of single characters in the proper noun.

In a specific implementation, the emission matrix dictionary corresponding to the target language type can be queried to obtain the first feature vector corresponding to each single character, the second feature vector corresponding to each two-character phrase and the third feature vector corresponding to each proper noun in the sentence to be segmented.

Correspondingly, before step 102, the method may further include:

Obtaining an emission matrix dictionary corresponding to the target language type.

In the embodiment of the present invention, the emission matrix dictionaries corresponding to different language types can be obtained by training a structured perceptron on training corpora of the respective language types. The training corpus may be obtained by manually labeling a large amount of corpora, or by segmenting a large amount of corpora with a statistics-based unsupervised word segmentation model or another word segmentation model with high segmentation accuracy, which is not limited herein.

Specifically, the emission matrix dictionary may include the feature vector of each single character appearing in its training corpus, where the feature vector of each single character may be a 4-dimensional feature vector. The emission matrix dictionary may also include the feature vectors of two-character phrases appearing in its training corpus, where the feature vector of each two-character phrase may be an 8-dimensional feature vector. The emission matrix dictionary may also include the feature vectors of proper nouns appearing in its training corpus. After the emission matrix dictionary corresponding to the target language type is obtained, it can be queried to obtain the first feature vector corresponding to each single character, the second feature vector corresponding to each two-character phrase and the third feature vector corresponding to each proper noun in the sentence to be segmented.

It should be noted that, if the emission matrix dictionary corresponding to the target language type does not include the feature vector corresponding to a certain two-character phrase or proper noun in the sentence to be segmented, the second feature vector corresponding to that two-character phrase or the third feature vector corresponding to that proper noun may be set to 0.
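The query with zero fallback might look like the sketch below. The storage layout (tuples of units as keys, flat lists of 4 weights per unit as values), the `lookup` name and all weight values are assumptions for illustration; the source does not specify a data format. The source describes the zero fallback only for two-character phrases and proper nouns, so applying it to single units as well is a further assumption.

```python
from typing import Dict, List, Tuple

Key = Tuple[str, ...]


def lookup(emission: Dict[Key, List[float]], units: Key) -> List[float]:
    """Return the stored feature vector, or an all-zero vector if absent.

    Each unit contributes 4 weights (beginning, middle, ending, single),
    so an n-unit key maps to a vector of length 4 * n.
    """
    return emission.get(units, [0.0] * (4 * len(units)))


# Hypothetical emission matrix dictionary with made-up weights.
zh_emission: Dict[Key, List[float]] = {
    ("xi",): [0.6, 0.1, 0.1, 0.2],
    ("xi", "huan"): [0.9, 0.0, 0.05, 0.05, 0.0, 0.05, 0.9, 0.05],
}

first_vector = lookup(zh_emission, ("xi",))
second_vector = lookup(zh_emission, ("xi", "huan"))
missing_vector = lookup(zh_emission, ("huan", "ni"))  # not in the dictionary
```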

In a specific example, the 4-dimensional feature vector of the character "wo" (我, "I") is [A1 A2 A3 A4], where A1 to A4 are the weights, in the Chinese training corpus, of the label of "wo" being a beginning character, a middle character, an ending character and a single-character word, respectively, and the weights sum to 1.

In a specific example, the 8-dimensional feature vector of the two-character phrase "xihuan" (喜欢, "like") is [B1 B2 B3 B4 B5 B6 B7 B8]. B1 to B4 are the weights, in the Chinese training corpus, of the label of the character "xi" (喜) in the phrase "xihuan" being a beginning character, a middle character, an ending character and a single-character word, respectively; B5 to B8 are the corresponding weights for the character "huan" (欢) in the phrase "xihuan". The weights corresponding to each single character sum to 1, so the weights in the two-character feature vector sum to 2.

In a specific example, the 12-dimensional three-character feature vector of the proper noun "Tiananmen" (天安门) is [B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 B11 B12]. B1 to B4 are the weights, in the Chinese training corpus, of the label of the character "tian" (天) in "Tiananmen" being a beginning character, a middle character, an ending character and a single-character word, respectively; B5 to B8 are the corresponding weights for the character "an" (安); and B9 to B12 are the corresponding weights for the character "men" (门, "gate"). The weights corresponding to each single character sum to 1, so the weights in the three-character feature vector sum to 3.

It should be noted that, when the sentence to be segmented is processed, only the proper noun dictionary and the emission matrix dictionary whose language type matches that of the sentence return a processing result to the word segmentation processing device; the other proper noun dictionaries and emission matrix dictionaries return no result or an empty result.

Therefore, in this embodiment, when the sentence to be segmented is obtained, it may be matched against the proper noun dictionaries of the different language types at the same time, and the emission matrix dictionaries of the different language types may be queried at the same time. The target language type corresponding to the sentence is then determined from the language type of the proper noun dictionary that returns a matching result or of the emission matrix dictionary that returns a query result, and the proper nouns in the sentence, the first feature vector corresponding to each single character, the second feature vector corresponding to each two-character phrase and the third feature vector corresponding to each proper noun are determined from the returned results.

With this approach, it is not necessary to first determine the target language type after the sentence to be segmented is obtained and only then determine each feature vector according to it. Instead, the target language type is determined at the same time as the feature vectors, which improves the speed of word segmentation.
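One way to picture this combined lookup is the Python sketch below, which queries every language's emission matrix dictionary and takes the language that answers. The function name, the dictionary layout and the sample weights are hypothetical. The source assumes only the matching language returns a result; choosing the language with the most hits when several answer is a tie-breaking assumption added here.

```python
from typing import Dict, List, Optional, Tuple

Key = Tuple[str, ...]
Emission = Dict[Key, List[float]]


def query_all_languages(
    units: List[str],
    emission_by_language: Dict[str, Emission],
) -> Tuple[Optional[str], Emission]:
    """Query each language's dictionary; return the language that answers
    together with the first feature vectors it returned."""
    results: Dict[str, Emission] = {}
    for language, emission in emission_by_language.items():
        found = {(u,): emission[(u,)] for u in units if (u,) in emission}
        if found:  # dictionaries of other languages return nothing
            results[language] = found
    if not results:
        return None, {}
    # Assumption: if several languages answer, prefer the one with most hits.
    best = max(results, key=lambda language: len(results[language]))
    return best, results[best]


# Made-up resources for two language types.
resources = {
    "zh": {("wo",): [0.1, 0.0, 0.1, 0.8], ("deng",): [0.2, 0.1, 0.2, 0.5]},
    "en": {("I",): [0.0, 0.0, 0.0, 1.0]},
}
language, vectors = query_all_languages(["wo", "deng"], resources)
```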

Step 103, determining a current fourth feature vector of each single character according to the first feature vector, the second feature vector and the third feature vector.

The fourth feature vector may be a 4-dimensional feature vector, and the sum of the weights is 1.

Specifically, the current fourth feature vector of each single character can be obtained by linearly superimposing the first feature vector, the second feature vector and the third feature vector.

It should be noted that the dimensions of the first feature vector, the second feature vector and the third feature vector may differ, and the dimension of the first feature vector is usually smaller than those of the second and third feature vectors. Therefore, when the three are linearly superimposed, a feature vector with the same dimension as the first feature vector may be extracted from each of the second and third feature vectors and then linearly superimposed with the first feature vector to obtain the fourth feature vector. When a feature vector is extracted from the second or third feature vector, the extraction depends on the position of the single character within the two-character phrase or the proper noun.

For example, consider extracting from the second feature vector a feature vector with the same dimension as the first feature vector. Assume that the sentence to be segmented includes "xihuan" (喜欢, "like"), that the second feature vector of "xihuan" is [B1 B2 B3 B4 B5 B6 B7 B8], and that the first feature vector of the character "xi" (喜) is [B9 B10 B11 B12]. Since B1 to B4 are the weights of the label of "xi" in the phrase "xihuan" being a beginning character, a middle character, an ending character and a single-character word, respectively, [B1 B2 B3 B4] can be extracted from the second feature vector for determining the fourth feature vector of "xi".
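The position-dependent extraction can be expressed as a simple slice. The sketch below assumes the vector is stored as a flat list with 4 consecutive weights per unit, in the order of the units in the phrase; the function name is hypothetical and the numbers 1 to 12 merely stand in for B1 to B12.

```python
from typing import List


def slice_for_position(vector: List[float], position: int) -> List[float]:
    """Return the 4 weights belonging to the unit at `position` (0-based)
    within a two-character or proper-noun feature vector."""
    if len(vector) % 4 != 0:
        raise ValueError("vector length must be a multiple of 4")
    if not 0 <= position < len(vector) // 4:
        raise IndexError("position outside the phrase")
    return vector[4 * position: 4 * position + 4]


# Placeholders standing in for [B1 ... B8] and [B1 ... B12].
second = [1, 2, 3, 4, 5, 6, 7, 8]
third = list(range(1, 13))

first_unit = slice_for_position(second, 0)   # weights B1..B4
last_of_three = slice_for_position(third, 2)  # weights B9..B12
```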

In addition, since the weights in the fourth feature vector sum to 1, the linear superposition of the first, second and third feature vectors may be performed by multiplying each feature vector by a preset weight and then adding the values at corresponding positions; alternatively, the three feature vectors may be linearly superimposed first and the result normalized so that the weights in the generated fourth feature vector sum to 1. The preset weights may be set as needed, provided that the weights in the resulting fourth feature vector sum to 1.
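The normalization alternative can be sketched as below: the 4-dimensional vectors are added position by position and the result is divided by its total. The function name and the input values are made up, and returning the unnormalized all-zero vector when the total is 0 is an assumption, since the source does not cover that case.

```python
from typing import List, Sequence


def superimpose_and_normalize(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Add 4-dimensional vectors position-wise, then scale the sum to total 1."""
    summed = [sum(column) for column in zip(*vectors)]
    total = sum(summed)
    if total == 0:
        return summed  # assumption: leave an all-zero result unchanged
    return [value / total for value in summed]


# Made-up first vector and extracted slices of the second and third vectors.
fourth = superimpose_and_normalize([
    [0.1, 0.1, 0.6, 0.2],
    [0.0, 0.1, 0.9, 0.0],
    [0.0, 0.0, 1.0, 0.0],
])
```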

For example, assume that the sentence to be segmented includes the proper noun "Tiananmen" (天安门), where the first feature vector corresponding to the character "men" (门, "gate") is [C1 C2 C3 C4], the second feature vector corresponding to "anmen" (安门) is [D1 D2 D3 D4 D5 D6 D7 D8], and the third feature vector corresponding to "Tiananmen" is [B1 B2 B3 B4 B5 B6 B7 B8 B9 B10 B11 B12].

Since D5 to D8 are the weights of the label of "men" in the phrase "anmen" being a beginning character, a middle character, an ending character and a single-character word, respectively, and B9 to B12 are the corresponding weights of the label of "men" in "Tiananmen", [D5 D6 D7 D8] can be extracted from the second feature vector and [B9 B10 B11 B12] from the third feature vector, and the current fourth feature vector [E1 E2 E3 E4] of "men" can then be obtained from [C1 C2 C3 C4], [D5 D6 D7 D8] and [B9 B10 B11 B12].

Here, E1 = C1 × 0.4 + D5 × 0.3 + B9 × 0.3, E2 = C2 × 0.4 + D6 × 0.3 + B10 × 0.3, E3 = C3 × 0.4 + D7 × 0.3 + B11 × 0.3, and E4 = C4 × 0.4 + D8 × 0.3 + B12 × 0.3, where 0.4, 0.3 and 0.3 are the preset weights.
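The weighted superposition with preset weights 0.4, 0.3 and 0.3 can be checked numerically with a few lines of Python. The function name and the concrete weight values below are invented for illustration; they are chosen so that each input sums to 1, which makes the result sum to 1 as well because the preset weights themselves sum to 1.

```python
from typing import List, Sequence, Tuple


def combine(first: Sequence[float],
            second_slice: Sequence[float],
            third_slice: Sequence[float],
            preset: Tuple[float, float, float] = (0.4, 0.3, 0.3)) -> List[float]:
    """Compute E_k = C_k * w1 + D_k * w2 + B_k * w3 for each of the 4 labels."""
    w1, w2, w3 = preset
    return [w1 * c + w2 * d + w3 * b
            for c, d, b in zip(first, second_slice, third_slice)]


# Made-up values standing in for [C1..C4], [D5..D8] and [B9..B12].
c = [0.1, 0.1, 0.6, 0.2]
d = [0.0, 0.1, 0.9, 0.0]
b = [0.0, 0.0, 1.0, 0.0]
e = combine(c, d, b)
```

With these inputs the ending-character weight dominates, which is what one would expect for the last character of a three-character proper noun.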

It should be noted that, in the sentence to be segmented, a single character may or may not be contained in a proper noun. For example, when the sentence to be segmented is "wo deng ximalayashan" (我登喜马拉雅山, "I climb the Himalayas"), "wo" (我) and "deng" (登) are not contained in the proper noun "ximalayashan" (喜马拉雅山, "Himalayas"), whereas "xi" (喜), "ma" (马), "la" (拉), "ya" (雅) and "shan" (山) are. In the embodiment of the present invention, if a single character is not contained in a proper noun, its current fourth feature vector is determined using only the first feature vector corresponding to that single character and the second feature vectors corresponding to the two-character phrases containing it. Only when a single character is contained in a proper noun is its current fourth feature vector determined using the first feature vector corresponding to that single character, the second feature vectors corresponding to the two-character phrases containing it, and the third feature vector of the proper noun containing it.

In addition, the first single character in the sentence to be segmented can form a two-character phrase only with the second single character, and the last single character can form a two-character phrase only with the single character before it, whereas every other single character, that is, a single character in a middle position, can form two-character phrases with both the single character before it and the single character after it. Therefore, when the current fourth feature vector of each single character is determined according to the second feature vector, the first and last characters are each based on one second feature vector, and each middle character is based on two second feature vectors.

That is, when the current fourth feature vector of each single character is determined, all feature vectors associated with that single character are referenced.

For example, assume that the sentence to be segmented is "wo deng ximalayashan" (我登喜马拉雅山, "I climb the Himalayas"), in which "ximalayashan" (喜马拉雅山) is a proper noun. The current fourth feature vector of "wo" (我) is determined from the first feature vector of "wo" and the second feature vector of "wo deng" (我登). The current fourth feature vector of "deng" (登) is determined from the first feature vector of "deng", the second feature vector of "wo deng" and the second feature vector of "deng xi" (登喜). The current fourth feature vector of "xi" (喜) is determined from the first feature vector of "xi", the second feature vector of "deng xi", the second feature vector of "xi ma" (喜马) and the third feature vector of "ximalayashan".
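The rule for which feature vectors feed each fourth feature vector can be enumerated mechanically. The Python sketch below lists, for every position, the dictionary keys involved; the function name, the tuple-key convention and the half-open span format for proper nouns are assumptions made for illustration.

```python
from typing import List, Tuple

Key = Tuple[str, ...]


def associated_keys(units: List[str],
                    proper_spans: List[Tuple[int, int]]) -> List[List[Key]]:
    """For each unit, list the keys of all feature vectors associated with it:
    the unit itself, its left and right two-unit phrases, and any proper noun
    (given as a half-open (start, end) span) that contains it."""
    result = []
    for i, unit in enumerate(units):
        keys: List[Key] = [(unit,)]
        if i > 0:
            keys.append((units[i - 1], unit))
        if i < len(units) - 1:
            keys.append((unit, units[i + 1]))
        for start, end in proper_spans:
            if start <= i < end:
                keys.append(tuple(units[start:end]))
        result.append(keys)
    return result


# Pinyin units of the example sentence; the proper noun covers positions 2..6.
units = ["wo", "deng", "xi", "ma", "la", "ya", "shan"]
keys_per_unit = associated_keys(units, [(2, 7)])
```

The first and last units get one two-unit key each and middle units get two, matching the description above.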

As can be understood by those skilled in the art, training a structured perceptron on corpora of different language types yields not only the feature vectors of single characters and of two-character phrases in those corpora, but also multi-character feature vectors such as three-character and four-character feature vectors. Since determining the current fourth feature vector of each single character requires referring to the values of all feature vectors associated with that single character, too many feature vectors would greatly reduce the speed of word segmentation. Therefore, in this embodiment, to balance calculation speed and calculation accuracy, the current fourth feature vector of each single character may be determined only according to the first feature vector corresponding to each single character, the second feature vector corresponding to each two-character phrase, and the third feature vector corresponding to each proper noun in the sentence to be segmented.

Step 104, performing word segmentation on the sentence to be segmented according to the preset Chinese character label transition matrix and the current fourth feature vector of each single character.

Different language types correspond to different Chinese character label transition matrices, which can be obtained by training a structured perceptron on training corpora of the respective language types. The training corpus may be obtained by manually labeling a large amount of corpora, or by segmenting a large amount of corpora with a statistics-based unsupervised word segmentation model or another word segmentation model with high segmentation accuracy, which is not limited herein.

Specifically, the Chinese character label transfer matrix is a 4 × 4 matrix whose values indicate the transition probabilities between Chinese character labels. There are four Chinese character labels: beginning character, middle character, ending character and single character phrase, denoted b, m, e and s respectively. The four rows of the matrix correspond, from top to bottom, to the beginning character, middle character, ending character and single character phrase, and the four columns correspond, from left to right, to the same four labels in the same order. For example, the value in the second row and fourth column of the matrix represents the probability of transitioning from a "middle character" to a "single character phrase".
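The row and column convention described above can be illustrated with a small lookup sketch. The matrix values below are hypothetical and chosen only so that each row sums to 1; the helper name `transition_weight` is likewise an assumption of this sketch.

```python
LABELS = ["b", "m", "e", "s"]  # row and column order of the matrix

# Hypothetical 4 x 4 label transfer matrix: TRANSITION[i][j] is the weight of
# moving from label LABELS[i] (row) to label LABELS[j] (column).
TRANSITION = [
    [0.0, 0.4, 0.6, 0.0],  # from b
    [0.0, 0.3, 0.7, 0.0],  # from m
    [0.5, 0.0, 0.0, 0.5],  # from e
    [0.5, 0.0, 0.0, 0.5],  # from s
]


def transition_weight(prev_label: str, next_label: str) -> float:
    """Look up the entry for prev_label (row) followed by next_label (column)."""
    return TRANSITION[LABELS.index(prev_label)][LABELS.index(next_label)]


# Second row, fourth column: "middle character" followed by "single character phrase".
print(transition_weight("m", "s"))
```

In this hypothetical matrix the m-to-s entry is zero, because a middle character cannot be directly followed by a new single character phrase without an ending character in between.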

Specifically, Markov decoding is performed on the Chinese character label transfer matrix corresponding to the target language type and the current fourth feature vector of each single character to determine the sequence labeling result corresponding to the sentence to be segmented, and the sentence to be segmented is then segmented according to that sequence labeling result.
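The embodiment does not spell out the decoding procedure. One common way to decode a label sequence from per-character label weights and a label transfer matrix is Viterbi dynamic programming, sketched below as an assumption rather than as the method of the embodiment. Scores are additive, forbidden transitions are given a large negative weight, and all numbers are hypothetical. No constraint is placed on the first or last label in this sketch.

```python
from typing import List

LABELS = ["b", "m", "e", "s"]


def viterbi(emissions: List[List[float]], transition: List[List[float]]) -> List[str]:
    """Return the label sequence with the highest total additive score."""
    n = len(emissions)
    if n == 0:
        return []
    k = len(LABELS)
    score = [emissions[0][:]]          # best score of any path ending in each label
    back: List[List[int]] = [[0] * k]  # back-pointers to the previous label
    for t in range(1, n):
        row, ptr = [], []
        for j in range(k):
            best_i = max(range(k), key=lambda i: score[t - 1][i] + transition[i][j])
            row.append(score[t - 1][best_i] + transition[best_i][j] + emissions[t][j])
            ptr.append(best_i)
        score.append(row)
        back.append(ptr)
    last = max(range(k), key=lambda j: score[-1][j])
    path = [last]
    for t in range(n - 1, 0, -1):
        last = back[t][last]
        path.append(last)
    return [LABELS[i] for i in reversed(path)]


NEG = -100.0  # hypothetical penalty for impossible transitions
transition = [
    [NEG, 0.0, 0.0, NEG],  # b -> m or e
    [NEG, 0.0, 0.0, NEG],  # m -> m or e
    [0.0, NEG, NEG, 0.0],  # e -> b or s
    [0.0, NEG, NEG, 0.0],  # s -> b or s
]
# Hypothetical fourth feature vectors for a three-character sentence.
emissions = [
    [0.0, 0.0, 0.0, 2.0],
    [2.0, 0.0, 0.0, 1.0],
    [0.0, 0.0, 2.0, 1.0],
]
tags = viterbi(emissions, transition)
print(tags)
```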

It should be noted that, in the embodiment of the present invention, the resources corresponding to different language types, such as the proper noun dictionary, the emission matrix dictionary and the Chinese character label transfer matrix, may be provided separately, and only the resources corresponding to a particular language type are loaded in the mobile terminal as needed, thereby saving the storage space of the mobile terminal.

In practical application, resources corresponding to other language types can be loaded in the mobile terminal according to needs. For example, whether resources such as proper noun dictionaries, emission matrix dictionaries, Chinese character label transfer matrices and the like corresponding to other language types are loaded or not can be determined according to the position information of the mobile terminal or the touch operation of the user, and then word segmentation processing is performed according to the loaded resources.

For example, assume that the initial language type corresponding to the word segmentation processing device in the mobile terminal is the Chinese type. If, during use, the mobile terminal is determined to be located in the United States, it can be inferred that the user may need English-type word segmentation, and the resources corresponding to the English type, such as the proper noun dictionary, the emission matrix dictionary and the Chinese character label transfer matrix, can be loaded so that word segmentation can be performed when an English-type sentence to be segmented is obtained.

In addition, in the embodiment of the present invention, when the user does not need the resources corresponding to other language types, the resources corresponding to other language types may be removed according to the location information of the mobile terminal or the touch operation of the user, so as to save the storage space of the mobile terminal.
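The on-demand loading and removal of per-language resources described above might be organized as in the following sketch. The class `ResourceManager`, the loader `fake_loader`, the resource keys and the region-to-language table are all hypothetical names introduced for illustration; a real terminal would read the resources from storage or a network.

```python
from typing import Callable, Dict, List

Resources = Dict[str, object]


class ResourceManager:
    """Keeps only the language resources that are currently needed in memory."""

    def __init__(self, loader: Callable[[str], Resources]) -> None:
        self._loader = loader
        self._loaded: Dict[str, Resources] = {}

    def ensure_loaded(self, language: str) -> Resources:
        if language not in self._loaded:
            self._loaded[language] = self._loader(language)
        return self._loaded[language]

    def unload(self, language: str) -> None:
        self._loaded.pop(language, None)  # no error if it was never loaded

    def loaded_languages(self) -> List[str]:
        return sorted(self._loaded)


def fake_loader(language: str) -> Resources:
    # Stand-in for reading the three resource kinds named in the description.
    return {
        "proper_noun_dictionary": {},
        "emission_matrix_dictionary": {},
        "label_transfer_matrix": [[0.0] * 4 for _ in range(4)],
    }


REGION_TO_LANGUAGE = {"CN": "zh", "US": "en"}  # hypothetical mapping

manager = ResourceManager(fake_loader)
manager.ensure_loaded("zh")                      # initial language type
manager.ensure_loaded(REGION_TO_LANGUAGE["US"])  # terminal detected in the United States
loaded_abroad = manager.loaded_languages()
manager.unload("en")                             # no longer needed
```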

The word segmentation processing method provided by the embodiment of the invention is further explained with reference to fig. 2.

Fig. 2 is a flowchart of a word segmentation processing method according to another embodiment of the present invention.

As shown in fig. 2, the method includes:

Step 201, normalizing each character included in the sentence to be segmented to determine the Chinese character label to which each single character belongs.

It will be appreciated that the types of characters included in the sentence to be segmented may differ. For example, the sentence to be segmented may include both Chinese characters and English characters, both simplified and traditional characters, or both full-width and half-width characters, and so on. In the embodiment of the present invention, the characters included in the sentence to be segmented may first be normalized so that they are all of the same type, and the subsequent word segmentation is then performed. Normalizing the characters in the sentence to be segmented improves the accuracy and reliability of the word segmentation result.

It should be noted that, when each character included in the sentence to be segmented is normalized, the character types may be unified into the type to which most characters in the sentence to be segmented belong.
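As one simplified illustration of character normalization, full-width characters can be folded to their half-width equivalents with Unicode NFKC normalization from the Python standard library. This sketch is an assumption: it always normalizes toward the compatibility form instead of toward the majority character type described above, and it does not convert between simplified and traditional characters. The function name `normalize_characters` is hypothetical.

```python
import unicodedata


def normalize_characters(sentence: str) -> str:
    """Fold compatibility variants (e.g. full-width letters and digits) with NFKC."""
    return unicodedata.normalize("NFKC", sentence)


mixed = "ＡＢＣ　１２３"  # full-width letters, ideographic space, full-width digits
normalized = normalize_characters(mixed)
print(normalized)
```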

Step 202, determining a target language type corresponding to the sentence to be segmented.

The detailed implementation process and principle of the step 202 may refer to the detailed description of the above embodiments, which is not repeated herein.

Step 203, determining a third feature vector of the proper noun with the confidence coefficient larger than the threshold value in the sentence to be segmented.

Step 204, acquiring a first feature vector corresponding to each single character other than those of the proper noun, and a second feature vector corresponding to two characters, in the sentence to be segmented by querying the emission matrix dictionary corresponding to the target language type.

Step 205, determining a current fourth feature vector of each single word in the sentence to be segmented according to the first feature vector, the second feature vector and/or the third feature vector.

It should be noted that a proper noun included in the sentence to be segmented may be a word whose boundary is reliable, i.e., stable, such as "foolish mountain", or a word whose boundary is easily affected, i.e., unstable, such as "identity card". If the current fourth feature vector of each single character is determined using the third feature vector of a proper noun with an unstable boundary and the sentence to be segmented is then segmented on that basis, the accuracy of the word segmentation result may suffer. Therefore, to improve the accuracy and reliability of the word segmentation result, in the embodiment of the present invention, only the third feature vectors of proper nouns with stable boundaries may be used when determining the current fourth feature vector of each single character.

In a specific implementation, specific nouns, idioms and other words with stable boundaries can be labeled in advance in the proper noun dictionaries of the different language types. The proper nouns with stable boundaries in the sentence to be segmented can then be determined by querying the proper noun dictionary corresponding to the target language type, their third feature vectors can be obtained at the same time, and the current fourth feature vector of each single character can be determined using those third feature vectors for word segmentation.

Alternatively, a preset proper noun recognition model can be used to recognize the proper nouns in the sentence to be segmented and output their confidences, from which the proper nouns with stable boundaries are determined. The current fourth feature vector of each single character is then determined using the third feature vectors of the proper nouns with stable boundaries for word segmentation.

The confidence coefficient is used for representing the stability degree of the boundary of the proper noun.

Specifically, a confidence threshold can be preset, and the confidence of each proper noun in the proper noun dictionaries of the different language types can be determined in advance. By querying the proper noun dictionary corresponding to the target language type, the proper nouns in the sentence to be segmented whose confidence is higher than the preset threshold can be determined together with their third feature vectors, and the current fourth feature vector of each single character in the sentence to be segmented is then determined using those third feature vectors for word segmentation.
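The threshold test described above amounts to a filtered dictionary lookup, sketched below. The dictionary layout (noun mapped to a confidence and a third feature vector), the function name `stable_nouns`, the example entries and the threshold value are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: proper noun -> (boundary confidence, third feature vector)
ProperNounDict = Dict[str, Tuple[float, List[float]]]


def stable_nouns(sentence: str, dictionary: ProperNounDict,
                 threshold: float) -> Dict[str, List[float]]:
    """Return third feature vectors of dictionary nouns that occur in the
    sentence and whose confidence is strictly greater than the threshold."""
    return {
        noun: vector
        for noun, (confidence, vector) in dictionary.items()
        if confidence > threshold and noun in sentence
    }


dictionary: ProperNounDict = {
    "Himalayas": (0.95, [1.0, 1.0, 1.0, 0.0]),      # stable boundary
    "identity card": (0.40, [1.0, 1.0, 1.0, 0.0]),  # unstable boundary
}
sentence = "my identity card was checked near the Himalayas"
selected = stable_nouns(sentence, dictionary, threshold=0.8)
print(sorted(selected))
```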

For the detailed implementation process and principle of steps 204 and 205, reference may be made to the detailed description of the above embodiments, which is not repeated herein.

It should be noted that, since the boundary of a proper noun whose confidence is greater than the threshold is stable, the current fourth feature vector of each single character included in that proper noun may be determined from the third feature vector alone, while the current fourth feature vectors of the other single characters may be determined from the first feature vector of each single character and the second feature vectors of two characters. Therefore, when the first feature vectors and second feature vectors are obtained, only the first feature vector of each single character outside the proper noun and the second feature vector of each two-character combination outside the proper noun need to be obtained.

For example, assume that the sentence to be segmented is "我登喜马拉雅山" ("I climb the Himalayas"), in which "喜马拉雅山" ("Himalayas") is a proper noun. The current fourth feature vector of the character "我" can be determined from the first feature vector corresponding to "我" and the second feature vector corresponding to "我登". The current fourth feature vector of the character "登" can be determined from the first feature vector corresponding to "登" and the second feature vectors corresponding to "我登" and "登喜". The current fourth feature vector of the character "喜" can be determined from the third feature vector corresponding to "喜马拉雅山", and the current fourth feature vector of the character "马" can likewise be determined from the third feature vector corresponding to "喜马拉雅山".
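The selection rule of this embodiment can be sketched as follows: characters inside a stable proper noun take their vector from the third feature vector only, and all other characters sum their single-character vector with the vectors of the adjacent two-character combinations. The sketch simplifies by storing one vector per two-character key and one per-character list for each proper noun; the placeholder sentence "abXYZ", the function names and all weights are hypothetical.

```python
from typing import Dict, List, Tuple

Vector = List[float]
ZERO: Vector = [0.0, 0.0, 0.0, 0.0]


def add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def fourth_vectors(sentence: str,
                   unigram: Dict[str, Vector],
                   bigram: Dict[str, Vector],
                   nouns: List[Tuple[int, List[Vector]]]) -> List[Vector]:
    """nouns holds (start index, per-character third vectors) for stable nouns."""
    from_noun: Dict[int, Vector] = {}
    for start, vectors in nouns:
        for offset, vec in enumerate(vectors):
            from_noun[start + offset] = vec
    result: List[Vector] = []
    for i, ch in enumerate(sentence):
        if i in from_noun:  # inside a stable proper noun: third vector only
            result.append(list(from_noun[i]))
            continue
        total = list(unigram.get(ch, ZERO))
        if i > 0:  # combination with the previous character
            total = add(total, bigram.get(sentence[i - 1:i + 1], ZERO))
        if i + 1 < len(sentence):  # combination with the next character
            total = add(total, bigram.get(sentence[i:i + 2], ZERO))
        result.append(total)
    return result


sentence = "abXYZ"  # "XYZ" plays the role of the stable proper noun
unigram = {"a": [0.0, 0.0, 0.0, 1.0], "b": [0.0, 0.0, 0.0, 2.0]}
bigram = {"ab": [1.0, 0.0, 0.0, 0.0], "bX": [0.0, 0.0, 0.0, 1.0]}
nouns = [(2, [[5.0, 0.0, 0.0, 0.0], [0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0]])]
vectors = fourth_vectors(sentence, unigram, bigram, nouns)
```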

Step 206, performing Markov decoding processing according to the preset Chinese character label transfer matrix and the current fourth feature vector of each single character, and determining a sequence labeling result corresponding to the sentence to be segmented.

For the preset Chinese character label transfer matrix and the process of acquiring it, reference may be made to the description of the above embodiments, which is not repeated herein.

Specifically, the fourth feature vector is a 4-dimensional vector and the Chinese character label transfer matrix is a 4 × 4 matrix. Multiplying the current fourth feature vector of each single character by the Chinese character label transfer matrix corresponding to the target language type therefore yields a 4-dimensional label vector for that character. The sentence to be segmented can be sequence-labeled according to the label vector of each single character and then segmented according to the sequence labeling result.

The label vector represents, for each single character, the weights with which its label is a beginning character, a middle character, an ending character and a single character phrase, respectively.

For example, assume that the sentence to be segmented consists of the single characters a, b, c, d and e. After the fourth feature vector of each of a, b, c, d and e is multiplied by the Chinese character label transfer matrix to obtain its label vector, suppose that for a, b and e the weight of the single character phrase label is the largest, for c the weight of the beginning character label is the largest, and for d the weight of the ending character label is the largest. The sentence to be segmented is then labeled as "a/b/cd/e".
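The multiplication and labeling steps of this example can be sketched as below. The identity matrix stands in for a trained label transfer matrix, and the fourth feature vectors are invented so that the largest weights match the example; both are assumptions of this sketch, as are the function names.

```python
from typing import List

LABELS = ["b", "m", "e", "s"]


def label_vector(fourth: List[float], matrix: List[List[float]]) -> List[float]:
    """Row vector (1 x 4) times matrix (4 x 4) gives a 4-dimensional label vector."""
    return [sum(fourth[i] * matrix[i][j] for i in range(4)) for j in range(4)]


def best_label(vector: List[float]) -> str:
    """Label whose weight is the largest in the label vector."""
    return LABELS[max(range(4), key=lambda j: vector[j])]


def join_segments(chars: str, tags: List[str]) -> str:
    """Close a segment after every ending character or single character phrase."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("e", "s"):
            words.append(current)
            current = ""
    if current:  # unfinished trailing segment
        words.append(current)
    return "/".join(words)


IDENTITY = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
example_vectors = {  # hypothetical fourth feature vectors
    "a": [0.0, 0.0, 0.0, 3.0],
    "b": [1.0, 0.0, 0.0, 2.0],
    "c": [4.0, 0.0, 1.0, 0.0],
    "d": [0.0, 1.0, 5.0, 0.0],
    "e": [0.0, 0.0, 1.0, 2.0],
}
sentence = "abcde"
tags = [best_label(label_vector(example_vectors[ch], IDENTITY)) for ch in sentence]
result = join_segments(sentence, tags)
print(result)
```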

Step 207, correcting the sequence labeling result according to a preset proper noun dictionary and a segmentation rule.

The proper noun dictionaries of the different language types may be obtained by manually labeling a large number of corpora of the respective language types, or by using a classification model, which is not limited herein.

The segmentation rule specifies whether a particular entry in the sentence to be segmented is to be split. For example, it may be specified that, when the sentence to be segmented includes "https://www.", that string and the characters following it are not split, or that a floating point number included in the sentence to be segmented is not split, and so on.
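The two example rules can be expressed as regular expressions that mark spans which must not be split. The patterns and the function name `protected_spans` are assumptions of this sketch; in particular, "the characters following it" is interpreted here as the run of non-whitespace characters after "https://www.".

```python
import re
from typing import List, Tuple

# Hypothetical rule set: URLs starting with "https://www." and floating point numbers.
PROTECTED = re.compile(r"https://www\.\S*|\d+\.\d+")


def protected_spans(sentence: str) -> List[Tuple[int, int]]:
    """Return (start, end) index pairs of spans that must stay in one piece."""
    return [(m.start(), m.end()) for m in PROTECTED.finditer(sentence)]


sentence = "see https://www.example.com at 3.14 pm"
spans = protected_spans(sentence)
kept_whole = [sentence[start:end] for start, end in spans]
print(kept_whole)
```

A later correction pass could merge any segments of the sequence labeling result that fall inside one of these spans.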

It can be understood that, when the current fourth feature vector of each single character is determined using third feature vectors, only the third feature vectors of proper nouns whose boundaries are stable may have been used, so proper nouns with unstable boundaries may be labeled incorrectly in the sequence labeling result corresponding to the sentence to be segmented. For example, suppose the sentence to be segmented includes the proper nouns "Himalayas" and "identity card", where "Himalayas" has a stable boundary and "identity card" has an unstable boundary. If the fourth feature vectors are determined from the first feature vectors, the second feature vectors and the third feature vector of "Himalayas" only, then in the resulting sequence labeling result "Himalayas" is not split, but "identity card" may be split into "identity" and "card".

Therefore, in the embodiment of the invention, after the sequence labeling result corresponding to the sentence to be segmented is determined, the sequence labeling result can be corrected according to the preset proper noun dictionary corresponding to the target language type and the segmentation rule, so as to improve the accuracy of the segmentation result.

In the specific implementation, words with unstable boundaries in the proper noun dictionaries of different language types can be labeled, and if the words with unstable boundaries labeled in the proper noun dictionary are segmented in the sequence labeling result, the sequence labeling result can be corrected according to the preset proper noun dictionary corresponding to the target language type.

For example, if "identity card" is labeled as a proper noun with an unstable boundary in the Chinese-type proper noun dictionary and the sequence labeling result splits "identity card" into "identity" and "card", the sequence labeling result can be corrected according to the preset Chinese-type proper noun dictionary so that "identity card" is not split.

Or, the proper noun recognition model is used for recognizing the proper nouns and the corresponding confidence degrees thereof in the sentence to be segmented, and for the proper nouns with low confidence degrees, the sequence labeling result is corrected after the sequence labeling result is determined.

In addition, if the words which are specified by the segmentation rule and are not suitable for segmentation are segmented in the sequence labeling result, the sequence labeling result can be corrected according to the segmentation rule, so that the word segmentation processing result is more accurate and reliable.
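A dictionary-based correction pass of the kind described above might re-join adjacent segments whose concatenation is an entry that should not have been split. The following sketch uses a longest-match strategy, placeholder segments and the hypothetical function name `merge_split_entries`; none of these details are specified by the embodiment.

```python
from typing import List, Set


def merge_split_entries(words: List[str], keep_whole: Set[str]) -> List[str]:
    """Re-join runs of two or more adjacent segments that together form an
    entry of keep_whole, trying the longest run first."""
    result: List[str] = []
    i = 0
    while i < len(words):
        merged = False
        for j in range(len(words), i + 1, -1):  # j - i >= 2 segments
            candidate = "".join(words[i:j])
            if candidate in keep_whole:
                result.append(candidate)
                i = j
                merged = True
                break
        if not merged:
            result.append(words[i])
            i += 1
    return result


# "BCD" stands for a proper noun with an unstable boundary that was split.
labeled = ["A", "BC", "D", "E"]
corrected = merge_split_entries(labeled, {"BCD"})
print("/".join(corrected))
```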

It should be noted that, in the embodiment of the present invention, the proper noun dictionary corresponding to the target language type that is used to determine the current fourth feature vector of each single character and the one used to correct the sequence labeling result may be the same dictionary, in which words with stable and unstable boundaries are marked or the confidence of each proper noun is recorded. Alternatively, the two may be different proper noun dictionaries that respectively contain the proper nouns with stable boundaries and those with unstable boundaries, which is not limited in the present application.

In the word segmentation processing method of the embodiment of the invention, each character in the sentence to be segmented is first normalized, the Chinese character label to which each single character belongs is determined, and the target language type corresponding to the sentence to be segmented is determined. Next, by querying the emission matrix dictionary corresponding to the target language type, a first feature vector corresponding to each single character in the sentence to be segmented, a second feature vector corresponding to two characters, and a third feature vector corresponding to each proper noun whose confidence is higher than a threshold are obtained, and a current fourth feature vector of each single character is determined according to the first feature vector, the second feature vector and the third feature vector. Markov decoding is then performed according to the preset Chinese character label transfer matrix and the current fourth feature vector of each single character to determine the sequence labeling result corresponding to the sentence to be segmented, and finally the sequence labeling result is corrected according to the preset proper noun dictionary and the segmentation rule. Therefore, word segmentation of the sentence to be segmented is achieved according to its target language type, the accuracy of word segmentation for sentences of various language types is improved, appropriate resources can be loaded as needed, the storage space of the mobile terminal is saved, and user experience is improved.

Fig. 3 is a schematic structural diagram of a word segmentation processing device according to an embodiment of the present invention.

As shown in fig. 3, the word segmentation processing device includes:

the first determining module 31 is configured to determine a target language type corresponding to a sentence to be segmented when the sentence to be segmented is obtained;

the first obtaining module 32 is configured to obtain, according to the target language type, a first feature vector corresponding to each single word in the sentence to be segmented, a second feature vector corresponding to two words, and a third feature vector corresponding to a proper noun in the sentence to be segmented, respectively;

a second determining module 33, configured to determine a current fourth feature vector of each individual character according to the first feature vector, the second feature vector, and the third feature vector;

the first processing module 34 is configured to perform word segmentation on the sentence to be word segmented according to the preset Chinese character label transfer matrix and the current fourth feature vector of each single word.

Specifically, the word segmentation processing device provided by the embodiment of the present invention may execute the word segmentation processing method provided by the embodiment of the present invention, and the device may be configured in any mobile terminal to perform word segmentation processing on a to-be-segmented sentence.

In a possible implementation form of the embodiment of the present application, the first processing module 34 is specifically configured to:

and performing Markov decoding processing on the preset Chinese character label transfer matrix and the current fourth characteristic vector of each single word to determine a sequence labeling result corresponding to the sentence to be segmented.

In another possible implementation form of the embodiment of the present application, the first obtaining module 32 is specifically configured to:

and acquiring a first feature vector corresponding to each single character in the sentence to be segmented by querying the emission matrix dictionary corresponding to the target language type.

It should be noted that the foregoing explanation on the embodiment of the word segmentation processing method is also applicable to the word segmentation processing apparatus of this embodiment, and is not repeated here.

Fig. 4 is a schematic structural diagram of a word segmentation processing device according to another embodiment of the present invention.

As shown in fig. 4, on the basis of fig. 3, the word segmentation processing apparatus further includes:

the second processing module 41 is configured to perform normalization processing on each character included in the sentence to be segmented, and determine a Chinese character tag to which each single character belongs.

The third processing module 42 is configured to correct the sequence labeling result according to a preset proper noun dictionary and a segmentation rule.

The third determining module 43 is configured to determine that the confidence of the proper noun in the sentence to be segmented is greater than a threshold.

The second obtaining module 44 is configured to obtain the emission matrix dictionary corresponding to the target language type.

As shown in fig. 5, the mobile terminal includes:

a memory 51, a processor 52 and a computer program stored on the memory 51 and executable on the processor 52.

The processor 52 implements the word segmentation processing method provided in the above-described embodiment when executing the program.

The mobile terminal can be a computer, a mobile phone, a wearable device and the like.

Further, the mobile terminal further includes:

a communication interface 53 for communication between the memory 51 and the processor 52.

A memory 51 for storing a computer program operable on the processor 52.

The memory 51 may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.

The processor 52 is configured to implement the word segmentation processing method according to the foregoing embodiment when executing the program.

If the memory 51, the processor 52 and the communication interface 53 are implemented independently, the communication interface 53, the memory 51 and the processor 52 may be connected to each other through a bus and communicate with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 5, but this does not mean that there is only one bus or one type of bus.

Alternatively, in practical implementation, if the memory 51, the processor 52 and the communication interface 53 are integrated on one chip, the memory 51, the processor 52 and the communication interface 53 may complete communication with each other through an internal interface.

Processor 52 may be a Central Processing Unit (CPU), or an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits configured to implement embodiments of the present invention.

A fourth aspect embodiment of the present invention proposes a computer-readable storage medium on which a computer program is stored, which, when executed by a processor, implements a word segmentation processing method as in the preceding embodiments.

A fifth embodiment of the present invention provides a computer program product, wherein when the instructions in the computer program product are executed by a processor, the word segmentation processing method as in the foregoing embodiments is performed.

In the description herein, references to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, various embodiments or examples and features of different embodiments or examples described in this specification can be combined by one skilled in the art without contradiction.

Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, e.g., two, three, etc., unless specifically limited otherwise.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present invention.

The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.

It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present invention may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc. Although embodiments of the present invention have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present invention, and that variations, modifications, substitutions and alterations can be made to the above embodiments by those of ordinary skill in the art within the scope of the present invention.

Claims

1. A word segmentation processing method, comprising:

when a sentence to be segmented is obtained, determining a target language type corresponding to the sentence to be segmented;

according to the target language type, respectively obtaining a first feature vector corresponding to each single character in the sentence to be segmented, a second feature vector corresponding to two characters and a third feature vector corresponding to a proper noun in the sentence to be segmented, wherein the first feature vector represents the weights with which the label of each single character is, respectively, a beginning character, a middle character, an ending character and a single character phrase; the second feature vector represents, when each single character of the two characters is combined with the other single character, the weights with which the label of each single character is, respectively, a beginning character, a middle character, an ending character and a single character phrase; and the third feature vector represents, when each single character in the proper noun is combined with the single characters of the proper noun other than itself, the weights with which the label of each single character is, respectively, a beginning character, a middle character, an ending character and a single character phrase;

determining a current fourth feature vector of each word according to the first feature vector, the second feature vector and the third feature vector, wherein the fourth feature vector has the same dimension as the first feature vector, and the determining the current fourth feature vector of each word comprises: linearly superposing the first feature vector, the second feature vector and the third feature vector to obtain a fourth feature vector;

and performing word segmentation on the sentence to be segmented according to a preset Chinese character label transfer matrix and the current fourth feature vector of each single character.
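As an illustration only of the linear superposition recited in claim 1, the following minimal Python sketch assumes that each feature vector holds four weights in the tag order beginning (B), middle (M), ending (E), single-character word (S), and that superposition is an element-wise weighted sum. The function name `superpose`, the tag letters and the default coefficients are hypothetical and do not come from the claims.

```python
from typing import List, Sequence

# Assumed tag order: beginning, middle, ending, single-character word.
TAGS = ("B", "M", "E", "S")


def superpose(first: Sequence[float],
              second: Sequence[float],
              third: Sequence[float],
              coeffs: Sequence[float] = (1.0, 1.0, 1.0)) -> List[float]:
    """Element-wise linear combination of three per-character feature vectors.

    The result has the same dimension as `first`, as the claim requires.
    The coefficients are hypothetical; the claim does not specify them.
    """
    if not (len(first) == len(second) == len(third) == len(TAGS)):
        raise ValueError("each feature vector must hold one weight per tag")
    a, b, c = coeffs
    return [a * x + b * y + c * z for x, y, z in zip(first, second, third)]


# Example: a character with single-character, two-character and proper-noun evidence.
fourth = superpose([0.5, 0.0, 0.25, 0.25],
                   [1.0, 0.0, 0.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0])
```

With unit coefficients the fourth vector is simply the sum of the three inputs, so evidence from the two-character and proper-noun features shifts the per-tag weights of the single character without changing the vector's dimension.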

2. The method of claim 1, wherein before determining the first feature vector corresponding to each single character in the sentence to be segmented, the method further comprises:

carrying out normalization processing on each character included in the sentence to be segmented, and determining the Chinese character label to which each single character belongs.
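The claim does not define the normalization or the label set. As one plausible reading, the sketch below assumes that normalization means Unicode compatibility folding (full-width to half-width) plus lowercasing, and that each character is then assigned a coarse class. The function names and the classes `HAN`, `DIGIT`, `LATIN` and `OTHER` are hypothetical.

```python
import unicodedata
from typing import List


def normalize_chars(sentence: str) -> List[str]:
    """Fold full-width forms to half-width (NFKC), then lowercase (assumed policy)."""
    folded = unicodedata.normalize("NFKC", sentence).lower()
    return list(folded)


def coarse_class(ch: str) -> str:
    """Assign a hypothetical coarse class to a single normalized character."""
    if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs block
        return "HAN"
    if ch.isdigit():
        return "DIGIT"
    if ch.isascii() and ch.isalpha():
        return "LATIN"
    return "OTHER"


chars = normalize_chars("ＡＢ１中")
classes = [coarse_class(ch) for ch in chars]
```

Folding variants of the same character before lookup would keep the per-character dictionaries smaller, which fits the stated goal of saving storage on the mobile terminal.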

3. The method of claim 1, wherein the performing of word segmentation on the sentence to be segmented comprises:

performing Markov decoding processing on the preset Chinese character label transfer matrix and the current fourth feature vector of each single character to determine a sequence labeling result corresponding to the sentence to be segmented;

after the word segmentation processing is performed on the sentence to be segmented, the method further comprises the following steps:

correcting the sequence labeling result according to a preset proper noun dictionary and a segmentation rule.
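The claim says only "Markov decoding". The sketch below assumes a Viterbi-style dynamic program, a common way to decode a first-order Markov tag sequence, with additive scores. The transition scores, tag letters and function names are hypothetical, and the dictionary-based correction step is not shown.

```python
from typing import List, Sequence

TAGS = ("B", "M", "E", "S")  # beginning, middle, ending, single-character word
NEG = float("-inf")

# Hypothetical transfer matrix: 0.0 for allowed tag pairs, -inf otherwise.
ALLOWED = {("B", "M"), ("B", "E"), ("M", "M"), ("M", "E"),
           ("E", "B"), ("E", "S"), ("S", "B"), ("S", "S")}
TRANS = [[0.0 if (p, q) in ALLOWED else NEG for q in TAGS] for p in TAGS]


def viterbi(emissions: Sequence[Sequence[float]],
            trans: Sequence[Sequence[float]] = TRANS) -> List[str]:
    """Return the highest-scoring tag sequence for per-character score vectors."""
    if not emissions:
        return []
    n = len(TAGS)
    # Assumption: a sentence can only start with B or S.
    score = [emissions[0][j] if TAGS[j] in ("B", "S") else NEG for j in range(n)]
    back = []
    for vec in emissions[1:]:
        new_score, ptr = [], []
        for j in range(n):
            best_i = max(range(n), key=lambda i: score[i] + trans[i][j])
            new_score.append(score[best_i] + trans[best_i][j] + vec[j])
            ptr.append(best_i)
        score = new_score
        back.append(ptr)
    # Assumption: a sentence can only end with E or S.
    last = max((j for j in range(n) if TAGS[j] in ("E", "S")), key=lambda j: score[j])
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return [TAGS[j] for j in reversed(path)]


def tags_to_words(chars: str, tags: Sequence[str]) -> List[str]:
    """Cut the character sequence after every E or S tag."""
    words, cur = [], ""
    for ch, tag in zip(chars, tags):
        cur += ch
        if tag in ("E", "S"):
            words.append(cur)
            cur = ""
    if cur:
        words.append(cur)
    return words


# Made-up fourth feature vectors for the four characters of "我爱北京".
emissions = [[0, 0, 0, 2], [0, 0, 0, 2], [2, 0, 0, 0], [0, 0, 2, 0]]
tags = viterbi(emissions)
words = tags_to_words("我爱北京", tags)
```

The decoded tag sequence is the sequence labeling result; a correction pass against a proper noun dictionary and segmentation rules, as recited in the claim, would operate on `tags` or `words` afterwards.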

4. The method according to any one of claims 1 to 3, wherein before obtaining the third feature vector corresponding to the proper noun in the sentence to be segmented, the method further comprises:

determining that the confidence of the proper noun in the sentence to be segmented is greater than a threshold value.

5. The method according to any one of claims 1 to 3, wherein the obtaining of the first feature vector corresponding to each single character in the sentence to be segmented comprises:

6. The method of claim 5, wherein before obtaining the first feature vector corresponding to each single character in the sentence to be segmented, the method further comprises:

acquiring an emission matrix dictionary corresponding to the target language type.
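As a rough sketch of acquiring a language-specific emission matrix dictionary on demand, the Python below assumes that the dictionary maps each single character to its four tag weights and that resources are loaded lazily per language type. The class `EmissionStore`, the loader `demo_loader` and the sample weights are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence

EmissionDict = Dict[str, List[float]]


class EmissionStore:
    """Loads one emission dictionary per language type, only when first needed."""

    def __init__(self, loader: Callable[[str], EmissionDict]) -> None:
        self._loader = loader
        self._cache: Dict[str, EmissionDict] = {}

    def get(self, language: str) -> EmissionDict:
        # Load on first use, then reuse the cached dictionary.
        if language not in self._cache:
            self._cache[language] = self._loader(language)
        return self._cache[language]

    def first_vector(self, language: str, ch: str,
                     default: Sequence[float] = (0.0, 0.0, 0.0, 0.0)) -> List[float]:
        """Look up the per-tag weights of one character (assumed B, M, E, S order)."""
        return list(self.get(language).get(ch, default))


def demo_loader(language: str) -> EmissionDict:
    """Stand-in for reading a resource file; the weights are made up."""
    data = {"zh": {"京": [0.1, 0.1, 0.7, 0.1]}}
    return data.get(language, {})


store = EmissionStore(demo_loader)
vec = store.first_vector("zh", "京")
```

Loading only the dictionary for the detected language type is one way to realize the on-demand resource loading that the abstract credits with saving storage space on the mobile terminal.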

7. A word segmentation processing apparatus, comprising:

a first determining module, configured to determine a target language type corresponding to a sentence to be segmented when the sentence to be segmented is obtained;

a first obtaining module, configured to obtain, according to the target language type, a first feature vector corresponding to each single character in the sentence to be segmented, a second feature vector corresponding to two characters, and a third feature vector corresponding to a proper noun in the sentence to be segmented, wherein the first feature vector represents the weights of the label of each single character being, respectively, a beginning character, a middle character, an ending character and a single-character word; the second feature vector represents, when each single character of the two characters is combined with the other single character, the weights of the label of each single character being, respectively, a beginning character, a middle character, an ending character and a single-character word; and the third feature vector represents, when each single character of the proper noun is combined with the other single characters of the proper noun, the weights of the label of each single character being, respectively, a beginning character, a middle character, an ending character and a single-character word;

a second determining module, configured to determine a current fourth feature vector of each single character according to the first feature vector, the second feature vector and the third feature vector, wherein the fourth feature vector has the same dimension as the first feature vector, and the second determining module is specifically configured to: linearly superpose the first feature vector, the second feature vector and the third feature vector to obtain the fourth feature vector;

and a first processing module, configured to perform word segmentation on the sentence to be segmented according to a preset Chinese character label transfer matrix and the current fourth feature vector of each single character.

8. The apparatus of claim 7, further comprising:

a second processing module, configured to carry out normalization processing on each character included in the sentence to be segmented and to determine the Chinese character label to which each single character belongs.

9. The apparatus of claim 7, wherein the first processing module is specifically configured to:

the apparatus further comprises:

a third processing module, configured to correct the sequence labeling result according to a preset proper noun dictionary and a segmentation rule.

10. The apparatus of any of claims 7-9, further comprising:

a third determining module, configured to determine that the confidence of the proper noun in the sentence to be segmented is greater than a threshold value.

11. The apparatus of any one of claims 7-9, wherein the first obtaining module is specifically configured to:

12. The apparatus of claim 11, further comprising:

a second obtaining module, configured to acquire the emission matrix dictionary corresponding to the target language type.

13. A mobile terminal, comprising:

a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the word segmentation processing method according to any one of claims 1 to 6 when executing the program.

14. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the word segmentation processing method according to any one of claims 1 to 6.