CN111695350B

CN111695350B - Word segmentation method and word segmentation device for text

Info

Publication number: CN111695350B
Application number: CN201910195387.3A
Authority: CN
Inventors: 陈坦访; 王伟玮; 李奘
Original assignee: Beijing Didi Infinity Technology and Development Co Ltd
Current assignee: Beijing Didi Infinity Technology and Development Co Ltd
Priority date: 2019-03-14
Filing date: 2019-03-14
Publication date: 2023-12-08
Anticipated expiration: 2039-03-14
Also published as: CN111695350A

Abstract

The application relates to the technical field of Chinese text processing, and in particular to a word segmentation method and a word segmentation device for text. The word segmentation method comprises: obtaining a Chinese text to be processed, and segmenting the Chinese text into a plurality of Chinese short texts, wherein each Chinese short text comprises a plurality of consecutive Chinese characters representing a semantic meaning, so that the length of the Chinese text can be reduced and interference from non-Chinese characters can be filtered out; and outputting the segmented Chinese text based on the plurality of Chinese short texts and a pre-trained Chinese word segmentation model, so that the word segmentation efficiency for the Chinese text can be improved.

Description

Word segmentation method and word segmentation device for text

Technical Field

The application relates to the technical field of Chinese text processing, in particular to a word segmentation method and a word segmentation device for texts.

Background

In various scenarios of Chinese natural language processing, words usually need to be taken as the minimum basic unit of study. However, Chinese is written in units of characters, and there are no marks such as spaces between words to indicate word boundaries. Word segmentation is therefore basic work in Chinese text processing, and its quality plays a critical role in subsequent Chinese information processing.

At present, Chinese word segmentation commonly adopts statistics-based methods, represented by the hidden Markov model (Hidden Markov Model, HMM), which use a dynamic programming algorithm to perform sequence labeling on the text to be segmented. However, in a massive-data environment, these methods need to label a large number of long texts, and the time cost is high.

Disclosure of Invention

Accordingly, an object of the embodiments of the present application is to provide a word segmentation method and a word segmentation device for text, which are used for improving the word segmentation efficiency of Chinese text.

Mainly comprises the following aspects:

in a first aspect, an embodiment of the present application provides a word segmentation method for a text, where the word segmentation method includes:

acquiring a Chinese text to be processed;

dividing the Chinese text into a plurality of Chinese short texts; wherein each Chinese short text comprises a plurality of consecutive Chinese characters representing a semantic meaning;

based on the segmented Chinese short texts and a pre-trained Chinese word segmentation model, outputting the segmented Chinese texts.

In one possible implementation, the dividing the Chinese text into a plurality of Chinese short texts includes:

inputting the Chinese text into a predefined regular expression for text segmentation to obtain a plurality of Chinese short texts;

wherein the regular expression is used to extract, as a whole, each run of consecutive Chinese characters that is separated from the rest of the Chinese text by non-Chinese characters.

In one possible implementation manner, after the Chinese text is divided into a plurality of Chinese short texts, the method further includes:

extracting the characteristics of each Chinese character in the Chinese short text to obtain the characteristic vector of each Chinese character;

the method for outputting the segmented Chinese text based on the segmented Chinese short texts and the pre-trained Chinese word segmentation model comprises the following steps:

and inputting the feature vector of each Chinese character in each Chinese short text into the Chinese word segmentation model, and outputting the segmented Chinese short text.

In one possible implementation manner, the extracting the features of each Chinese character in the Chinese short text to obtain the feature vector of each Chinese character includes:

extracting the characteristics of each Chinese character in the Chinese short text to obtain a character vector, a position vector and a stroke order vector of the Chinese character;

and carrying out weighted summation on the character vector, the position vector and the stroke order vector of the Chinese character to obtain a characteristic vector of the Chinese character.

In one possible implementation, the Chinese word segmentation model includes a Transformer-based bidirectional encoder BERT and a conditional random field CRF;

inputting the feature vector of each Chinese character in each Chinese short text into the Chinese word segmentation model, and outputting the segmented Chinese short text, wherein the method comprises the following steps:

for each Chinese short text, inputting the feature vector of each Chinese character in the Chinese short text into the Transformer-based bidirectional encoder BERT to obtain the global information vector of each Chinese character in the Chinese short text;

and inputting the global information vector of each Chinese character in the Chinese short text into the conditional random field CRF, and outputting the segmented Chinese short text.

In a possible implementation manner, the inputting the global information vector of each Chinese character in the Chinese short text into the conditional random field CRF, and outputting the segmented Chinese short text includes:

inputting the global information vector of each Chinese character in the Chinese short text into the conditional random field CRF, and determining word position labeling information corresponding to each Chinese character in the Chinese short text;

And outputting the segmented Chinese short text according to word position labeling information corresponding to each Chinese character in the Chinese short text.

In one possible implementation manner, the inputting the global information vector of each Chinese character in the Chinese short text into the conditional random field CRF, and determining word position labeling information corresponding to each Chinese character in the Chinese short text, includes:

inputting the global information vector of each Chinese character in the Chinese short text into the conditional random field CRF, and determining the probability of each Chinese character in the Chinese short text at a preset target position;

and determining word position labeling information corresponding to each Chinese character in the Chinese short text according to the probability of each Chinese character in the Chinese short text in the preset target position.

In a possible implementation manner, the determining word position labeling information corresponding to each Chinese character in the Chinese short text according to the probability that each Chinese character in the Chinese short text is at the preset target position includes:

determining word position labeling information corresponding to each Chinese character in the Chinese short text according to the maximum among the probabilities that the Chinese character is at the preset target positions;

and determining word position labeling information corresponding to each Chinese character in the Chinese short text according to the probability of each Chinese character in the Chinese short text at the preset target position and the position of each Chinese character in the corresponding Chinese short text.

In one possible embodiment, the preset target position includes:

the starting position of a word, the middle position of a word, the ending position of a word, and the position of a non-word.

In a second aspect, an embodiment of the present application further provides a word segmentation apparatus for a text, where the word segmentation apparatus includes:

the acquisition module is used for acquiring the Chinese text to be processed;

the segmentation module is used for segmenting the Chinese text into a plurality of Chinese short texts; wherein each Chinese short text comprises a plurality of consecutive Chinese characters representing a semantic meaning;

and the output module is used for outputting the segmented Chinese text based on the segmented Chinese short texts and a pre-trained Chinese word segmentation model.

In a possible implementation manner, the segmentation module is specifically configured to segment the Chinese text into a plurality of Chinese short texts according to the following steps:

In one possible implementation manner, the word segmentation device further comprises an extraction module;

the extraction module is used for extracting the characteristics of each Chinese character in the Chinese short text to obtain the characteristic vector of each Chinese character;

the output module is used for outputting the Chinese text after word segmentation according to the following steps:

In one possible implementation manner, the extraction module is specifically configured to extract the feature vector of each Chinese character according to the following steps:

the output module is specifically configured to output the segmented Chinese short text according to the following steps:

In one possible implementation, the output module includes a determination module;

the determining module is used for inputting the global information vector of each Chinese character in the Chinese short text into the conditional random field CRF and determining word position labeling information corresponding to each Chinese character in the Chinese short text;

the output module is further used for outputting the Chinese short text after word segmentation according to word position marking information corresponding to each Chinese character in the Chinese short text.

In one possible implementation manner, the determining module is configured to determine word position labeling information corresponding to each Chinese character according to the following steps:

In a possible implementation manner, the determining module is further configured to determine word position labeling information corresponding to each Chinese character according to the following steps:

In one possible embodiment, the preset target position includes:

In a third aspect, an embodiment of the present application further provides an electronic device, including: a processor, a memory and a bus, the memory storing machine readable instructions executable by the processor, the processor and the memory in communication via the bus when the electronic device is running, the machine readable instructions when executed by the processor performing the steps of the method of word segmentation of text as described in the first aspect or any of the possible implementations of the first aspect.

In a fourth aspect, the embodiment of the present application further provides a computer readable storage medium, where a computer program is stored, where the computer program is executed by a processor to perform the steps of the text word segmentation method described in the first aspect or any possible implementation manner of the first aspect.

According to the embodiment of the application, the Chinese text to be processed is obtained and segmented into a plurality of Chinese short texts, wherein each Chinese short text comprises a plurality of continuous Chinese characters representing a semantic meaning, so that not only can the length of the Chinese text be reduced, but also the interference of non-Chinese characters can be filtered, further, based on the segmented Chinese short texts and a pre-trained Chinese word segmentation model, the Chinese text after word segmentation is output, and the word segmentation efficiency of the Chinese text can be improved.

In order to make the above objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a method for word segmentation of text according to a first embodiment of the present application;

FIG. 2 is a flowchart of a text word segmentation method according to a second embodiment of the present application;

FIG. 3 is a functional block diagram of a text word segmentation device according to a third embodiment of the present application;

FIG. 4 is a second functional block diagram of a text word segmentation device according to a third embodiment of the present application;

FIG. 5 is a schematic structural diagram of an electronic device according to a fourth embodiment of the present application.

Detailed Description

To make the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described with reference to the accompanying drawings in the embodiments of the present application. It should be understood that the drawings in the present application are for the purpose of illustration and description only and are not intended to limit the scope of the present application. In addition, it should be understood that the schematic drawings are not drawn to scale. A flowchart, as used in this disclosure, illustrates operations implemented according to some embodiments of the present application. It should be appreciated that the operations of the flowcharts may be implemented out of order, and that steps with no logical dependency on one another may be performed in reverse order or concurrently. Moreover, one or more other operations may be added to or removed from the flowcharts by those skilled in the art under the direction of the present disclosure.

In addition, the described embodiments are only some, but not all, embodiments of the application. The components of the embodiments of the present application generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the application, as presented in the figures, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by a person skilled in the art without making any inventive effort, are intended to be within the scope of the present application.

In order to enable those skilled in the art to make and use the present disclosure, the following embodiments are provided in connection with a particular application scenario, "word segmentation of Chinese text". It will be apparent to those skilled in the art that the general principles defined herein may be applied to other embodiments and application scenarios without departing from the spirit and scope of the present disclosure.

The method, the device, the electronic equipment or the computer readable storage medium can be applied to any scene needing word segmentation processing, the embodiment of the application does not limit specific application scenes, and any scheme using the word segmentation method and the word segmentation device of the text provided by the embodiment of the application is within the protection scope of the application.

It should be noted that, before the present application was made, most existing schemes were either word segmentation methods that match the whole text against a dictionary or lexicon, or model-based word segmentation methods represented by the HMM. The dictionary-based word segmentation method requires a dictionary to be constructed manually and updated continuously, which incurs a large labor cost. The statistics-based word segmentation method represented by the HMM uses a dynamic programming algorithm to perform sequence labeling on the text to be segmented; however, in a massive-data environment, a large number of long texts need to be labeled, and the time cost is high.

In view of the above problems, the word segmentation method and word segmentation device for text provided by the embodiments of the present application obtain a Chinese text to be processed and segment it into a plurality of Chinese short texts, where each Chinese short text includes a plurality of consecutive Chinese characters representing a semantic meaning. This not only reduces the length of the Chinese text but also filters out interference from non-Chinese characters. The segmented Chinese text is then output based on the plurality of Chinese short texts and a pre-trained Chinese word segmentation model, so that the word segmentation efficiency for Chinese text can be improved.

It should be noted that Chinese word segmentation (Chinese Word Segmentation) refers to the process of segmenting a Chinese character sequence into individual words, that is, recombining a continuous character sequence into a word sequence according to a certain specification. For example, performing word segmentation on "I love blue sky and white cloud" yields the word segmentation result "I/love/blue sky and white cloud".

The Transformer-based bidirectional encoder (Bidirectional Encoder Representations from Transformers, BERT) is a deep learning model that can capture contextual information in Chinese text.

The conditional random field (Conditional Random Field, CRF) is an undirected graphical model that has in recent years performed well in sequence labeling tasks such as word segmentation, part-of-speech tagging, and named entity recognition.

In order to facilitate understanding of the present application, the following detailed description of the technical solution provided by the present application is provided in connection with specific embodiments.

Example 1

Referring to fig. 1, a flowchart of a text word segmentation method according to an embodiment of the present application includes the following steps:

S101: obtaining the Chinese text to be processed.

In a specific implementation, the Chinese text to be segmented may be obtained first.

In various scenarios of Chinese natural language processing, words usually need to be taken as the minimum basic unit of study. However, Chinese is written in units of characters, and there are no marks such as spaces between words to indicate word boundaries. Word segmentation is therefore basic work in Chinese text processing, and its quality plays a critical role in subsequent Chinese information processing.

S102: dividing the Chinese text into a plurality of Chinese short texts; wherein each Chinese short text comprises a plurality of consecutive Chinese characters representing a semantic meaning.

In the implementation, after the Chinese text which needs word segmentation is obtained, the Chinese text is firstly segmented into a plurality of Chinese short texts, so that word segmentation is carried out by taking the Chinese short texts as units in the subsequent word segmentation link, the length of the Chinese text can be reduced, and the interference of non-Chinese characters can be filtered.

Here, each Chinese short text is composed of a plurality of consecutive Chinese characters that represent one semantic meaning. For example, the Chinese text is "I want to go to the city, he wants to go back to the rural area.". Cutting the Chinese text according to the preset rule yields two Chinese short texts, namely "I want to go to the city" and "he wants to go back to the rural area".

S103: based on the segmented Chinese short texts and a pre-trained Chinese word segmentation model, outputting the segmented Chinese texts.

In a specific implementation, outputting the segmented Chinese text based on the segmented Chinese short texts and the pre-trained Chinese word segmentation model includes two implementations:

Implementation one: the plurality of Chinese short texts are respectively input into the pre-trained Chinese word segmentation model to obtain a plurality of segmented Chinese short texts, and then all segmented Chinese short texts are spliced to output the segmented Chinese text.

Implementation two: all Chinese short texts are input into the pre-trained Chinese word segmentation model together, in parallel, so as to output the segmented Chinese text.

Here, the Chinese word segmentation model is trained before word segmentation of the Chinese text is performed, and can be used directly for word segmentation of the Chinese text. The Chinese word segmentation model may be a word segmentation model based on string matching, an understanding-based word segmentation model, a statistics-based word segmentation model, and so on.

In the embodiment of the application, the Chinese text is segmented into a plurality of Chinese short texts by acquiring the Chinese text to be processed, wherein each Chinese short text comprises a plurality of continuous Chinese characters representing a semantic meaning, the length of the Chinese text can be reduced, the interference of non-Chinese characters can be filtered, and further, the Chinese text after word segmentation is output based on the segmented Chinese short texts and a pre-trained Chinese word segmentation model, so that the word segmentation efficiency of the Chinese text can be improved.
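The two implementations of S103 can be sketched as follows. This is only an illustrative sketch: the names `toy_segmenter`, `TOY_VOCAB`, `segment_sequential` and `segment_parallel` are hypothetical, the placeholder model is a simple forward maximum matching against a two-word vocabulary rather than the pre-trained model of the application, reading "in parallel" as a thread pool is one possible interpretation, and the Chinese input is an assumed rendering of the example short texts.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical toy vocabulary; a real system would use a trained model instead.
TOY_VOCAB = {"城市", "农村"}


def toy_segmenter(short_text: str, max_len: int = 2) -> List[str]:
    """Placeholder model: forward maximum matching against TOY_VOCAB."""
    words, i = [], 0
    while i < len(short_text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(short_text) - i), 0, -1):
            piece = short_text[i:i + size]
            if size == 1 or piece in TOY_VOCAB:
                words.append(piece)
                i += size
                break
    return words


def segment_sequential(short_texts: List[str],
                       model: Callable[[str], List[str]]) -> List[str]:
    """Implementation one: segment each short text in turn, then splice."""
    result: List[str] = []
    for short_text in short_texts:
        result.extend(model(short_text))
    return result


def segment_parallel(short_texts: List[str],
                     model: Callable[[str], List[str]]) -> List[str]:
    """Implementation two: hand all short texts to the model concurrently."""
    with ThreadPoolExecutor() as pool:
        per_text = list(pool.map(model, short_texts))  # map keeps input order
    return [word for words in per_text for word in words]


short_texts = ["我想去城市", "他想回农村"]
print("/".join(segment_sequential(short_texts, toy_segmenter)))
```

Both functions return the same word list; they differ only in how the short texts are handed to the model.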

In a possible implementation manner, the dividing the Chinese text into a plurality of Chinese short texts in S102 includes the following steps:

and inputting the Chinese text into a predefined regular expression for text segmentation to obtain a plurality of Chinese short texts.

In a specific implementation, after a Chinese text needing word segmentation is obtained, the Chinese text is input into a predefined regular expression for segmentation, so that a plurality of Chinese short texts are obtained. In this way, in the subsequent word segmentation step, word segmentation is performed by taking the Chinese short text as a unit, so that the length of the Chinese text can be reduced, and the interference of non-Chinese characters can be filtered.

Here, study of Chinese text shows that the block of consecutive Chinese characters is a common text unit. Non-Chinese characters in the Chinese text, such as punctuation marks and foreign characters, play almost no role in Chinese word formation; but if they are input into the Chinese word segmentation model together with the Chinese characters, they increase both the complexity and the time cost of word segmentation by the model. Considering that in the unified character encoding standard (Unicode) the range of Chinese characters is between \u4E00 and \u9FD5, the individual Chinese short texts in the Chinese text can be cut out by creating a regular expression that separates Chinese characters from non-Chinese characters. Specifically, a block of consecutive characters within the above range is taken as a whole, and a block of consecutive characters outside the above range is taken as a dividing part, thereby converting the Chinese text into a plurality of Chinese short texts. For example, in the Chinese text "I want to go to the city, he wants to go back to the rural area.", "I want to go to the city" and "he wants to go back to the rural area" are blocks of consecutive characters within the range \u4E00 to \u9FD5, whereas "," and "." are characters outside that range. Inputting this Chinese text into the predefined regular expression therefore yields two Chinese short texts, "I want to go to the city" and "he wants to go back to the rural area".

It should be noted that a regular expression (Regular Expression, RE) is a logical formula for operating on character strings: a "rule string" is formed from predefined specific characters and combinations of those characters, and the "rule string" is used to express filtering logic for character strings.
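A minimal sketch of such a regular expression, using Python's standard `re` module, is given below. The pattern and the function name `split_into_short_texts` are illustrative assumptions and not the expression actually used in the application; only the character range is taken from the text. The Chinese sentence is an assumed rendering of the example above.

```python
import re
from typing import List

# Character class covering the range named in the text (U+4E00 to U+9FD5).
# The exact pattern is an assumption made for illustration.
CHINESE_RUN = re.compile(r"[\u4e00-\u9fd5]+")


def split_into_short_texts(text: str) -> List[str]:
    """Return every maximal run of consecutive Chinese characters in text."""
    return CHINESE_RUN.findall(text)


print(split_into_short_texts("我想去城市，他想回农村。"))
# ['我想去城市', '他想回农村']
```

Because `findall` returns only the matching runs, punctuation, Latin letters and digits are dropped as a side effect of the split, which corresponds to the filtering of non-Chinese characters described above.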

Example 2

Referring to fig. 2, a flowchart of a text word segmentation method according to a second embodiment of the present application includes the following steps:

S201: obtaining the Chinese text to be processed.

S202: dividing the Chinese text into a plurality of Chinese short texts; wherein each Chinese short text comprises a plurality of consecutive Chinese characters representing a semantic meaning.

Here, each Chinese short text is composed of a plurality of consecutive Chinese characters that represent one semantic meaning. For example, the Chinese text is "I want to go to the northeast, he wants to go to the south.". Cutting the Chinese text according to the preset rule yields two Chinese short texts, namely "I want to go to the northeast" and "he wants to go to the south".

S203: and extracting the characteristics of each Chinese character in the Chinese short text to obtain the characteristic vector of each Chinese character.

In a specific implementation, after a plurality of Chinese short texts are obtained by segmentation, feature extraction can be performed on each Chinese character in the plurality of Chinese short texts so as to extract a feature vector corresponding to each Chinese character, so that the Chinese word segmentation model can recognize each Chinese character more efficiently.

S204: and inputting the feature vector of each Chinese character in each Chinese short text into the Chinese word segmentation model, and outputting the segmented Chinese short text.

In a specific implementation, inputting the feature vector of each Chinese character in each Chinese short text into the Chinese word segmentation model and outputting the segmented Chinese text includes two implementations:

Implementation one: the feature vectors of the Chinese short texts are respectively input into the pre-trained Chinese word segmentation model to obtain a plurality of segmented Chinese short texts, then all segmented Chinese short texts are spliced, and the segmented Chinese text is output.

Implementation two: the feature vectors of all Chinese short texts are input into the pre-trained Chinese word segmentation model together, and the segmented Chinese text is output.

Here, the Chinese word segmentation model is trained before word segmentation of the Chinese text is performed, and can be used directly for Chinese word segmentation. The Chinese word segmentation model may be a word segmentation model based on string matching, an understanding-based word segmentation model, a statistics-based word segmentation model, and so on.

In the embodiment of the application, the Chinese text to be processed is segmented into a plurality of Chinese short texts, and each Chinese character in the Chinese short texts is subjected to feature extraction to obtain the feature vector of each Chinese character, and further, the feature vector of each Chinese character in each Chinese short text is input into a Chinese word segmentation model, so that the segmented Chinese short texts can be output. By adopting the mode, the efficiency of word segmentation of the Chinese text can be improved by reducing the length of the Chinese text and filtering the interference of non-Chinese characters.

In a possible implementation manner, the extracting the features of each Chinese character in the Chinese short text in S203 to obtain the feature vector of each Chinese character includes the following steps:

In a specific implementation, the character feature, position feature, and stroke order feature of each Chinese character in the Chinese short text are first extracted to obtain the character vector, position vector, and stroke order vector corresponding to each Chinese character; the feature vector corresponding to each Chinese character is then obtained as the weighted sum of its character vector, position vector, and stroke order vector.

An input Chinese character needs to be converted into a feature vector that carries the features of the Chinese character and can be recognized by the model; in general, a character vector, a position vector, and a stroke order vector can be extracted from the Chinese character. Chinese characters are pictographic: one Chinese character has a plurality of strokes, and the content and order of the strokes can represent information about the Chinese character. For example, "大" ("big") can be broken down into three strokes: horizontal (一), left-falling (丿), and right-falling (㇏). In view of this pictographic nature, character, position, and stroke order information is introduced into the input part of the Chinese word segmentation model.
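The weighted summation can be illustrated with the sketch below. Everything in it is an assumption made for illustration: the vector size `DIM`, the random lookup tables standing in for learned embeddings, the `STROKE_ORDER` table, and the choice to key the stroke order vector on the whole stroke sequence. The application does not specify how the three vectors or the weights are obtained.

```python
import random
from typing import Dict, Hashable, List, Sequence

DIM = 4  # illustrative vector size
_rng = random.Random(0)


class EmbeddingTable:
    """Assigns a fixed random vector to each key (stand-in for learned embeddings)."""

    def __init__(self) -> None:
        self._table: Dict[Hashable, List[float]] = {}

    def __getitem__(self, key: Hashable) -> List[float]:
        if key not in self._table:
            self._table[key] = [_rng.uniform(-1.0, 1.0) for _ in range(DIM)]
        return self._table[key]


char_table = EmbeddingTable()
pos_table = EmbeddingTable()
stroke_table = EmbeddingTable()

# Hypothetical stroke-order lookup: horizontal, left-falling, right-falling.
STROKE_ORDER = {"大": "一丿㇏"}


def feature_vector(char: str, position: int,
                   weights: Sequence[float] = (1.0, 1.0, 1.0)) -> List[float]:
    """Weighted sum of the character, position and stroke order vectors."""
    w_char, w_pos, w_stroke = weights
    c = char_table[char]
    p = pos_table[position]
    s = stroke_table[STROKE_ORDER.get(char, "")]
    return [w_char * a + w_pos * b + w_stroke * d for a, b, d in zip(c, p, s)]


print(feature_vector("大", 0, weights=(0.5, 0.3, 0.2)))
```

Summing the three vectors, rather than concatenating them, keeps the input dimension of the model independent of how many kinds of feature are combined.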

In one possible implementation, the Chinese word segmentation model includes a Transformer-based bidirectional encoder BERT and a conditional random field CRF.

Here, the Chinese word segmentation model includes a Transformer-based bidirectional encoder BERT and a conditional random field CRF. The HMM used in the prior art considers only the state at the current position of a Chinese character and the state of the character at the previous position, so the HMM cannot capture long-range dependencies in Chinese text. For example, in a Chinese text such as "I work at a certain technology development limited company", the company name is a whole in which the Chinese characters depend on one another. Therefore, the embodiment of the application uses BERT to learn the feature functions needed by the CRF, so that the CRF can effectively learn global information at the level of the Chinese short text through BERT, without feature functions being constructed manually.

In S204, inputting the feature vector of each Chinese character in each Chinese short text into the Chinese word segmentation model, and outputting the segmented Chinese short text, includes the following steps:

Step (1): for each Chinese short text, inputting the feature vector of each Chinese character in the Chinese short text into the Transformer-based bidirectional encoder BERT to obtain the global information vector of each Chinese character in the Chinese short text.

In a specific implementation, for each Chinese short text, the feature vectors of all Chinese characters in the Chinese short text are input into BERT together, so that the global information vector of each Chinese character can be obtained.

It should be noted that the core idea of the attention mechanism used by BERT is to calculate the correlation between each Chinese character in a Chinese short text and all Chinese characters in that short text. These correlations reflect, to some extent, the relatedness and importance of different Chinese characters in the Chinese short text, and are therefore used to adjust the representation of each Chinese character into a new one, namely the global information vector of the Chinese character. The global information vector contains not only the Chinese character itself but also its relationships with the other Chinese characters in the Chinese short text, so it is a more global representation than the feature vector.
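The idea of re-expressing each character through its correlation with every character in the short text can be sketched with a single scaled dot-product self-attention step. This is a deliberately simplified illustration and not BERT itself: it omits the learned query, key and value projections, the multiple heads and the stacked layers, and the function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Turn raw correlation scores into weights that sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(vectors: List[Vector]) -> List[Vector]:
    """Correlation of every character with every character in the short text."""
    dim = len(vectors[0])
    return [
        softmax([sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                 for key in vectors])
        for query in vectors
    ]


def self_attention(vectors: List[Vector]) -> List[Vector]:
    """New representation of each character: a weighted mix of all characters."""
    dim = len(vectors[0])
    return [
        [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(dim)]
        for weights in attention_weights(vectors)
    ]


features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # one feature vector per character
for row in self_attention(features):
    print(row)
```

Each output row depends on every input row, which is the sense in which the resulting vector carries information about the whole short text.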

Step (2): input the global information vector of each Chinese character in the Chinese short text into the conditional random field CRF, and output the segmented Chinese short text.

In a specific implementation, the obtained global information vector of each Chinese character in each Chinese short text is input into the CRF, so that each word-segmented Chinese short text can be obtained. The embodiment of the application uses BERT to learn the feature functions required by the CRF, so that the CRF can effectively learn global information at the level of the Chinese short text through BERT without feature functions being constructed manually.

In a possible implementation, inputting the global information vector of each Chinese character in the Chinese short text into the conditional random field CRF and outputting the segmented Chinese short text in step (2) includes the following steps:

Step a: input the global information vector of each Chinese character in the Chinese short text into the conditional random field CRF, and determine the word position labeling information corresponding to each Chinese character in the Chinese short text.

In a specific implementation, the global information vectors of all Chinese characters in a Chinese short text, obtained through BERT, are input into the CRF, and the CRF determines the word position labeling information corresponding to each Chinese character in the Chinese short text.

Here, the word position labeling information includes labeling information for the start position of a word, the middle position of a word, the end position of a word, and a non-word position. Generally, B represents the start position of a word, I represents the middle position of a word, E represents the end position of a word, and S represents a non-word position, such as a single standalone Chinese character.

In an example, if the Chinese short text is "I love high mountains and lakes", the global information vectors corresponding to its seven Chinese characters, glossed as "I", "love", "high", "mountain", "and", and the two characters of "lake", are input into the CRF together, and the word position labeling information corresponding to these Chinese characters, such as "SSBESBE", may be output.

Step b: output the segmented Chinese short text according to the word position labeling information corresponding to each Chinese character in the Chinese short text.

In a specific implementation, after the CRF has labeled the word position of every Chinese character in the Chinese short text, the word segmentation result of the Chinese short text can be output according to the word position labeling information corresponding to each Chinese character.

In an example, if the Chinese short text is "I love high mountains and lakes" and its word position labels are "SSBESBE", the corresponding word segmentation result is "I/love/high mountains/and/lakes", where "/" separates the words.
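Step b can be sketched as a small decoder that turns a B/I/E/S label string into words. The function name `tags_to_words` is hypothetical, and the seven-character Chinese string below is an assumed reconstruction of the example sentence, inferred from the character glosses and the seven labels; it does not appear in this translated text.

```python
def tags_to_words(chars, tags):
    """Group characters into words from B/I/E/S word position labels."""
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag == "S":
            if current:  # tolerate an unfinished word in a malformed sequence
                words.append(current)
                current = ""
            words.append(ch)
        elif tag == "B":
            if current:
                words.append(current)
            current = ch
        elif tag == "I":
            current += ch
        elif tag == "E":
            words.append(current + ch)
            current = ""
        else:
            raise ValueError(f"unknown tag: {tag!r}")
    if current:
        words.append(current)
    return words


# Assumed reconstruction of the example short text (seven characters).
text = "我爱高山和湖泊"
print("/".join(tags_to_words(text, "SSBESBE")))
```

The decoder is deliberately tolerant of malformed label sequences (for example a "B" that is never closed by an "E"), flushing the partial word instead of failing.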

Here, because the global information vectors output by BERT are used as the input of the CRF, the CRF can learn information about the whole Chinese short text. The word position of each Chinese character in the Chinese short text can therefore be labeled more accurately, which improves the word segmentation accuracy of the Chinese word segmentation model and addresses the long-range dependency problem in the Chinese short text.

In a possible implementation, inputting the global information vector of each Chinese character in the Chinese short text into the conditional random field CRF and determining the word position labeling information corresponding to each Chinese character in the Chinese short text in step a includes the following steps:

Step a1: input the global information vector of each Chinese character in the Chinese short text into the conditional random field CRF, and determine the probability of each Chinese character in the Chinese short text being at each preset target position.

In a specific implementation, the global information vectors of all Chinese characters in a Chinese short text, obtained through BERT, are input into the CRF, and the CRF determines the probability of each Chinese character in the Chinese short text being at each preset target position. The preset target positions comprise the start position of a word, the middle position of a word, the end position of a word, and a non-word position. That is, the CRF determines, for every Chinese character in the Chinese short text, its probability of being at each of the different preset target positions. Generally, B represents the start position of a word, I represents the middle position of a word, E represents the end position of a word, and S represents a non-word position, such as a single standalone Chinese character.

In an example, if the Chinese short text is "I love high mountains and lakes", the global information vectors corresponding to all of its Chinese characters are input into the CRF together, and the CRF calculates the probability of each Chinese character being at each of the different preset target positions. Taking the Chinese character "I" as an example, the probability that "I" is at the start position of a word is 5%, at the middle position of a word 4%, at the end position of a word 1%, and at a non-word position 90%.
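The text does not state how the CRF arrives at these per-character probabilities. One standard way for a linear-chain CRF is the forward-backward algorithm, sketched below as an assumption rather than as the patented procedure. The emission scores (standing in for scores derived from the BERT output), the transition scores, and the names `crf_marginals` and `VALID` are all illustrative.

```python
import math

TAGS = ["B", "I", "E", "S"]


def logsumexp(values):
    """log(sum(exp(v))) computed stably."""
    peak = max(values)
    return peak + math.log(sum(math.exp(v - peak) for v in values))


def crf_marginals(emissions, transitions):
    """Per-position tag probabilities of a linear-chain CRF.

    emissions[t][k]   : score of tag k at position t
    transitions[j][k] : score of moving from tag j to tag k
    """
    n, m = len(emissions), len(TAGS)
    # Forward pass: log-score mass of all prefixes ending in tag k at t.
    alpha = [list(emissions[0])]
    for t in range(1, n):
        alpha.append([
            emissions[t][k]
            + logsumexp([alpha[t - 1][j] + transitions[j][k] for j in range(m)])
            for k in range(m)
        ])
    # Backward pass: log-score mass of all suffixes following tag k at t.
    beta = [[0.0] * m for _ in range(n)]
    for t in range(n - 2, -1, -1):
        beta[t] = [
            logsumexp([
                transitions[k][j] + emissions[t + 1][j] + beta[t + 1][j]
                for j in range(m)
            ])
            for k in range(m)
        ]
    log_z = logsumexp(alpha[-1])
    return [
        [math.exp(alpha[t][k] + beta[t][k] - log_z) for k in range(m)]
        for t in range(n)
    ]


# Illustrative scores only: label pairs outside VALID are penalised.
VALID = {("B", "I"), ("B", "E"), ("I", "I"), ("I", "E"),
         ("E", "B"), ("E", "S"), ("S", "B"), ("S", "S")}
transitions = [[0.0 if (a, b) in VALID else -10.0 for b in TAGS] for a in TAGS]
emissions = [[0.1, 0.0, 0.0, 2.0], [0.2, 0.0, 0.3, 1.5], [2.0, 0.1, 0.1, 0.2]]

marginals = crf_marginals(emissions, transitions)
for row in marginals:
    print({tag: round(p, 3) for tag, p in zip(TAGS, row)})
```

Each printed row is a probability distribution over B, I, E and S for one character, which is the kind of quantity the example above quotes.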

Step a2: determine the word position labeling information corresponding to each Chinese character in the Chinese short text according to the probability of each Chinese character in the Chinese short text being at each preset target position.

In a specific implementation, the probabilities, calculated by the CRF, of each Chinese character in the Chinese short text being at the different preset target positions can be used to label the word position corresponding to each Chinese character in the Chinese short text.

In an example, if the Chinese short text is "I love high mountains and lakes", taking the Chinese character "love" as an example, the CRF may calculate that the probability of "love" being at the start position of a word is 3%, at the middle position of a word 4%, at the end position of a word 13%, and at a non-word position 80%. It can then be determined that "love" is at a non-word position in the Chinese short text, and its word position labeling information is "S".

Here, because the global information vectors output by BERT are used as the input of the CRF, the CRF can learn information about the whole Chinese short text and, knowing that information, can calculate the probability of each Chinese character in the Chinese short text being at each of the different preset target positions. The word position of each Chinese character in the Chinese short text is therefore labeled more accurately, which improves the word segmentation accuracy of the Chinese word segmentation model and addresses the long-range dependency problem in the Chinese short text.

In a possible implementation, determining, in step a2, the word position labeling information corresponding to each Chinese character in the Chinese short text according to the probability of each Chinese character in the Chinese short text being at each preset target position includes the following steps:

Step a21: determine the word position labeling information corresponding to each Chinese character in the Chinese short text according to the maximum among the probabilities of that Chinese character being at the preset target positions.

In a specific implementation, the CRF calculates the probability of each Chinese character in the Chinese short text being at each of the different preset target positions; the preset target position with the maximum probability is then selected for each Chinese character, and the word position of each Chinese character in the Chinese short text is labeled accordingly.

In an example, if the Chinese short text is "I love high mountains and lakes", taking the Chinese character "high" as an example, suppose the calculated probability of "high" being at the start position of a word is 90%, at the middle position of a word 4%, at the end position of a word 4%, and at a non-word position 2%. The probability of the start position is the largest, so it can be determined that "high" is at the start position of a word in the Chinese short text and should be labeled "B".
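The maximum-probability rule of step a21 amounts to a per-character argmax. The sketch below uses the probabilities quoted in this section's examples for the first three characters ("I", "love" and "high"); the function name `label_by_max_probability` and the dictionary layout are hypothetical.

```python
TAGS = ("B", "I", "E", "S")


def label_by_max_probability(probabilities):
    """Pick, for each character, the tag with the highest probability."""
    labels = []
    for probs in probabilities:
        labels.append(max(TAGS, key=lambda tag: probs[tag]))
    return "".join(labels)


# Probabilities quoted in the examples for "I", "love" and "high".
example = [
    {"B": 0.05, "I": 0.04, "E": 0.01, "S": 0.90},  # "I"
    {"B": 0.03, "I": 0.04, "E": 0.13, "S": 0.80},  # "love"
    {"B": 0.90, "I": 0.04, "E": 0.04, "S": 0.02},  # "high"
]
print(label_by_max_probability(example))  # prints "SSB"
```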

Here, because the global information vectors output by BERT are used as the input of the CRF, the CRF can learn information about the whole Chinese short text and, knowing that information, can calculate the probability of each Chinese character in the Chinese short text being at each of the different preset target positions. The word position of each Chinese character is then labeled according to the preset target position with the maximum probability for that character. This improves the accuracy of word position labeling and of word segmentation by the Chinese word segmentation model, and addresses the long-range dependency problem in the Chinese short text.

In a possible implementation, determining, in step a21, the word position labeling information corresponding to each Chinese character in the Chinese short text according to the probability of each Chinese character in the Chinese short text being at the preset target position includes the following steps:

In a specific implementation, the CRF calculates the probability of each Chinese character in the Chinese short text being at each of the different preset target positions, and the word position of each Chinese character in the Chinese short text is then labeled according to both these probabilities and the position of the Chinese character within the Chinese short text.

In an example, if the Chinese short text is "I love high mountains and lakes", taking the Chinese character "I" as an example, suppose the probability that "I" is at the start position of a word is 4%, at the middle position of a word 4%, at the end position of a word 50%, and at a non-word position 42%, and that "I" is the first character of the short text. Although the probability of the end position is the greatest, the position of "I" in the short text shows that it is unlikely to be at the end position of a word, so "I" is finally determined to be at a non-word position and should be labeled "S".
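One way to combine the probabilities with the character's position is to rule out labels that are impossible at that position before taking the maximum. This is a sketch under assumptions: the text only states that the first character is unlikely to end a word; excluding "I" for the first character and "B" and "I" for the last character are additional assumptions made here, and the names `allowed_tags` and `label_with_position` are hypothetical.

```python
TAGS = ("B", "I", "E", "S")


def allowed_tags(index, length):
    """Tags that are plausible at a given position of the short text."""
    allowed = set(TAGS)
    if index == 0:
        # The first character cannot end a word (stated in the example);
        # excluding "I" as well is an assumption of this sketch.
        allowed -= {"I", "E"}
    if index == length - 1:
        # Assumption: the last character cannot start or continue a word.
        allowed -= {"B", "I"}
    return allowed


def label_with_position(probabilities):
    """Pick the most probable tag among those allowed at each position."""
    n = len(probabilities)
    labels = []
    for i, probs in enumerate(probabilities):
        candidates = sorted(allowed_tags(i, n))
        labels.append(max(candidates, key=lambda tag: probs[tag]))
    return "".join(labels)


# First character uses the probabilities quoted above for "I"; the second
# row reuses the values quoted earlier for "love".
example = [
    {"B": 0.04, "I": 0.04, "E": 0.50, "S": 0.42},
    {"B": 0.03, "I": 0.04, "E": 0.13, "S": 0.80},
]
print(label_with_position(example))  # prints "SS"
```

A plain argmax would label the first character "E"; masking by position yields "S", matching the outcome described in the example.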

Here, because the embodiment of the application considers both the probability of each Chinese character being at the different preset target positions and the position of that Chinese character in the Chinese short text when labeling its word position, the error rate of word position labeling can be reduced and its accuracy further improved, thereby improving the word segmentation accuracy of the Chinese word segmentation model.

In one possible embodiment, the preset target position includes: the starting position of a word, the middle position of a word, the ending position of a word, and the position of a non-word.

In a specific implementation, the start position of a word may be represented by B, the middle position of a word by I, the end position of a word by E, and a non-word position, such as a single standalone Chinese character, by S.

Example III

Based on the same inventive concept, the third embodiment of the present application further provides a text word segmentation device corresponding to the text word segmentation methods provided in the first and second embodiments. Because the device in this embodiment solves the problem on a principle similar to that of those methods, the implementation of the device can refer to the implementation of the method, and repeated description is omitted.

Referring to fig. 3, which is the first functional block diagram of a text word segmentation device 300 according to the third embodiment of the present application, and to fig. 4, which is the second functional block diagram of the text word segmentation device 300, the text word segmentation device 300 includes:

an obtaining module 310, configured to obtain a Chinese text to be processed;

a segmentation module 320, configured to segment the Chinese text into a plurality of Chinese short texts, wherein each Chinese short text comprises a plurality of consecutive Chinese characters representing one semantic meaning;

and an output module 330, configured to output the word-segmented Chinese text based on the plurality of Chinese short texts obtained by segmentation and a pre-trained Chinese word segmentation model.

In one possible implementation, referring to fig. 3 and 4, the segmentation module 320 is specifically configured to segment the Chinese text into a plurality of Chinese short texts according to the following steps:

In one possible implementation, referring to fig. 4, the text word segmentation device 300 further includes an extraction module 340;

the extraction module 340 is configured to perform feature extraction on each Chinese character in the Chinese short text, so as to obtain a feature vector of each Chinese character;

the output module 330 is configured to output the segmented Chinese text according to the following steps:

In one possible implementation, referring to fig. 4, the extraction module 340 is specifically configured to extract the feature vector of each Chinese character according to the following steps:

In one possible implementation, referring to fig. 3 and 4, the Chinese word segmentation model includes a Transformer-based bidirectional encoder BERT and a conditional random field CRF;

the output module 330 is specifically configured to output the segmented Chinese short text according to the following steps:

In one possible implementation, referring to fig. 4, the output module 330 includes a determining module 332;

the determining module 332 is configured to input the global information vector of each Chinese character in the Chinese short text into the conditional random field CRF, and determine the word position labeling information corresponding to each Chinese character in the Chinese short text;

the output module 330 is further configured to output the segmented Chinese short text according to the word position labeling information corresponding to each Chinese character in the Chinese short text.

In a possible implementation, the determining module 332 is configured to determine the word position labeling information corresponding to each Chinese character according to the following steps:

In one possible implementation, referring to fig. 4, the determining module 332 is further configured to determine the word position labeling information corresponding to each Chinese character according to the following steps:

Example IV

Based on the same inventive concept, referring to fig. 5, a schematic structural diagram of an electronic device 500 according to the fourth embodiment of the present application is shown. The electronic device 500 includes a processor 510, a memory 520 and a bus 530. The memory 520 stores machine-readable instructions executable by the processor 510. When the electronic device 500 is running, the processor 510 and the memory 520 communicate over the bus 530, and the machine-readable instructions are executed by the processor 510 to perform the steps of the text word segmentation method described in the first embodiment and/or the second embodiment.

Specifically, the machine-readable instructions, when executed by the processor 510, perform the following:

acquiring a Chinese text to be processed;

In the embodiment of the present application, the electronic device 500 obtains the Chinese text to be processed, segments the Chinese text into a plurality of Chinese short texts, and outputs the word-segmented Chinese text based on the plurality of Chinese short texts and a pre-trained Chinese word segmentation model. In this way, the length of the text to be processed can be reduced and interference from non-Chinese characters can be filtered out, which in turn improves the word segmentation efficiency for the Chinese text.

Example V

Based on the same inventive concept, the fifth embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the steps of the text word segmentation method provided in the first embodiment and/or the second embodiment are performed.

Specifically, the storage medium can be a general-purpose storage medium, such as a removable disk or a hard disk. When the computer program on the storage medium is run, the text word segmentation method can be executed, so that the length of the text to be processed can be reduced, interference from non-Chinese characters can be filtered out, and the word segmentation efficiency for the Chinese text can be improved.

It will be clear to those skilled in the art that, for convenience and brevity of description, the specific working procedures of the above-described system and apparatus may refer to the corresponding procedures in the foregoing method embodiments, which are not described herein again. In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. The above-described apparatus embodiments are merely illustrative. For example, the division of the units is merely a division by logical function, and other divisions are possible in actual implementation; multiple units or components may be combined or integrated into another system, and some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some communication interfaces, devices or units, and may be in electrical, mechanical or other form.

The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on this understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and comprises several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

The foregoing is merely illustrative of the present application, and the present application is not limited thereto; any person skilled in the art can readily conceive of variations or substitutions within the scope of the present application. Therefore, the protection scope of the application is subject to the protection scope of the claims.

Claims

1. A method for word segmentation of text, the method comprising:

acquiring a Chinese text to be processed;

inputting the feature vector of each Chinese character in the Chinese short text into a Transformer-based bidirectional encoder BERT of a Chinese word segmentation model to obtain a global information vector of each Chinese character in the Chinese short text;

inputting the global information vector of each Chinese character in the Chinese short text into a conditional random field (CRF), and determining the probability of each Chinese character in the Chinese short text being at a preset target position;

determining word position labeling information corresponding to each Chinese character in the Chinese short text according to the probability of each Chinese character in the Chinese short text being at the preset target position;

2. The word segmentation method according to claim 1, wherein the segmentation of the Chinese text into a plurality of Chinese short texts comprises:

3. The word segmentation method according to claim 1, wherein the feature extraction of each of the Chinese characters in the Chinese short text to obtain a feature vector of each of the Chinese characters includes:

4. The word segmentation method according to claim 1, wherein the determining word position labeling information corresponding to each of the Chinese characters in the Chinese short text according to the probability that each of the Chinese characters in the Chinese short text is at the preset target position includes:

5. The word segmentation method according to claim 1, wherein the determining word position labeling information corresponding to each of the Chinese characters in the Chinese short text according to the probability that each of the Chinese characters in the Chinese short text is at the preset target position includes:

6. The word segmentation method according to any one of claims 1, 4, and 5, characterized in that the preset target position includes:

7. A word segmentation apparatus for text, the word segmentation apparatus comprising:

the acquisition module is used for acquiring the Chinese text to be processed;

the output module is used for inputting the feature vector of each Chinese character in the Chinese short text into a Transformer-based bidirectional encoder BERT of a Chinese word segmentation model to obtain a global information vector of each Chinese character in the Chinese short text;

the output module comprises a determining module; the determining module is used for inputting the global information vector of each Chinese character in the Chinese short text into a conditional random field CRF and determining the probability of each Chinese character in the Chinese short text being at a preset target position; and determining word position labeling information corresponding to each Chinese character in the Chinese short text according to the probability of each Chinese character in the Chinese short text being at the preset target position;

8. The word segmentation device according to claim 7, wherein the segmentation module is specifically configured to segment the Chinese text into a plurality of Chinese short texts according to the following steps:

9. The word segmentation device according to claim 7, wherein the extracting module is specifically configured to extract the feature vector of each Chinese character according to the following steps:

10. The word segmentation device according to claim 7, wherein the determining module is further configured to determine word position labeling information corresponding to each Chinese character according to:

11. The word segmentation device according to claim 7, wherein the determining module is further configured to determine word position labeling information corresponding to each Chinese character according to:

12. The word segmentation device according to any one of claims 7, 10, 11, wherein the preset target location includes:

13. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory in communication over the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the steps of the method of word segmentation of text as claimed in any one of claims 1 to 6.

14. A computer-readable storage medium, characterized in that the computer-readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of the text word segmentation method according to any one of claims 1 to 6.