WO2020232881A1 - Text word segmentation method and apparatus

Info

Publication number: WO2020232881A1
Application number: PCT/CN2019/103069
Authority: WO
Inventors: 陈诗锦
Original assignee: 平安科技（深圳）有限公司
Priority date: 2019-05-20
Filing date: 2019-08-28
Publication date: 2020-11-26
Also published as: CN110222335A

Abstract

The embodiments of the present application disclose a text word segmentation method and apparatus. The method comprises: acquiring a text to be processed; performing word segmentation on the text to be processed along a first direction according to a string-matching word segmentation strategy to obtain a first word segmentation result; performing word segmentation on the text to be processed along a second direction according to the string-matching word segmentation strategy to obtain a second word segmentation result; and, if the first word segmentation result is consistent with the second word segmentation result, outputting the first word segmentation result or the second word segmentation result. The present application can improve the accuracy of word segmentation for the text to be processed.

Description

Text segmentation method and device

This application claims priority to Chinese patent application No. 201910423046.7, filed with the Chinese Patent Office on May 20, 2019 and entitled "a method and device for text segmentation", the entire content of which is incorporated into this application by reference.

Technical field

This application relates to the technical field of natural language processing, and in particular to a method and device for text segmentation.

Background

In an era when voice-interactive products are popular, speech recognition and natural language processing each play an important role. Speech recognition refers to decoding voice signals into text information; natural language processing refers to semantic analysis of that text information to obtain the user's request intention, so as to meet the user's functional needs. Chinese word segmentation is an important step in natural language understanding, and its accuracy directly affects the performance of human-computer interaction products.

Word segmentation refers to splitting a sentence into individual words, that is, recombining a continuous character sequence into a word sequence according to certain specifications. Taking Chinese word segmentation as an example, the goal is to split a sentence into individual Chinese words.

In the prior art, after the terminal obtains the user's voice information, it converts the voice information to obtain the text to be processed. The terminal then matches character strings in the text to be processed against a preset dictionary library according to a certain strategy. If an entry is found in the preset dictionary library, the match is successful; the entry is obtained, and the word segmentation result of the text to be processed can then be obtained. In practical applications, however, the segmentation results are not accurate enough because the segmentation process is coarse and arbitrary.

Specifically, the inaccuracy of the word segmentation results means the following: when the text to be processed is segmented according to a certain strategy, multiple segmentations are possible, and different segmentations produce different word segmentation results. Ideally, only one of these results is the best. Taking the text to be processed "Southern City Nanjing" as an example, the entries in the preset dictionary library include: South, Southern City, City, City, Nanjing. In this case, the terminal's word segmentation result for the text to be processed may be South City/City/Nanjing, or it may be South/City/Nanjing, whereas the best word segmentation result in the ideal case is South/City/Nanjing.

How to determine the best word segmentation result among multiple candidates, and thereby improve the accuracy of word segmentation for the text to be processed, is therefore an actively studied technical problem.

Summary of the invention

The embodiments of the present application provide a text word segmentation method and device, which can improve the accuracy of the terminal for word segmentation of the text to be processed.

In the first aspect, an embodiment of the present application provides a method for text segmentation, which includes:

Obtaining the text to be processed;

Performing word segmentation on the text to be processed along a first direction according to a string-matching word segmentation strategy to obtain a first word segmentation result;

Performing word segmentation on the text to be processed along a second direction according to the string-matching word segmentation strategy to obtain a second word segmentation result;

If the first word segmentation result is consistent with the second word segmentation result, outputting the first word segmentation result or the second word segmentation result.

In a second aspect, an embodiment of the present application provides a text word segmentation device. The text word segmentation device includes a unit for executing the method of the first aspect. Specifically, the text word segmentation device includes:

An obtaining unit, configured to obtain the text to be processed;

A first word segmentation unit, configured to perform word segmentation on the text to be processed along a first direction according to a string-matching word segmentation strategy to obtain a first word segmentation result;

A second word segmentation unit, configured to perform word segmentation on the text to be processed along a second direction according to the string-matching word segmentation strategy to obtain a second word segmentation result;

An output unit, configured to output the first word segmentation result or the second word segmentation result when the first word segmentation result is consistent with the second word segmentation result.

In a third aspect, an embodiment of the present application provides a terminal, including a processor configured to call stored program instructions to execute the method of the first aspect.

In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium that stores a computer program. The computer program includes program instructions that, when executed by a processor, cause the processor to execute the method of the first aspect described above.

By implementing the embodiments of this application, the terminal performs two word segmentation operations on the text to be processed instead of a single coarse segmentation, which avoids the randomness of coarse segmentation in the prior art and improves the accuracy of the terminal's word segmentation for the text to be processed.

Description of the drawings

FIG. 1 is a schematic flowchart of a method for text segmentation provided by an embodiment of the present application;

FIG. 2 is a schematic flowchart of a text word segmentation method provided by another embodiment of the present application;

FIG. 3A is a schematic diagram of multiple individual characters obtained after splitting the text to be processed according to an embodiment of the present application;

FIG. 3B is a schematic diagram of a directed acyclic graph provided by an embodiment of the present application;

FIG. 3C is a schematic diagram of another directed acyclic graph provided by an embodiment of the present application;

FIG. 4A is a schematic block diagram of a text word segmentation device provided by an embodiment of the present application;

FIG. 4B is a schematic block diagram of another text word segmentation device provided by an embodiment of the present application;

FIG. 5 is a schematic block diagram of a terminal according to another embodiment of the present application.

Detailed description

The embodiments of the present application will be described below in conjunction with the drawings.

In specific implementation, the terminals described in the embodiments of the present application include but are not limited to other portable devices such as mobile phones, laptop computers, or tablet computers with touch-sensitive surfaces (for example, touch screen displays and/or touch pads). It should also be understood that, in some embodiments, the device is not a portable communication device, but a desktop computer with a touch-sensitive surface (e.g., touch screen display and/or touch pad).

In the following discussion, a terminal including a display and a touch-sensitive surface is described. However, it should be understood that the terminal may include one or more other physical user interface devices such as a physical keyboard, mouse, and/or joystick.

The terminal supports various applications, such as one or more of the following: a drawing application, a presentation application, a word processing application, a website creation application, a disk burning application, a spreadsheet application, a game application, a telephone application, a video conferencing application, an email application, an instant messaging application, an exercise support application, a photo management application, a digital camera application, a web browsing application, a digital music player application, and/or a digital video player application.

Various application programs that can be executed on the terminal can use at least one common physical user interface device such as a touch-sensitive surface. One or more functions of the touch-sensitive surface and corresponding information displayed on the terminal can be adjusted and/or changed between applications and/or within corresponding applications. In this way, the common physical architecture of the terminal (for example, a touch-sensitive surface) can support various applications with a user interface that is intuitive and transparent to the user.

With reference to the schematic flowchart of the text word segmentation method shown in FIG. 1, the following describes how the embodiment of the application implements accurate word segmentation for the text. The method may include, but is not limited to, the following steps:

Step S100: Obtain the text to be processed.

In one of the implementation manners, the terminal obtains the to-be-processed text according to the voice signal of the speaking user. In this case, the terminal first obtains the voice signal of the speaking user, and then converts the obtained voice signal of the speaking user into text information, and obtains the text to be processed from the text information. For example, the terminal may use voice recognition technology to convert the voice signal of the speaking user into text information, and then obtain the text to be processed from the text information.

In another implementation manner, the terminal may directly receive the text information corresponding to the user's voice signal from the voice recognition device, and obtain the text to be processed from the text information.

In practical applications, the speaking users involved here may include users who speak and emit voice signals in a simultaneous translation scenario, and/or users who generate voice signals through a terminal; for example, the voice signal of the speaking user is received through a microphone or another voice collection device.

In another implementation manner of the present application, the terminal may obtain the text to be processed from text input by the user, for example, text entered by users in instant messaging, office documents, and other scenarios.

Exemplarily, the text to be processed may be "Beijing college students drink imported red wine", or "Nanjing, a southern city", etc., which is not specifically limited in the embodiment of the application.

Step S102: Perform word segmentation on the to-be-processed text along the first direction according to the word segmentation strategy for string matching to obtain a first word segmentation result.

In a specific implementation, performing word segmentation on the text to be processed along the first direction according to the string-matching word segmentation strategy to obtain the first word segmentation result includes:

Determining the first character of the text to be processed according to the first direction;

Taking the first character as the current character, and matching the candidate entry formed by the current character and its M adjacent characters against the entries in a preset dictionary library to obtain an entry beginning with the current character, thereby obtaining the first word segmentation result; where M is greater than or equal to 1 and less than or equal to Q, and Q is the number of characters in the text to be processed.

Taking the application scenario of college students' daily life as an example, the preset dictionary library may take forms including, but not limited to, the one shown in Table 1:

Table 1 Preset dictionary library

Entry	Weight
北京 (Beijing)	4
北京大学 (Peking University)	1
大学生 (college students)	5
进口 (import)	4
红酒 (red wine)	6

It should be noted that, in a specific implementation, the weight of an entry represents the probability of that entry occurring in a specific application scenario: the greater the weight, the greater the probability. Therefore, when the text to be processed admits several candidate segmentations, the entry with the higher weight is selected as the word segmentation result.

Taking the entries "Beijing" and "Peking University" as examples, when determining the word segmentation result of the text to be processed "Beijing college students drink imported red wine", the terminal prefers "Beijing" as the word segmentation result.
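The weight-based preference described above can be illustrated with a minimal sketch. The dictionary mirrors Table 1; the Chinese sentence is an assumed reconstruction of the example text from the Table 1 entries, and the function name `preferred_entry` is hypothetical rather than part of the application.

```python
# Hypothetical dictionary mirroring Table 1 (entry -> weight).
DICTIONARY = {"北京": 4, "北京大学": 1, "大学生": 5, "进口": 4, "红酒": 6}


def preferred_entry(text, start, dictionary):
    """Return the highest-weight dictionary entry beginning at `start`, or None."""
    candidates = [
        text[start:end]
        for end in range(start + 1, len(text) + 1)
        if text[start:end] in dictionary
    ]
    if not candidates:
        return None
    # Among competing entries, prefer the one with the larger weight.
    return max(candidates, key=dictionary.get)


# Assumed Chinese form of "Beijing college students drink imported red wine".
text = "北京大学生喝进口红酒"
print(preferred_entry(text, 0, DICTIONARY))
```

In this sketch both "北京" (weight 4) and "北京大学" (weight 1) begin at the first character, and the higher weight decides between them.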

In the embodiment of the present application, the preset dictionary library contains as much as possible all the entries that may appear in a specific application scenario. Through this implementation, it is possible to avoid the occurrence of unmatched word segmentation results.

In one of the embodiments, the entries in the preset dictionary library are arranged in the order of weight.

In practical applications, where the preset dictionary library takes the form shown in Table 1, the terminal can sort its entries by weight. For example, in one possible implementation, the terminal arranges the entries in descending order of weight; in another possible implementation, the terminal arranges them in ascending order of weight. For ease of explanation, Table 2 shows one form of the preset dictionary library provided by an embodiment of this application, in which the entries are arranged in descending order of weight.

Table 2 Preset dictionary library

Entry	Weight
红酒 (red wine)	6
大学生 (college students)	5
北京 (Beijing)	4
进口 (import)	4
北京大学 (Peking University)	1
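As a small illustration of the ordering step, the following sketch sorts the Table 1 entries by weight. The list layout and the function name `sort_by_weight` are assumptions made for this example only; the application does not prescribe a data structure.

```python
# Entries in the order of Table 1, as (entry, weight) pairs (assumed layout).
TABLE_1 = [("北京", 4), ("北京大学", 1), ("大学生", 5), ("进口", 4), ("红酒", 6)]


def sort_by_weight(entries, descending=True):
    """Return the entries ordered by weight; entries with equal weight keep their order."""
    return sorted(entries, key=lambda item: item[1], reverse=descending)


for entry, weight in sort_by_weight(TABLE_1):
    print(entry, weight)
```

Python's `sorted` is stable, so entries with equal weight (here "北京" and "进口") stay in their original relative order in both directions.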

As mentioned above, the weight of an entry represents the probability of that entry occurring in a specific application scenario: the greater the weight, the greater the probability. In this case, the terminal takes the first character as the current character and matches the candidate entry formed by the current character and its M adjacent characters only against those entries in the preset dictionary library whose weight is greater than a preset weight, to obtain the entry beginning with the current character. Take the text to be processed "Beijing college students drink imported red wine" as an example. For the first character "北", the preset dictionary library contains two entries beginning with "北", namely "Beijing" and "Peking University", where "Beijing" has a weight of 4 and "Peking University" has a weight of 1. Assuming that the preset weight is 3, when determining the entry corresponding to the character "北", the terminal directly matches the entry "Beijing" in the preset dictionary library, instead of matching "Peking University" and then "Beijing" (or "Beijing" and then "Peking University"). Because the terminal matches only against entries whose weight is greater than the preset weight, the word segmentation result of the text to be processed can be determined in less time, which improves the efficiency of word segmentation.
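The effect of the preset weight can be sketched as a filter applied during matching. This is an illustrative sketch only: the dictionary mirrors Table 1, the Chinese sentence is an assumed reconstruction of the example text, and `match_above_threshold` is a hypothetical name.

```python
# Hypothetical dictionary mirroring Table 1 (entry -> weight).
DICTIONARY = {"北京": 4, "北京大学": 1, "大学生": 5, "进口": 4, "红酒": 6}


def match_above_threshold(text, start, dictionary, preset_weight):
    """Return entries beginning at `start` whose weight is greater than `preset_weight`."""
    matches = []
    for end in range(start + 1, len(text) + 1):
        candidate = text[start:end]
        weight = dictionary.get(candidate)
        # Entries at or below the preset weight are never considered.
        if weight is not None and weight > preset_weight:
            matches.append(candidate)
    return matches


text = "北京大学生喝进口红酒"  # assumed Chinese form of the example sentence
print(match_above_threshold(text, 0, DICTIONARY, preset_weight=3))
```

With a preset weight of 3, only "北京" remains as a candidate for the first character, so no comparison with "北京大学" is needed.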

In the embodiment of the present application, the setting of the preset weight is diversified. For example, in different application scenarios, the setting of the preset weight may be different or the same, which is not specifically limited in the embodiment of the present application.

Furthermore, it should be noted that, in different application scenarios, the entries included in the aforementioned preset dictionary library are different, so that the blindness of the terminal in the matching process can be reduced.

For example, in a debt collection application scenario, the preset dictionary library may take the form shown in Table 3:

Table 3 Preset dictionary library for the collection application scenario

Entry	Weight
贷款 (loan)	6
借款 (borrowing)	3
金额 (amount)	2
欠 (owe)	4
期限 (term)	1

Exemplarily, the first direction may be from left to right or from right to left, which is not specifically limited in the embodiment of the present application. For ease of explanation, in the embodiments of the present application, the first direction is from left to right as an example for description.

In this case, the terminal determines that the first character of the text to be processed "Beijing college students drink imported red wine" is "北" and takes "北" as the current character. The terminal combines the current character with its M adjacent characters (for example, M=1) to form a candidate entry (for example, "Beijing") and then queries whether the entry exists in the preset dictionary library. If it does, the entry is determined as a word segmentation result. In practical applications, each character in the text to be processed can be taken as the current character in turn, and the first word segmentation result of the text to be processed is obtained by repeating the above operations (forming a candidate entry and matching it). For example, after the terminal segments the text "Beijing college students drink imported red wine" along the first direction in this way, the first word segmentation result obtained is: "Beijing", "college students", "drink", "red wine".
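A left-to-right pass of this kind might be sketched as follows. Several details are assumptions made for illustration and are not stated in the application: the Chinese sentence is reconstructed from the Table 1 entries, ties between candidate entries are resolved by weight, characters with no dictionary match are emitted on their own, and the names `segment_forward` and `max_len` are hypothetical.

```python
# Hypothetical dictionary mirroring Table 1 (entry -> weight).
DICTIONARY = {"北京": 4, "北京大学": 1, "大学生": 5, "进口": 4, "红酒": 6}


def segment_forward(text, dictionary, max_len=4):
    """Segment `text` from left to right by string matching against `dictionary`."""
    result = []
    pos = 0
    while pos < len(text):
        # Candidate entries formed by the current character and its neighbours.
        candidates = [
            text[pos:pos + length]
            for length in range(1, max_len + 1)
            if text[pos:pos + length] in dictionary
        ]
        # Assumption: prefer the higher weight; fall back to a single character.
        word = max(candidates, key=dictionary.get) if candidates else text[pos]
        result.append(word)
        pos += len(word)
    return result


print(segment_forward("北京大学生喝进口红酒", DICTIONARY))
```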

Step S104: Perform word segmentation on the to-be-processed text in a second direction according to the word segmentation strategy matched by the character string to obtain a second word segmentation result.

In a specific implementation, performing word segmentation on the text to be processed along the second direction according to the string-matching word segmentation strategy to obtain the second word segmentation result includes:

Determining the first character of the text to be processed according to the second direction;

Taking the first character as the current character, and matching the candidate entry formed by the current character and its M adjacent characters against the entries in a preset dictionary library to obtain an entry beginning with the current character, thereby obtaining the second word segmentation result; where M is greater than or equal to 1 and less than or equal to Q, and Q is the number of characters in the text to be processed.

In a specific implementation, the second direction involved here may be the same as the first direction or opposite to the first direction.

In one of the embodiments, the second direction is the same as the first direction. In this case, the same word segmentation operation is performed twice on the text to be processed, which can avoid randomness in determining the word segmentation result. The randomness here refers to the uncertainty in the terminal's matching process when the preset dictionary library contains multiple entries beginning with the current character. When the two word segmentation results obtained for a text to be processed according to the method described in this application are inconsistent, the terminal can obtain the word segmentation result of the text again using a dynamic programming algorithm.

As a preferred implementation, the second direction is opposite to the first direction. In this case, a backtracking operation is in effect performed on the text to be processed, and the word segmentation result is more accurate than when the first direction and the second direction are the same. The following takes the second direction being opposite to the first direction as an example:

As mentioned above, the first direction is from left to right; the second direction is then from right to left.

In this case, the terminal determines that the first character of the text to be processed "Beijing college students drink imported red wine" in the second direction is "酒" (wine) and takes it as the current character. The terminal combines the current character with its M adjacent characters (for example, M=1) to form a candidate entry (for example, "red wine") and then queries whether the entry exists in the preset dictionary library. If it does, the entry is determined as a word segmentation result. In practical applications, each character in the text to be processed can be taken as the current character in turn, and the second word segmentation result of the text to be processed is obtained by repeating the above operations (forming a candidate entry and matching it). For example, after the terminal segments the text "Beijing college students drink imported red wine" along the second direction in this way, the second word segmentation result may be: "Beijing", "college students", "drink", "red wine". As another example, the second word segmentation result may instead be: "Peking University", "Sheng", "drink", "red wine".
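The right-to-left pass can be sketched in the same way. As before, this is only an illustration under stated assumptions: the Chinese sentence is reconstructed from the Table 1 entries, competing candidates are resolved by weight, unmatched characters are emitted on their own, and `segment_backward` and `max_len` are hypothetical names.

```python
# Hypothetical dictionary mirroring Table 1 (entry -> weight).
DICTIONARY = {"北京": 4, "北京大学": 1, "大学生": 5, "进口": 4, "红酒": 6}


def segment_backward(text, dictionary, max_len=4):
    """Segment `text` from right to left by string matching against `dictionary`."""
    result = []
    end = len(text)
    while end > 0:
        # Candidate entries that finish at the current character.
        candidates = [
            text[end - length:end]
            for length in range(1, min(max_len, end) + 1)
            if text[end - length:end] in dictionary
        ]
        # Assumption: prefer the higher weight; fall back to a single character.
        word = max(candidates, key=dictionary.get) if candidates else text[end - 1]
        result.append(word)
        end -= len(word)
    result.reverse()  # restore reading order
    return result


print(segment_backward("北京大学生喝进口红酒", DICTIONARY))
```

Which of the two second-direction results mentioned above is obtained depends on the dictionary contents and on how competing candidates are resolved; this sketch fixes one such rule for concreteness.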

Step S106: If the first word segmentation result is consistent with the second word segmentation result, output the first word segmentation result or the second word segmentation result.

In a specific implementation, the terminal may determine whether the first word segmentation result is consistent with the second word segmentation result by comparing them entry by entry.

For example, the terminal segments the text to be processed "Beijing college students drink imported red wine" along the first direction according to the string-matching word segmentation strategy, and the first word segmentation result obtained is: "Beijing", "college students", "drink", "red wine". The terminal then segments the same text along the second direction according to the string-matching word segmentation strategy, and the second word segmentation result obtained is: "Beijing", "college students", "drink", "red wine". After obtaining the two results, the terminal compares them entry by entry and determines that the first word segmentation result is consistent with the second word segmentation result. In this case, the terminal can output either the first word segmentation result or the second word segmentation result.
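The entry-by-entry comparison can be sketched as below. The function names are hypothetical, and returning `None` for an inconsistent pair is an assumption made here to signal that a further step (such as the dynamic programming fallback of step S108) would be needed.

```python
def is_consistent(first, second):
    """Compare two word segmentation results entry by entry."""
    return len(first) == len(second) and all(a == b for a, b in zip(first, second))


def select_output(first, second):
    """Return a result to output when both passes agree, otherwise None."""
    return first if is_consistent(first, second) else None


first = ["Beijing", "college students", "drink", "red wine"]
second_same = ["Beijing", "college students", "drink", "red wine"]
second_diff = ["Peking University", "Sheng", "drink", "red wine"]

print(select_output(first, second_same))  # the two passes agree
print(select_output(first, second_diff))  # ambiguity: no direct output
```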

It is understandable that after the correct word segmentation result is obtained for the text to be processed, the terminal outputs the word segmentation result, which means that the terminal can better understand the sentence meaning of the speaking user.

In the embodiment of this application, the word segmentation result may be output in the following forms: the terminal displays the word segmentation result of the text to be processed on the display screen, or the terminal outputs the word segmentation result of the text to be processed by voice broadcast. During the voice broadcast, there is a pause between word segmentation results so that users can better understand them.

Taking the collection application scenario as an example, the terminal can better determine the economic status of the speaking user based on the word segmentation result (for example, whether the speaking user can or cannot repay the arrears). After learning the user's economic status, the collector can make a reasonable decision accordingly to improve the collection effect.

By implementing the embodiments of this application, the terminal performs two word segmentation operations on the text to be processed instead of a single coarse segmentation, which avoids the randomness of coarse segmentation in the prior art and improves the accuracy of the terminal's word segmentation for the text to be processed.

It should be noted that, in actual applications, within the same application scenario there is often more than one text to be processed. Where there are multiple texts to be processed, for example a first text to be processed and a second text to be processed, the terminal may segment the second text to be processed based on the word segmentation result of the first text to be processed; that is, the terminal takes the context into account when segmenting, so as to improve the accuracy of word segmentation. In a specific implementation, the terminal can use a deep learning algorithm to determine the word segmentation result of the second text to be processed. Specifically, this may include: training a model on the word segmentation result of the first text to be processed to obtain a trained deep learning model; and then processing the second text to be processed with the trained deep learning model to obtain its word segmentation result.

In the embodiment of the present application, the deep learning algorithm may include, but is not limited to, a deep neural network (DNN) model, a long short-term memory (LSTM) network model, and so on.

Take the long short-term memory network model as an example. The LSTM model uses input gates, output gates, forget gates, and a cell structure to control the learning and forgetting of historical information, making the model suitable for processing long sequences. In the embodiment of this application, the model determines the word segmentation result of the next text to be processed based on the word segmentation results of N historical texts to be processed in the same application scenario, where N is a positive integer. Entries determined by matching in the N consecutive historical texts to be processed are also applicable to the matching of entries in the next text to be processed. This implementation can improve the accuracy and efficiency of word segmentation.

Optionally, after performing step S106, the terminal may also perform step S108. The following describes, with reference to the text word segmentation method shown in FIG. 2, how the embodiment of this application implements word segmentation for the text to be processed. Step S108 is elaborated next:

Step S108: If the first word segmentation result is inconsistent with the second word segmentation result, perform word segmentation on the text to be processed through a dynamic programming algorithm to obtain a third word segmentation result.

For example, the terminal segments the text to be processed "Beijing college students drink imported red wine" along the first direction according to the string-matching word segmentation strategy, and the first word segmentation result obtained is: "Beijing", "college students", "drink", "red wine". The terminal then segments the same text along the second direction according to the string-matching word segmentation strategy, and the second word segmentation result obtained is: "Peking University", "Sheng", "drink", "red wine". The terminal compares the two results entry by entry and determines that the first word segmentation result is inconsistent with the second word segmentation result, which means that there is an ambiguous field. The terminal then uses a dynamic programming algorithm to segment the text to be processed "Beijing college students drink imported red wine" and obtain the third word segmentation result.

In a specific implementation, performing word segmentation on the text to be processed through a dynamic programming algorithm to obtain the third word segmentation result includes:

Splitting the text to be processed to obtain multiple individual characters;

Constructing a directed acyclic graph according to the relevance of adjacent characters among the multiple individual characters, where the directed acyclic graph includes multiple paths, and each of the multiple paths includes entries and the weights corresponding to the entries;

Determining the weight sum of all entries on each path in the directed acyclic graph;

Determining the entries on the path with the smallest weight sum as the third word segmentation result.

Taking the text to be processed "Beijing college students drink imported red wine" as an example, the terminal splits the text to be processed to obtain multiple individual characters as shown in FIG. 3A, where each character can represent a node.

After that, the terminal constructs a directed acyclic graph according to the relevance of adjacent characters among the multiple individual characters. The relevance of adjacent characters mentioned here means that two adjacent characters can form an entry. Taking the character "北" as an example, the entries that can be composed of the character "北" are: "Beijing", "Peking University", and "Beijing University Student".

For example, the directed acyclic graph constructed by the terminal on the text to be processed by the dynamic programming algorithm may be as shown in Table 3B. As shown in Table 3B, the directed acyclic graph includes multiple paths as shown below, and each path includes an entry and the corresponding weight of the entry:

The entries included in path 1 are: Beijing (4) - college students (5) - drink (5) - import (4) - red wine (6);

The entries included in path 2 are: Peking University (4) - Sheng (6) - drink (5) - import (4) - red wine (6);

The entries included in path 3 are: Beijing (4) - college students (5) - drink (5) - enter (2) - lipstick (8) - wine (2).

After obtaining multiple paths, the terminal determines the weight sum of all entries on each path.

Taking the foregoing path 1 as an example, the terminal determines that the sum of the weights of all entries on the path 1 is 4+5+5+4+6=24.

Using the same calculation method described above, the terminal determines that the sum of the weights of all the entries on the path 2 is 25; the terminal determines that the sum of the weights of all the entries on the path 3 is 26.

After the terminal determines the weight sum of all the entries on each path in the directed acyclic graph, the terminal determines the entry on the path with the smallest weight sum as the third word segmentation result.

For example, the terminal compares the weight sums of path 1, path 2, and path 3 in turn, and determines that the weight sum of path 1 is the smallest of the three. In this case, the terminal determines the entries on path 1 as the third word segmentation result; that is, the third word segmentation result for the above text to be processed, "Beijing college students drink imported red wine", is: "Beijing", "college students", "drink", "red wine".
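The minimum weight sum selection can be sketched as a standard shortest-path dynamic program over character positions. This is an illustration under assumptions, not the method as claimed: the Chinese strings are reconstructed from the translated example, the weights are those quoted for paths 1 to 3, and the name `min_weight_segmentation` is hypothetical.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical weighted dictionary; weights taken from paths 1 to 3.
WEIGHTS: Dict[str, int] = {
    "北京": 4, "北京大学": 4, "大学生": 5, "生": 6, "喝": 5,
    "进口": 4, "红酒": 6, "进": 2, "口红": 8, "酒": 2,
}


def min_weight_segmentation(text: str,
                            dictionary: Dict[str, int]) -> Tuple[int, List[str]]:
    """Return (smallest weight sum, entries on that path)."""
    n = len(text)
    inf = float("inf")
    best: List[float] = [inf] * (n + 1)   # best[i]: smallest sum covering text[:i]
    back: List[Optional[int]] = [None] * (n + 1)  # start index of the last entry
    best[0] = 0
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if piece in dictionary and best[start] + dictionary[piece] < best[end]:
                best[end] = best[start] + dictionary[piece]
                back[end] = start
    if best[n] == inf:
        raise ValueError("no path of dictionary entries covers the whole text")
    # Walk the back-pointers from the end of the text to recover the entries.
    entries: List[str] = []
    end = n
    while end > 0:
        start = back[end]
        entries.append(text[start:end])
        end = start
    entries.reverse()
    return int(best[n]), entries


total, entries = min_weight_segmentation("北京大学生喝进口红酒", WEIGHTS)
print(total, entries)
```

With these weights the sketch selects the entries of path 1 with a weight sum of 24, matching the sums of 24, 25, and 26 computed above. The dynamic program avoids enumerating every path explicitly, since each position keeps only the best partial sum reaching it.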

By implementing the embodiments of this application, when the first word segmentation result is inconsistent with the second word segmentation result, this indicates that there is an ambiguous field. In this case, the terminal determines the word segmentation result of the text to be processed through the dynamic programming algorithm and the minimum path principle, which can avoid the ambiguous field and thereby improve the accuracy of the terminal's word segmentation of the text to be processed.

In practical applications, the first word segmentation result is all the entries on the first path in the directed acyclic graph obtained after the text to be processed is segmented through the dynamic programming algorithm, and the second word segmentation result is all the entries on the second path in the directed acyclic graph;

Segmenting the text to be processed through the dynamic programming algorithm to obtain the third word segmentation result then includes:

Determining the weight sum of all entries on the first path and the weight sum of all entries on the second path respectively;

If the weight sum of all entries on the first path is less than the weight sum of all entries on the second path, determining the first word segmentation result as the third word segmentation result;

If not, determining the second word segmentation result as the third word segmentation result.

For example, the terminal uses a dynamic programming algorithm to construct a directed acyclic graph of the text to be processed, as shown in FIG. 3C. The terminal determines that the first word segmentation result is all the entries on path 1 in the directed acyclic graph, and that the second word segmentation result is all the entries on path 2 in the directed acyclic graph. In this case, the terminal only needs to choose the third word segmentation result from the first word segmentation result and the second word segmentation result, which can improve the efficiency of the terminal's word segmentation of the text to be processed.

For example, the terminal calculates the weight sum of path 1 in FIG. 3C as: 24; the terminal calculates the weight sum of path 2 in FIG. 3C as: 25.

The terminal determines that the weight sum of the first word segmentation result is less than the weight sum of the second word segmentation result, and therefore outputs the first word segmentation result. That is, the terminal determines that the word segmentation result of the text to be processed is: "Beijing", "college students", "drink", "red wine".
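The two-path comparison can be sketched as below. The function name `pick_by_weight_sum` is hypothetical, and the English entry labels and weights are copied from the path 1 and path 2 example in this description; the sketch follows the rule that the second result is chosen whenever the first sum is not smaller.

```python
from typing import Dict, List

# Weights quoted for the entries on path 1 and path 2.
WEIGHTS: Dict[str, int] = {
    "Beijing": 4, "college students": 5, "drink": 5, "import": 4,
    "red wine": 6, "Peking University": 4, "Sheng": 6,
}


def pick_by_weight_sum(first: List[str], second: List[str],
                       weights: Dict[str, int]) -> List[str]:
    """Choose the candidate path whose entries have the smaller weight sum."""
    first_sum = sum(weights[entry] for entry in first)
    second_sum = sum(weights[entry] for entry in second)
    # "If not" branch: ties and larger sums both fall through to the second path.
    return first if first_sum < second_sum else second


path_1 = ["Beijing", "college students", "drink", "import", "red wine"]
path_2 = ["Peking University", "Sheng", "drink", "import", "red wine"]
chosen = pick_by_weight_sum(path_1, path_2, WEIGHTS)
print(chosen)
```

Only two sums are computed here, rather than one per path in the graph, which is the source of the efficiency gain described for this variant.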

By implementing this application, while improving the accuracy of the terminal's word segmentation for the text to be processed, the terminal's word segmentation efficiency for the text to be processed can also be improved.

In order to facilitate better implementation of the above solutions of the embodiments of this application, this application also provides a corresponding text segmentation device, which will be described in detail below with reference to the accompanying drawings:

FIG. 4A is a schematic structural diagram of a text word segmentation device provided by an embodiment of the present application. As shown in FIG. 4A, the text word segmentation device 40 may include: an obtaining unit 400, a first word segmentation unit 402, a second word segmentation unit 404, and an output unit 406;

Wherein, the obtaining unit 400 is used to obtain the text to be processed;

The first word segmentation unit 402 is configured to segment the text to be processed along the first direction according to the character string matching word segmentation strategy to obtain the first word segmentation result;

The second word segmentation unit 404 is configured to segment the text to be processed along the second direction according to the character string matching word segmentation strategy to obtain the second word segmentation result;

The output unit 406 is configured to output the first word segmentation result or the second word segmentation result when the first word segmentation result is consistent with the second word segmentation result.

Optionally, as shown in FIG. 4B, the text word segmentation device 40 further includes: a third word segmentation unit 408;

The third word segmentation unit 408 is configured to perform word segmentation on the to-be-processed text through a dynamic programming algorithm when the first word segmentation result is inconsistent with the second word segmentation result to obtain a third word segmentation result.

Wherein, the third word segmentation unit 408 includes: a splitting unit, a construction unit, a first determining unit, and a second determining unit; wherein,

The splitting unit is configured to split the to-be-processed text to obtain multiple individual characters;

The construction unit is configured to construct a directed acyclic graph according to the relevance of adjacent characters among the multiple individual characters, wherein the directed acyclic graph includes multiple paths, and each of the multiple paths includes entries and the weight corresponding to each entry;

The first determining unit is configured to determine the weight sum of all entries on each path in the directed acyclic graph;

The second determining unit is configured to determine the entries on the path with the smallest weight sum as the third word segmentation result.

Optionally, the first word segmentation result is all the entries on the first path in the directed acyclic graph obtained after the text to be processed is segmented through a dynamic programming algorithm, and the second word segmentation result is all the entries on the second path in the directed acyclic graph;

The third word segmentation unit 408 includes: a third determination unit and a fourth determination unit; wherein,

The third determining unit is configured to determine the weight sum of all entries on the first path and the weight sum of all entries on the second path respectively;

The fourth determining unit is configured to determine the first word segmentation result as the third word segmentation result when the weight sum of all entries on the first path is less than the weight sum of all entries on the second path;

The fourth determining unit is further configured to determine the second word segmentation result as the third word segmentation result when the weight sum of all entries on the first path is greater than the weight sum of all entries on the second path.

Wherein, the first word segmentation unit 402 includes: a fifth determination unit and a matching unit;

Wherein, the fifth determining unit is configured to determine the first character of the text to be processed according to the first direction;

The matching unit is configured to take the first character as the current character, and to match entries composed of the current character and the M characters adjacent to it against the entries in a preset dictionary library to obtain the entry beginning with the current character, so as to obtain the first word segmentation result; where M is greater than or equal to 1 and less than or equal to Q, and Q is the number of characters in the text to be processed.

Optionally, the entries in the preset dictionary library are arranged in the order of weight; the matching unit is specifically configured to:

The first character is taken as the current character, and entries composed of the current character and the M characters adjacent to it are matched against those entries in the preset dictionary library whose weight is greater than a preset weight, to obtain the entry beginning with the current character, so as to obtain the first word segmentation result.
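A minimal sketch of such dictionary matching along one direction is shown below. Several points are assumptions rather than statements of this application: candidates are tried longest first (classic maximum matching), a single character is emitted when nothing matches, the Chinese strings are reconstructed from the translated example, and the names `forward_match` and `min_weight` are hypothetical. The weights are illustrative numbers only.

```python
from typing import Dict, List

# Hypothetical weighted dictionary (illustrative values).
WEIGHTS: Dict[str, int] = {
    "北京": 4, "北京大学": 4, "大学生": 5, "生": 6, "喝": 5,
    "进口": 4, "红酒": 6, "进": 2, "口红": 8, "酒": 2,
}


def forward_match(text: str, dictionary: Dict[str, int],
                  min_weight: int = 0) -> List[str]:
    """Greedy longest-first matching, starting from the first character."""
    # Only entries whose weight exceeds the preset weight are candidates.
    candidates = {w for w, weight in dictionary.items() if weight > min_weight}
    max_len = max((len(w) for w in candidates), default=1)
    result: List[str] = []
    i = 0
    while i < len(text):
        # Try the current character plus its following neighbours, longest first.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in candidates:  # single character as fallback
                result.append(piece)
                i += size
                break
    return result


print(forward_match("北京大学生喝进口红酒", WEIGHTS))
```

Filtering the dictionary by weight before matching shrinks the candidate set, which is one plausible reading of why restricting matching to entries above a preset weight would reduce matching time.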

Optionally, the second direction is the opposite direction of the first direction.
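Matching in the opposite direction can be sketched in the same way, scanning from the last character toward the first. As before, this is an illustration under assumptions: longest-first matching, single-character fallback, Chinese strings reconstructed from the translated example, and the hypothetical name `backward_match`. Which of the two directions this application calls the "first direction" is not assumed here.

```python
from typing import List, Set

# Hypothetical dictionary entries reconstructed from the translated example.
ENTRIES: Set[str] = {
    "北京", "北京大学", "大学生", "生", "喝", "进口", "红酒", "进", "口红", "酒",
}


def backward_match(text: str, dictionary: Set[str]) -> List[str]:
    """Greedy longest-first matching, starting from the last character."""
    max_len = max((len(w) for w in dictionary), default=1)
    result: List[str] = []
    end = len(text)
    while end > 0:
        # Try the longest candidate that ends at position `end` first.
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            if size == 1 or piece in dictionary:  # single character as fallback
                result.append(piece)
                end -= size
                break
    result.reverse()  # entries were collected from right to left
    return result


print(backward_match("北京大学生喝进口红酒", ENTRIES))
```

On this sample the right-to-left scan yields a different split of the opening characters than a left-to-right scan over the same dictionary would, which is exactly the kind of disagreement that signals an ambiguous field.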

Optionally, there are at least two texts to be processed, and the at least two texts to be processed belong to the same application scenario, wherein the at least two texts to be processed include a first text to be processed and a second text to be processed. The text word segmentation device 40 may further include:

The processing unit 4010 is configured to, after obtaining the word segmentation result of the first text to be processed, determine the word segmentation result of the second text to be processed according to the word segmentation result of the first text to be processed.

In order to facilitate better implementation of the above-mentioned solutions of the embodiments of this application, this application also provides another terminal correspondingly, which will be described in detail below with reference to the accompanying drawings:

FIG. 5 shows a schematic structural diagram of a terminal provided in an embodiment of the present application. The terminal 50 may include a processor 501, a memory 504, and a communication module 505. The processor 501, the memory 504, and the communication module 505 may be connected to each other through a bus 506. The memory 504 may be a high-speed random access memory (RAM), or a non-volatile memory, such as at least one disk memory. Optionally, the memory 504 may also be at least one storage system located remotely from the aforementioned processor 501. The memory 504 is used to store application program code, which may include an operating system, a network communication module, a user interface module, and a data processing program. The communication module 505 is used to exchange information with external devices; the processor 501 is configured to call the program code to perform the following steps:

Get the text to be processed;

Among them, the processor 501 may also be used for:

If the first word segmentation result is inconsistent with the second word segmentation result, a dynamic programming algorithm is used to segment the text to be processed to obtain a third word segmentation result.

Wherein, the processor 501 performs word segmentation on the to-be-processed text through a dynamic programming algorithm to obtain the third word segmentation result, which may include:

Split the to-be-processed text to obtain multiple individual characters;

Wherein, the first word segmentation result is all the entries on the first path in the directed acyclic graph obtained after the text to be processed is segmented through the dynamic programming algorithm, and the second word segmentation result is all the entries on the second path in the directed acyclic graph;

The processor 501 performs word segmentation on the to-be-processed text through a dynamic programming algorithm to obtain a third word segmentation result, which may include:

Wherein, the processor 501 performs word segmentation on the to-be-processed text in the first direction according to the word segmentation strategy for string matching to obtain the first word segmentation result, which may include:

Wherein, the entries in the preset dictionary library are arranged in order of weight; the processor 501 taking the first character as the current character, and matching entries composed of the current character and the M characters adjacent to it against the entries in the preset dictionary library to obtain the entry beginning with the current character, so as to obtain the first word segmentation result, includes:

Wherein, the second direction is the opposite direction of the first direction.

Wherein, there are at least two texts to be processed, and the at least two texts to be processed belong to the same application scenario, wherein the at least two texts to be processed include a first text to be processed and a second text to be processed. The processor 501 may also be used for:

After the word segmentation result of the first text to be processed is obtained, the word segmentation result of the second text to be processed is determined according to the word segmentation result of the first text to be processed.

It should be noted that the execution steps of the processor in the terminal 50 in the embodiment of the present application can refer to the specific implementation of the terminal operation in the embodiments of FIG. 1 to FIG. 2 in the foregoing method embodiments, and details are not described herein again.

In a specific implementation, the terminal 50 may include various devices that can be used by users, such as a mobile phone, a tablet computer, a personal digital assistant (PDA), a mobile Internet device (MID), or a smart wearable device (such as a smart watch or smart bracelet), which are not specifically limited in the embodiments of this application.

The embodiment of the present application also provides a computer-readable storage medium for storing computer software instructions used by the terminal shown in FIG. 1 to FIG. 2, which includes a program for executing the above method embodiment. By executing the stored program, accurate word segmentation of the text to be processed can be achieved.

An embodiment of the present application also provides a computer program, the computer program includes program instructions that, when executed by a processor, cause the processor to perform the method of the first aspect (FIG. 1 to FIG. 2).

Those skilled in the art should understand that the embodiments of the present application can be provided as methods, systems, or computer program products. Therefore, the present application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, optical storage, etc.) containing computer-usable program codes.

This application is described with reference to flowcharts and/or block diagrams of methods, equipment (systems), and computer program products according to the embodiments of this application. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment produce a device that realizes the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.

These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.

These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, and the instructions executed on the computer or other programmable equipment thereby provide steps for implementing the functions specified in one or more processes in the flowchart and/or one or more blocks in the block diagram.

Obviously, those skilled in the art can make various changes and modifications to the application without departing from the spirit and scope of the application. In this way, if these modifications and variations of this application fall within the scope of the claims of this application and their equivalent technologies, this application also intends to include these modifications and variations.

Claims

A method for text segmentation, characterized in that it includes:

Get the text to be processed;

Perform word segmentation on the text to be processed along a first direction according to a character string matching word segmentation strategy to obtain a first word segmentation result;

Perform word segmentation on the text to be processed along a second direction according to the character string matching word segmentation strategy to obtain a second word segmentation result;

If the first word segmentation result is consistent with the second word segmentation result, output the first word segmentation result or the second word segmentation result.
The method of claim 1, wherein the method further comprises:

If the first word segmentation result is inconsistent with the second word segmentation result, a dynamic programming algorithm is used to segment the text to be processed to obtain a third word segmentation result.
The method according to claim 2, wherein said segmenting the text to be processed through a dynamic programming algorithm to obtain a third segmentation result comprises:

Split the to-be-processed text to obtain multiple individual characters;

A directed acyclic graph is constructed according to the relevance of adjacent characters among the multiple individual characters; wherein the directed acyclic graph includes multiple paths, and each of the multiple paths includes entries and the weight corresponding to each entry;

Determining the weight sum of all entries on each path in the directed acyclic graph;

Determining the entries on the path with the smallest weight sum as the third word segmentation result.
The method according to claim 2, wherein the first word segmentation result is all the entries on the first path in the directed acyclic graph obtained after the text to be processed is segmented through a dynamic programming algorithm, and the second word segmentation result is all the entries on the second path in the directed acyclic graph;

The segmentation of the text to be processed by the dynamic programming algorithm to obtain the third segmentation result includes:

Respectively determine the weight sum of all the entries on the first path and the weight sum of all the entries on the second path;

If the weight sum of all entries on the first path is less than the weight sum of all entries on the second path, the first word segmentation result is determined as the third word segmentation result.
The method according to claim 4, wherein the method further comprises:

If the weight sum of all entries on the first path is greater than the weight sum of all entries on the second path, the second word segmentation result is determined as the third word segmentation result.
The method according to claim 1, wherein the segmenting the to-be-processed text along the first direction according to the word segmentation strategy of character string matching to obtain the first segmentation result comprises:

Determine the first character of the text to be processed according to the first direction;

The first character is taken as the current character, and entries composed of the current character and the M characters adjacent to it are matched against the entries in a preset dictionary library to obtain the entry beginning with the current character, so as to obtain the first word segmentation result; where M is greater than or equal to 1 and less than or equal to Q, and Q is the number of characters in the text to be processed.
The method according to claim 6, wherein the entries in the preset dictionary library are arranged in order of weight; and wherein taking the first character as the current character, and matching entries composed of the current character and the M characters adjacent to it against the entries in the preset dictionary library to obtain the entry beginning with the current character, so as to obtain the first word segmentation result, includes:

The first character is taken as the current character, and entries composed of the current character and the M characters adjacent to it are matched against those entries in the preset dictionary library whose weight is greater than a preset weight, to obtain the entry beginning with the current character, so as to obtain the first word segmentation result.
The method according to any one of claims 1-7, wherein the second direction is the opposite direction of the first direction.
The method according to claim 1, wherein there are at least two texts to be processed, and the at least two texts to be processed belong to the same application scenario, wherein the at least two texts to be processed comprise a first text to be processed and a second text to be processed, and the method further includes:

After the word segmentation result of the first text to be processed is obtained, the word segmentation result of the second text to be processed is determined according to the word segmentation result of the first text to be processed.
A text word segmentation device, characterized in that it comprises:

The obtaining unit is used to obtain the text to be processed;

The first word segmentation unit is configured to segment the text to be processed along a first direction according to a character string matching word segmentation strategy to obtain a first word segmentation result;

The second word segmentation unit is configured to segment the text to be processed along a second direction according to the character string matching word segmentation strategy to obtain a second word segmentation result;

The output unit is configured to output the first word segmentation result or the second word segmentation result when the first word segmentation result is consistent with the second word segmentation result.
A terminal, characterized by comprising a processor configured to call stored program instructions, wherein:

The processor is used to obtain the text to be processed;

The processor is further configured to perform word segmentation on the text to be processed along a first direction according to a character string matching word segmentation strategy to obtain a first word segmentation result;

The processor is further configured to perform word segmentation on the text to be processed along a second direction according to the character string matching word segmentation strategy to obtain a second word segmentation result;

The processor is further configured to output the first word segmentation result or the second word segmentation result when the first word segmentation result is consistent with the second word segmentation result.
The terminal according to claim 11, wherein the processor is further configured to:

When the first word segmentation result is inconsistent with the second word segmentation result, the text to be processed is segmented through a dynamic programming algorithm to obtain a third word segmentation result.
The terminal according to claim 12, wherein the processor is specifically configured to:

Split the to-be-processed text to obtain multiple individual characters;

A directed acyclic graph is constructed according to the relevance of adjacent characters among the multiple individual characters; wherein the directed acyclic graph includes multiple paths, and each of the multiple paths includes entries and the weight corresponding to each entry;

Determining the weight sum of all entries on each path in the directed acyclic graph;

Determining the entries on the path with the smallest weight sum as the third word segmentation result.
The terminal according to claim 12, wherein the first word segmentation result is all the entries on the first path in the directed acyclic graph obtained after the text to be processed is segmented through a dynamic programming algorithm, and the second word segmentation result is all the entries on the second path in the directed acyclic graph; the processor is also specifically configured to:

Respectively determine the weight sum of all the entries on the first path and the weight sum of all the entries on the second path;

In the case that the weight sum of all the entries on the first path is less than the weight sum of all the entries on the second path, the first word segmentation result is determined as the third word segmentation result.
The terminal according to claim 14, wherein the processor is further configured to:

In the case that the weight sum of all entries on the first path is greater than the weight sum of all entries on the second path, the second word segmentation result is determined as the third word segmentation result.
The terminal according to claim 11, wherein the processor is specifically configured to:

Determine the first character of the text to be processed according to the first direction;

The first character is taken as the current character, and entries composed of the current character and the M characters adjacent to it are matched against the entries in a preset dictionary library to obtain the entry beginning with the current character, so as to obtain the first word segmentation result; where M is greater than or equal to 1 and less than or equal to Q, and Q is the number of characters in the text to be processed.
The terminal according to claim 16, wherein the entries in the preset dictionary library are arranged in order of weight; the processor is specifically configured to:

The first character is taken as the current character, and entries composed of the current character and the M characters adjacent to it are matched against those entries in the preset dictionary library whose weight is greater than a preset weight, to obtain the entry beginning with the current character, so as to obtain the first word segmentation result.
The terminal according to any one of claims 11-17, wherein the second direction is a direction opposite to the first direction.
The terminal according to claim 11, wherein there are at least two texts to be processed, and the at least two texts to be processed belong to the same application scenario, wherein the at least two texts to be processed include a first text to be processed and a second text to be processed, and the processor is further configured to:

After the word segmentation result of the first text to be processed is obtained, the word segmentation result of the second text to be processed is determined according to the word segmentation result of the first text to be processed.
A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and the computer program includes program instructions that, when executed by a processor, cause the processor to execute the method according to any one of claims 1-9.