CN110945514A

CN110945514A - System and method for segmenting sentences

Info

Publication number: CN110945514A
Application number: CN201780093452.0A
Authority: CN
Inventors: 白洁; 李秀林
Original assignee: Beijing Didi Infinity Technology and Development Co Ltd
Current assignee: Beijing Didi Infinity Technology and Development Co Ltd
Priority date: 2017-07-31
Filing date: 2017-07-31
Publication date: 2020-03-31
Anticipated expiration: 2037-07-31
Also published as: US11132506B2; TW201911289A; WO2019023893A1; TWI676167B; US20200160000A1; EP3642733A1; CN110945514B; EP3642733A4

Abstract

Embodiments of the present application provide systems and methods for segmenting sentences. The method may include identifying a first phrase in a sentence that is related to a first segmentation path, determining a first set of derived phrases that are semantically related to the first phrase, determining a first evaluation score based on a modified sentence generated by replacing the first phrase with the corresponding derived phrases in the first set, and segmenting the sentence based on the first evaluation score.

Description

System and method for segmenting sentences

Technical Field

The present application relates to text-to-speech (TTS) techniques, and more particularly, to segmenting text sentences.

Background

Text-to-speech (TTS) technology can convert textual information into audio signals. For example, in a navigation application (e.g., a DiDi app), text information such as traffic conditions, addresses, etc. may be presented to the user by voice. Books and news can also be read to the user by TTS technology.

In order to be read in a natural way, a piece of text (e.g., a sentence) must be correctly segmented before being converted into an audio signal. Typically, each phrase in a sentence contains one or more words. Consistent with the present application, a word may be a word in a Latin-script language such as English, French, or Spanish, or a character in an Asian language such as Chinese, Korean, or Japanese. These words or characters may be segmented into various possible combinations of phrases. In example I, the sentence "The man over there is watching TV" can be segmented according to the following two segmentation paths:

First segmentation path: "The man/over/there is/watching TV".

Second segmentation path: "The man/over there/is/watching TV".

When generating an audio signal based on the first segmentation path, the transcription may not make sense to the user. However, conventional segmentation systems and methods cannot determine which segmentation path is better because each phrase in the two segmentation paths seems to be linguistically reasonable.

To address the above-mentioned problems, embodiments of the present application provide improved systems and methods for segmenting sentences.

Disclosure of Invention

One aspect of the present application relates to a method for segmenting a sentence. The method may include identifying, by a processor, a first phrase in a sentence that is related to a first segmentation path; determining, by the processor, a first set of derived phrases semantically related to the first phrase; determining, by the processor, a first evaluation score based on a modified sentence generated by replacing the first phrase with the respective derived phrase in the first group; and segmenting the sentence based on the first evaluation score.

Another aspect of the present application relates to a system for segmenting sentences. The system may include a communication interface for receiving a sentence; a memory for storing sentences and language models; and a processor for identifying a first phrase in the sentence that is associated with a first segmentation path; determining a first set of derived phrases semantically related to the first phrase; determining a first evaluation score based on a modified sentence generated by replacing the first phrase with the corresponding derived phrase in the first set; and segmenting the sentence based on the first evaluation score.

Yet another aspect of the present application relates to a non-transitory computer-readable medium storing a set of instructions that, when executed by at least one processor of a segmentation device, cause the segmentation device to perform a method for segmenting a sentence, the method may include: identifying a first phrase in the sentence that is associated with the first segmentation path; determining a first set of derived phrases semantically related to the first phrase; determining, by a processor, a first evaluation score based on a modified sentence generated by replacing the first phrase with the respective derived phrase in the first group; and segmenting the sentence based on the first evaluation score.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.

Drawings

Fig. 1 is a block diagram of an exemplary system for segmenting sentences according to some embodiments of the present application.

Fig. 2 illustrates two exemplary segmentation paths of a Chinese sentence according to some embodiments of the present application.

Fig. 3 is a flow diagram of an exemplary method for segmenting a sentence according to some embodiments of the present application.

Detailed Description

Reference will now be made in detail to exemplary embodiments and the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.

Fig. 1 is a block diagram of an exemplary system 100 for segmenting sentences according to some embodiments of the present application.

The system 100 may be a general-purpose server or a proprietary device for processing textual information in sentences. As shown in Fig. 1, system 100 may include a communication interface 102, a processor 104, and a memory 116. The processor 104 may include a number of functional modules, such as a tokenizer 106, a phrase identifier 108, a derived phrase generator 110, a score determination unit 112, and a segmentation unit 114. These modules (and any corresponding sub-modules or sub-units) may be functional hardware units (e.g., portions of an integrated circuit) of the processor 104 designed for use with other components, or portions of a program. The program may be stored in a computer-readable medium and, when executed by the processor 104, may perform one or more functions. Although Fig. 1 shows the units 106-114 as being entirely within one processor 104, it will be appreciated that these units may be distributed among multiple processors that are located proximate to each other or remote from each other. The system 100 may be implemented in the cloud or in a separate computer/server.

The communication interface 102 may be used to receive one or more sentences 120. The memory 116 may be used to store one or more sentences. The memory 116 may be implemented as any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.

According to an embodiment of the present application, the processor 104 may identify a first phrase in the sentence received by the communication interface 102. The first phrase is associated with a first segmentation path.

For example, tokenizer 106 may segment a sentence according to one or more segmentation paths. As described in example I, "The man over there is watching TV" may be segmented into a first segmentation path of "The man/over/there is/watching TV" or a second segmentation path of "The man/over there/is/watching TV". The first phrase "there is" is associated with the first segmentation path and the second phrase "over there" is associated with the second segmentation path. The phrases "there is" and "over there" may be identified by comparing the first and second segmentation paths and locating the difference between them. Referring to example I, the phrase identifier 108 may identify that the segmentations "over/there is" and "over there/is" are different by comparing the first and second segmentation paths. The phrase identifier 108 may then identify "there is" as a first phrase associated with the first segmentation path and "over there" as a second phrase associated with the second segmentation path. It is contemplated that the first and second segmentations are different segmentations over the same portion of the sentence. Thus, the first and second phrases have at least one word (e.g., "there") that overlaps.
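The comparison of two segmentation paths described above can be sketched as follows. This is only an illustration under stated assumptions: the helper names `word_spans` and `differing_phrases` are hypothetical, words are assumed to be separated by whitespace (for Chinese, each character would be a unit instead), and taking the longest differing phrase on each path as "the" phrase is a choice made for this sketch, not something the application prescribes.

```python
from typing import List, Tuple

Span = Tuple[int, int]


def word_spans(path: List[str]) -> List[Span]:
    """Map each phrase of a segmentation path to its (start, end) word indices."""
    spans, position = [], 0
    for phrase in path:
        length = len(phrase.split())
        spans.append((position, position + length))
        position += length
    return spans


def differing_phrases(path_a: List[str], path_b: List[str]) -> Tuple[List[str], List[str]]:
    """Return the phrases of each path whose word span is absent from the other path."""
    spans_a, spans_b = word_spans(path_a), word_spans(path_b)
    only_a = [p for p, s in zip(path_a, spans_a) if s not in spans_b]
    only_b = [p for p, s in zip(path_b, spans_b) if s not in spans_a]
    return only_a, only_b


first_path = ["The man", "over", "there is", "watching TV"]
second_path = ["The man", "over there", "is", "watching TV"]

only_first, only_second = differing_phrases(first_path, second_path)

# Assumption of this sketch: the longest differing phrase on each path is
# treated as the phrase associated with that path.
first_phrase = max(only_first, key=lambda p: len(p.split()))
second_phrase = max(only_second, key=lambda p: len(p.split()))
```

With the two paths of example I, the sketch isolates "there is" on the first path and "over there" on the second, which share the word "there".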

The derived phrase generator 110 may then determine a first set of derived phrases that are semantically related to the first phrase. The derived phrases may be determined based on semantic vectors of the first phrase and of candidate phrases. The candidate phrases may be pre-stored in a phrase database, and each phrase in the database may be compared to the first phrase based on the semantic vectors. Semantic vectors are mathematical models used to represent text documents (e.g., phrases) as vectors. In general, the difference between the semantic vectors of two phrases may be represented by the cosine distance between the two vectors. In some embodiments, the first set of derived phrases may include synonyms of the first phrase. As shown in Fig. 1, communication interface 102 may be further used to retrieve synonyms corresponding to the phrases from phrase database 122. For example, the first phrase "there is" may have several synonyms, such as "there are," "there's," "exist," "have," "has," and the like. In some embodiments, the processor 104 may also determine a second set of derived phrases that are semantically related to the second phrase. The second set of derived phrases associated with the second phrase "over there" may contain synonyms such as "there", "over here", and the like.
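As a minimal sketch of how candidate phrases might be ranked by semantic vectors, the following computes cosine similarity between a target phrase and each candidate. The three-dimensional vectors, the function names, and the `min_similarity` cutoff are all hypothetical values chosen for illustration; the application does not specify how the vectors are produced or what cutoff is used.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def derived_phrases(target: str, vectors: Dict[str, Sequence[float]],
                    min_similarity: float = 0.8) -> List[str]:
    """Return candidates whose similarity to the target reaches the cutoff, best first."""
    target_vector = vectors[target]
    scored = [(phrase, cosine_similarity(target_vector, vector))
              for phrase, vector in vectors.items() if phrase != target]
    scored.sort(key=lambda item: item[1], reverse=True)
    return [phrase for phrase, similarity in scored if similarity >= min_similarity]


# Hypothetical toy vectors standing in for a phrase database.
toy_vectors = {
    "there is": (1.0, 0.0, 0.2),
    "there are": (0.9, 0.1, 0.2),
    "exist": (0.8, 0.0, 0.5),
    "watching TV": (0.0, 1.0, 0.0),
}

first_set = derived_phrases("there is", toy_vectors)
```

Cosine distance is simply one minus this similarity, so ranking by either quantity gives the same order of candidates.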

Given the first set of derived phrases, score determination unit 112 may replace the first phrase with each derived phrase in the first set and determine a first evaluation score based on the modified sentences. For example, the sentence "The man/over/there is/watching TV" may be modified by replacing the first phrase "there is" with the synonyms "there are", "there's", "exist", "have", "has", and the like. Accordingly, a plurality of modified sentences may be generated, including "The man/over/there are/watching TV", "The man/over/there's/watching TV", "The man/over/exist/watching TV", "The man/over/have/watching TV", "The man/over/has/watching TV", and the like. In some embodiments, a language model score may be determined for each modified sentence using the language model, and the language model scores may be averaged to derive the first evaluation score. The language model may evaluate the segmentation path according to natural language rules. In some embodiments, the modified sentences "The man/over/there are/watching TV", "The man/over/there's/watching TV", "The man/over/exist/watching TV", "The man/over/have/watching TV", and "The man/over/has/watching TV" may be scored 47, 68, 35, 33, and 42, respectively. Thus, the first evaluation score may be the average of the above scores, i.e., 45 points. As an example, for the modified sentence "The man/over/there are/watching TV", the language model may determine that a singular subject used with a plural verb does not comply with natural language rules, and thus the score of the modified sentence is low. It will be appreciated that a low score alone does not establish whether a segmentation path is appropriate or inappropriate.
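The arithmetic of this example can be checked with a few lines. The sketch below builds the modified sentences by substitution and averages the five language model scores quoted above; the variable names are illustrative only, and the scores are copied from the text rather than produced by an actual language model.

```python
from statistics import mean

path = ["The man", "over", "there is", "watching TV"]
first_phrase = "there is"
derived = ["there are", "there's", "exist", "have", "has"]

# One modified sentence per derived phrase, written in the slash notation of the text.
modified_sentences = [
    "/".join(d if phrase == first_phrase else phrase for phrase in path)
    for d in derived
]

# Scores quoted in the text, in the same order as the derived phrases.
lm_scores = [47, 68, 35, 33, 42]

first_evaluation_score = mean(lm_scores)  # (47 + 68 + 35 + 33 + 42) / 5
```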

The original sentence comprising the first phrase may also be evaluated by the language model to generate a first base score. For example, the original sentence "The man/over/there is/watching TV" may be scored 68 points, which can be taken as the first base score.

Similarly, the score determination unit 112 may replace the second phrase (e.g., "over there") with each derived phrase in the second set and determine a second evaluation score based on the modified sentences. For example, the second evaluation score may be determined to be 67 points and the second base score may be determined to be 69 points.

The language model may be trained for a specified language such as English, Chinese, Japanese, and the like. For the sentence in example I above, an English language model may be used. The language model may be a generic model pre-stored in the memory 116 or a model trained for a particular domain (e.g., novel, legal, navigational, etc.) through, for example, machine learning.

It will be appreciated that when synonyms of a phrase are used to modify a sentence segmented along an appropriate segmentation path, the modified sentences should still provide a similar meaning and follow natural language rules. That is, when the language model evaluates the segmentation path based on the original sentence and the modified sentences, the resulting language model scores should be close. On the other hand, when a phrase in an inappropriate segmentation path is replaced by a synonym for that phrase, the modified sentence may have a distinctly different meaning or may not even conform to natural language rules. That is, the difference between the language model scores based on the original sentence and the modified sentences may be amplified so that the processor 104 may determine that the segmentation path is not appropriate. In some embodiments, a threshold may be predetermined, such as 5 points. When the difference is greater than or equal to the threshold, the corresponding segmentation path may be determined to be an unsuitable segmentation path. It will be appreciated that the predetermined threshold may be different for different language models.

The segmentation unit 114 may further segment the sentence based on the evaluation score (e.g., the first and/or second evaluation scores). For example, for the first segmentation path, a difference between the first base score (e.g., 68 points) and the first evaluation score (e.g., 45 points) may be determined. The difference is 68 - 45 = 23 points, which is greater than the threshold (e.g., 5 points). Accordingly, the processor 104 may determine that the first segmentation path is an unsuitable segmentation path. On the other hand, for the second segmentation path, a difference between the second base score (e.g., 69 points) and the second evaluation score (e.g., 67 points) may be determined. The difference is 69 - 67 = 2 points, which is less than the threshold (e.g., 5 points). Accordingly, the segmentation unit 114 may determine that the second segmentation path is a suitable segmentation path and segment the sentence according to the second segmentation path.
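The threshold test in this example can be written as a one-line predicate. The function name `is_suitable_path` is hypothetical, and taking the absolute value of the difference is an assumption of this sketch (the text only speaks of "the difference" and does not say how a case with the evaluation score above the base score is handled).

```python
def is_suitable_path(base_score: float, evaluation_score: float,
                     threshold: float = 5.0) -> bool:
    """A path is treated as suitable when the score gap stays below the threshold.

    Assumption: the gap is measured as an absolute difference.
    """
    return abs(base_score - evaluation_score) < threshold


first_path_ok = is_suitable_path(68, 45)   # gap of 23 points
second_path_ok = is_suitable_path(69, 67)  # gap of 2 points
```

A gap exactly equal to the threshold counts as unsuitable here, matching the "greater than or equal to" wording used for the threshold.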

As described above, a first base score (e.g., 68) corresponding to a first segmentation path and a second base score (e.g., 69) corresponding to a second segmentation path are close, and selecting one segmentation path instead of the other based on these base scores may sometimes lead to errors. That is, a base score of 69 does not necessarily indicate that the corresponding segmentation path is better than the segmentation path having a base score of 68. However, by replacing the identified phrases in the first and second segmentation paths with synonyms, the difference between the first and second segmentation paths may be amplified so that the segmentation unit 114 may more accurately select one path from the plurality of paths based on the evaluation score associated with the corresponding path.

According to an embodiment of the present application, the segmentation unit 114 may segment the sentence based on a difference between the first evaluation score and the second evaluation score. For example, as described above, the first evaluation score and the second evaluation score are 45 points and 67 points, respectively. Accordingly, the segmentation unit 114 may select one of the first and second segmentation paths (e.g., the second segmentation path) having a higher evaluation score and segment the sentence according to the selected segmentation path.

It will be appreciated that all of the scores and thresholds described above are merely illustrative and may be modified if desired.

In addition to example I in English, the system 100 can process a text sentence in another language, such as Chinese. Fig. 2 further illustrates two exemplary segmentation paths of a Chinese sentence according to some embodiments of the present application. The Chinese sentence roughly means that Florence is strong and Juventus is in Serie A. The pinyin is marked under each Chinese character of the sentence in the original text.

As shown in Fig. 2, in example II, a Chinese sentence may be segmented according to first and second segmentation paths. The first and second segmentation paths may be compared by tokenizer 106 to identify that phrase 202 and phrase 204 are different. The phrase identifier 108 may then identify the phrase "意甲" in the first segmentation path as the first phrase associated with the first segmentation path; the phrase "意甲" means "Italian top-level league" (Serie A) in Chinese. The phrase identifier 108 may also identify the phrase "在意" in the second segmentation path as the second phrase associated with the second segmentation path; the phrase "在意" means "care," "mind," and the like. It is noted that the phrase "意甲" and the phrase "在意" share the Chinese character "意".

The derived phrase generator 110 may determine a first set of derived phrases and a second set of derived phrases that are semantically related to the first phrase and the second phrase, respectively. For example, derived phrases semantically related to the first phrase "意甲" may include "西甲" (the Spanish top-level league), "英超" (the English Premier League), "德甲" (the German top-level league), and the like. Derived phrases semantically related to the second phrase "在意" may include phrases meaning "care," "care about," "worry," and the like.

Modified sentences may be generated by replacing the first and second phrases with corresponding derived phrases, respectively. Then, based on the original sentence and the modified sentences, the score determination unit 112 may evaluate the first and second segmentation paths using a Chinese language model. For example, the first and second segmentation paths based on the original sentence may be scored as 58.342 and 59.081, respectively, and the first and second segmentation paths based on the modified sentences may be scored as 58.561 and 34.952, respectively.

Based on the evaluation scores, the segmentation unit 114 may select the first segmentation path, which has the higher evaluation score (58.561 rather than 34.952), and segment the sentence accordingly. Note that in this example, selecting a segmentation path based on the base scores would result in an error because the second segmentation path actually has a slightly higher base score.
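The reversal in example II can be seen by applying the same "pick the higher score" rule to both sets of scores. The sketch below uses the four scores quoted above; the function name `select_path` and the dictionary keys are illustrative only.

```python
from typing import Dict


def select_path(scores: Dict[str, float]) -> str:
    """Return the name of the segmentation path with the highest score."""
    return max(scores, key=lambda name: scores[name])


# Scores quoted in example II.
base_scores = {"first": 58.342, "second": 59.081}
evaluation_scores = {"first": 58.561, "second": 34.952}

chosen_by_base = select_path(base_scores)
chosen_by_evaluation = select_path(evaluation_scores)
```

Ranking by base score would pick the second path by a margin of under one point, while ranking by evaluation score picks the first path by a wide margin.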

The system 100 described above can amplify (and sometimes reverse) the difference between two or more segmentation paths by replacing recognized phrases with synonyms and select an appropriate path for segmenting a sentence when the language model cannot distinguish between them.

Another aspect of the present application relates to a method for segmenting a sentence. Fig. 3 is a flow diagram of an exemplary method 300 for segmenting a sentence according to some embodiments of the present application. For example, the method 300 may be implemented by a segmentation device and may include steps S302-S308.

In step S302, the segmentation device may identify a first phrase in the sentence that is associated with a first segmentation path. In some embodiments, the segmentation device may generate at least two segmentation paths of the sentence. Each of the segmentation paths may include a plurality of phrases. The segmentation device may compare the plurality of phrases and identify phrases in the first segmentation path that are different from corresponding phrases in other segmentation paths. Similarly, the segmentation device may identify a phrase in the second segmentation path. In this way, the segmentation device may identify a first phrase in the sentence that is associated with the first segmentation path and a second phrase in the sentence that is associated with the second segmentation path. Since the first and second segmentation paths are different segmentation paths of the same sentence or the same part of a sentence, the first and second phrases may have at least one overlapping word.

In step S304, the segmentation device may determine a first set of derived phrases semantically related to the first phrase. The derived phrases may be determined based on semantic vectors of the first phrase and of candidate phrases. The candidate phrases may be pre-stored in a phrase database, and each phrase in the database may be compared to the first phrase based on the semantic vectors. In some embodiments, the first set of derived phrases may include synonyms of the first phrase. Similarly, the segmentation device may determine a second set of derived phrases that are semantically related to the second phrase.

In step S306, the segmentation device may determine a first evaluation score based on the modified sentences generated by replacing the first phrase with each derived phrase in the first set. A language model score may be determined for each modified sentence using the language model, and the language model scores may be averaged to derive the first evaluation score. The language model may evaluate the segmentation path according to natural language rules. The segmentation device may further determine a second evaluation score based on the modified sentences generated by replacing the second phrase with each derived phrase in the second set. A first base score and a second base score may also be determined based on the sentence. For example, the sentence segmented according to the first and second segmentation paths may be scored by the language model to generate the first and second base scores. It is to be understood that the average score is merely an example of evaluating the segmentation path. The individual scores may be manipulated or combined in any suitable manner to arrive at an evaluation score. For example, instead of a direct average of the individual scores, the evaluation score may be a weighted average of the individual scores, and the weights may correspond to the proximity of the individual synonyms to the phrase. As another example, a change in the language model score of a single modified sentence may be used to determine whether the language model score has changed significantly. If the language model score changes significantly, it may indicate that the corresponding segmentation path is not appropriate.
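The weighted-average variant mentioned in step S306 might look like the sketch below. The scores are the five values from example I; the weights are hypothetical placeholders for the proximity of each synonym to the phrase (for instance, a cosine similarity), since the text gives no concrete weights.

```python
from typing import Sequence


def weighted_evaluation_score(scores: Sequence[float],
                              weights: Sequence[float]) -> float:
    """Weighted average of language model scores; weights need not sum to 1."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    return sum(s * w for s, w in zip(scores, weights)) / sum(weights)


lm_scores = [47, 68, 35, 33, 42]

# Hypothetical proximity weights, one per derived phrase.
proximity_weights = [0.9, 1.0, 0.5, 0.4, 0.4]

weighted_score = weighted_evaluation_score(lm_scores, proximity_weights)
uniform_score = weighted_evaluation_score(lm_scores, [1.0] * len(lm_scores))
```

With equal weights the function reduces to the plain average of step S306; with unequal weights, closer synonyms contribute more to the evaluation score.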

In step S308, the segmentation device may segment the sentence based on the first evaluation score. In some embodiments, the segmentation device may compare the first evaluation score with the first base score and segment the sentence according to the first segmentation path when a difference between the first base score and the first evaluation score is less than a threshold. As described above, when a sentence is properly segmented, the evaluation score based on the modified sentences should be close to the base score based on the original sentence. In other embodiments, the segmentation device may segment the sentence based on a difference between the first evaluation score and the second evaluation score. For example, the segmentation device may select one of the first and second segmentation paths having a higher evaluation score, and segment the sentence according to the selected segmentation path.
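Steps S306 and S308 can be tied together in a short sketch that scores each candidate path and keeps the one with the higher evaluation score. Everything here is illustrative: the function names are hypothetical, and the language model is replaced by a lookup table. The first-path scores are those of example I; the two second-path scores are hypothetical placeholders chosen so that their average equals the 67 points stated for the second evaluation score.

```python
from statistics import mean
from typing import Callable, List, Sequence, Tuple

Path = List[str]
Candidate = Tuple[Path, str, Sequence[str]]  # (path, identified phrase, derived phrases)


def evaluation_score(path: Path, phrase: str, derived: Sequence[str],
                     score_fn: Callable[[Path], float]) -> float:
    """S306: average the scores of the sentences obtained by swapping in each derived phrase."""
    index = path.index(phrase)
    modified = [path[:index] + [d] + path[index + 1:] for d in derived]
    return mean(score_fn(sentence) for sentence in modified)


def choose_path(candidates: Sequence[Candidate],
                score_fn: Callable[[Path], float]) -> Path:
    """S308: keep the path with the highest evaluation score."""
    best = max(candidates, key=lambda c: evaluation_score(c[0], c[1], c[2], score_fn))
    return best[0]


# Stand-in for a language model: a lookup table keyed by the slash-joined path.
TOY_SCORES = {
    "The man/over/there are/watching TV": 47,
    "The man/over/there's/watching TV": 68,
    "The man/over/exist/watching TV": 35,
    "The man/over/have/watching TV": 33,
    "The man/over/has/watching TV": 42,
    "The man/there/is/watching TV": 66,      # hypothetical value
    "The man/over here/is/watching TV": 68,  # hypothetical value
}


def toy_score(path: Path) -> float:
    return TOY_SCORES["/".join(path)]


first = (["The man", "over", "there is", "watching TV"], "there is",
         ["there are", "there's", "exist", "have", "has"])
second = (["The man", "over there", "is", "watching TV"], "over there",
          ["there", "over here"])

chosen = choose_path([first, second], toy_score)
```

Because the scoring function is passed in as a parameter, the same two functions could be used with a real language model or with more than two candidate paths.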

Yet another aspect of the application relates to a non-transitory computer-readable medium storing instructions that, when executed, cause one or more processors to perform the method, as described above. The computer-readable medium may include volatile or nonvolatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other types of computer-readable medium or computer-readable storage device. For example, as disclosed, the computer-readable medium may be a storage device or memory module having stored thereon computer instructions. In some embodiments, the computer readable medium may be a disk or flash drive having computer instructions stored thereon.

It will be apparent to those skilled in the art that various modifications and variations can be made to the disclosed segmentation system and associated methods. Other embodiments will be apparent to those skilled in the art from consideration of the specification and practice of the disclosed system and associated method. Although the embodiments are described with respect to two segmentation paths as an example, the described segmentation systems and methods may be applied to more than two segmentation paths.

It is intended that the specification and examples be considered as exemplary only, with a true scope being indicated by the following claims and their equivalents.

Claims

1. A computer-implemented method for segmenting a sentence, comprising:

identifying, by a processor, a first phrase in the sentence that is related to the first segmentation path;

determining, by the processor, a first set of derived phrases semantically related to the first phrase;

determining, by the processor, a first evaluation score based on a modified sentence generated by replacing the first phrase with the corresponding derived phrase in the first group; and

segmenting the sentence based on the first evaluation score.

2. The method of claim 1, wherein determining the first evaluation score comprises:

determining a language model score for each modified sentence; and

averaging the language model scores to derive the first evaluation score.

3. The method of claim 2, wherein the language model score is generated using a language model according to natural language rules.

4. The method of claim 1, further comprising:

determining a first base score based on the sentence comprising the first phrase; and

segmenting the sentence based on a difference between the first base score and the first evaluation score.

5. The method of claim 4, wherein the sentence is segmented according to the first segmentation path when a difference between the first base score and the first evaluation score is less than a threshold.

6. The method of claim 1, further comprising:

identifying a second phrase in the sentence that is related to the second segmentation path;

determining a second set of derived phrases semantically related to the second phrase;

determining a second evaluation score based on a modified sentence generated by replacing the second phrase with the corresponding derived phrase in the second group; and

segmenting the sentence based on a difference between the first evaluation score and the second evaluation score.

7. The method of claim 6, wherein the second phrase has at least one overlapping word with the first phrase.

8. The method of claim 6, further comprising:

identifying the first phrase and the second phrase based on a difference between the first segmentation path and the second segmentation path.

9. The method of claim 6, further comprising:

selecting one of the first segmentation path and the second segmentation path having a higher evaluation score; and

segmenting the sentence according to the selected segmentation path.

10. The method of claim 1, wherein the derived phrase is determined based on a semantic vector between the first phrase and a candidate phrase.

11. A system for segmenting sentences, comprising:

a communication interface for receiving the sentence;

a memory for storing sentences and language models; and

a processor for

identifying a first phrase in the sentence that is related to a first segmentation path;

determining a first set of derived phrases semantically related to the first phrase;

determining a first evaluation score based on a modified sentence generated by replacing the first phrase with the corresponding derived phrase in the first group; and

segmenting the sentence based on the first evaluation score.

12. The system of claim 11, wherein the processor is further configured to:

determining a language model score for each modified sentence; and

averaging the language model scores to derive the first evaluation score.

13. The system of claim 12, wherein the language model score is generated using a natural language rules based language model.

14. The system of claim 11, wherein the processor is further configured to:

15. The system of claim 14, wherein the sentence is segmented according to the first segmentation path when a difference between the first base score and the first evaluation score is less than a threshold.

16. The system of claim 11, wherein the processor is further configured to:

17. The system of claim 16, wherein the second phrase has at least one overlapping word with the first phrase.

18. The system of claim 16, wherein the processor is further configured to:

19. The system of claim 16, wherein the processor is further configured to:

and segmenting the sentence according to the selected segmentation path.

20. A non-transitory computer-readable medium storing a set of instructions that, when executed by at least one processor of a segmentation device, cause the segmentation device to perform a method for segmenting a sentence, the method comprising:

identifying a first phrase in the sentence that is associated with the first segmentation path;

determining, by the processor, a first evaluation score based on a modified sentence generated by replacing the first phrase with the respective derived phrase in the first group; and

segmenting the sentence based on the first evaluation score.