CN110945514A - System and method for segmenting sentences - Google Patents
- Publication number
- CN110945514A (application number CN201780093452.0A)
- Authority
- CN
- China
- Prior art keywords
- phrase
- sentence
- evaluation score
- segmentation
- score
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
- G06F40/247—Thesauruses; Synonyms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/53—Processing of non-Latin text
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Human Computer Interaction (AREA)
- Acoustics & Sound (AREA)
- Multimedia (AREA)
- Machine Translation (AREA)
Abstract
Embodiments of the present application provide systems and methods for segmenting sentences. The method may include identifying a first phrase in a sentence that is related to a first segmentation path, determining a first set of derived phrases that are semantically related to the first phrase, determining a first evaluation score based on a modified sentence generated by replacing the first phrase with the corresponding derived phrases in the first set, and segmenting the sentence based on the first evaluation score.
Description
Technical Field
The present application relates to text-to-speech (TTS) techniques, and more particularly, to segmenting text sentences.
Background
Text-to-speech (TTS) technology can transcribe textual information into audio signals. For example, in a navigation application (e.g., a DiDi app), text information such as traffic conditions, addresses, etc. may be presented to the user by voice. Books and news can also be read to the user by TTS technology.
In order to be read in a natural way, a piece of text (e.g., a sentence) must be correctly segmented before being transcribed into an audio signal. Typically, each phrase in a sentence contains one or more words. Consistent with the present application, a word may be a word in a Latin-script language such as English, French, or Spanish, or a character in an Asian language such as Chinese, Korean, or Japanese. These words or characters may be segmented into various possible combinations of phrases. In example I, the sentence "The man over there is watching TV" can be segmented according to the following two segmentation paths:
First segmentation path: "The man/over/there is/watching TV".
Second segmentation path: "The man/over there/is/watching TV".
When generating an audio signal based on the first segmentation path, the transcription may not make sense to the user. However, conventional segmentation systems and methods cannot determine which segmentation path is better because each phrase in the two segmentation paths seems to be linguistically reasonable.
To address the above-mentioned problems, embodiments of the present application provide improved systems and methods for segmenting sentences.
Disclosure of Invention
One aspect of the present application relates to a method for segmenting a sentence. The method may include identifying, by a processor, a first phrase in a sentence that is related to a first segmentation path; determining, by the processor, a first set of derived phrases semantically related to the first phrase; determining, by the processor, a first evaluation score based on a modified sentence generated by replacing the first phrase with the respective derived phrase in the first group; and segmenting the sentence based on the first evaluation score.
Another aspect of the present application relates to a system for segmenting sentences. The system may include a communication interface for receiving the sentence; a memory for storing sentences and language models; and a processor for identifying the first phrase in a sentence that is associated with a first segmentation path; determining a first set of derived phrases semantically related to the first phrase; determining a first evaluation score based on a modified sentence generated by replacing the first phrase with the corresponding derived phrase in the first group; and segmenting the sentence based on the first evaluation score.
Yet another aspect of the present application relates to a non-transitory computer-readable medium storing a set of instructions that, when executed by at least one processor of a segmentation device, cause the segmentation device to perform a method for segmenting a sentence, the method may include: identifying a first phrase in the sentence that is associated with the first segmentation path; determining a first set of derived phrases semantically related to the first phrase; determining, by a processor, a first evaluation score based on a modified sentence generated by replacing the first phrase with the respective derived phrase in the first group; and segmenting the sentence based on the first evaluation score.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
Fig. 1 is a block diagram of an exemplary system for segmenting sentences according to some embodiments of the present application.
Fig. 2 shows two exemplary segmentation paths of a Chinese sentence according to some embodiments of the present application.
Fig. 3 is a flow diagram of an exemplary method for segmenting a sentence according to some embodiments of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments and the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.
Fig. 1 is a block diagram of an exemplary system 100 for segmenting sentences according to some embodiments of the present application.
The system 100 may be a general-purpose server or a dedicated device for processing textual information in sentences. As shown in fig. 1, system 100 may include a communication interface 102, a processor 104, and a memory 116. The processor 104 may also include a number of functional modules, such as a tokenizer 106, a phrase recognizer 108, a derived phrase generator 110, a score determination unit 112, and a segmentation unit 114. These modules (and any corresponding sub-modules or sub-units) may be functional hardware units (e.g., portions of an integrated circuit) of the processor 104 designed for use with other components or portions of a program. The program may be stored in a computer readable medium and, when executed by the processor 104, may perform one or more functions. Although fig. 1 shows the units 106-114 as being entirely within one processor 104, it will be appreciated that these units may be distributed among multiple processors that are located proximate to each other or remote from each other. The system 100 may be implemented in the cloud or in a separate computer/server.
The communication interface 102 may be used to receive one or more sentences 120. The memory 116 may be used to store one or more sentences. The memory 116 may be implemented as any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
According to an embodiment of the present application, the processor 104 may identify a first phrase in the sentence received by the communication interface 102. The first phrase is associated with a first segmentation path.
For example, tokenizer 106 may segment a sentence along one or more segmentation paths. As described in example I, "The man over there is watching TV" may be segmented along a first segmentation path, "The man/over/there is/watching TV", or a second segmentation path, "The man/over there/is/watching TV". The first phrase "there is" is associated with the first segmentation path and the second phrase "over there" is associated with the second segmentation path. The phrases "there is" and "over there" may be identified by comparing the first and second segmentation paths and locating the difference between them. Referring to example I, the phrase recognizer 108 may determine, by comparing the first and second segmentation paths, that the segmentations "over/there is" and "over there/is" differ. The phrase recognizer 108 may then identify "there is" as a first phrase associated with the first segmentation path and "over there" as a second phrase associated with the second segmentation path. It is contemplated that the first and second segmentations are different segmentations of the same portion of the sentence. Thus, the first and second phrases have at least one overlapping word (e.g., "there").
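The comparison of two segmentation paths can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the patented implementation: the helper names `phrase_spans` and `differing_phrases` are hypothetical, and a phrase is treated as "different" when its word span does not occur in the other path.

```python
def phrase_spans(path):
    """Pair each phrase with its (start, end) word offsets in the sentence."""
    spans, start = [], 0
    for phrase in path:
        end = start + len(phrase.split())
        spans.append((start, end, phrase))
        start = end
    return spans


def differing_phrases(path_a, path_b):
    """Return the phrases of each path whose word span is absent from the other path."""
    spans_a, spans_b = phrase_spans(path_a), phrase_spans(path_b)
    bounds_a = {(s, e) for s, e, _ in spans_a}
    bounds_b = {(s, e) for s, e, _ in spans_b}
    only_a = [p for s, e, p in spans_a if (s, e) not in bounds_b]
    only_b = [p for s, e, p in spans_b if (s, e) not in bounds_a]
    return only_a, only_b


# Example I from the text, with "/" marking the phrase boundaries.
first_path = "The man/over/there is/watching TV".split("/")
second_path = "The man/over there/is/watching TV".split("/")
only_first, only_second = differing_phrases(first_path, second_path)
```

Here `only_first` contains "over" and "there is", and `only_second` contains "over there" and "is". The example in the text works with "there is" and "over there", which overlap on the word "there"; how a candidate is chosen among the differing phrases is not specified in the text and is left open in this sketch.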
The derived phrase generator 110 may then determine a first set of derived phrases that are semantically related to the first phrase. The derived phrase may be determined based on a semantic vector between the first phrase and the candidate phrase. The candidate phrases may be pre-stored in a phrase database, and each phrase in the database may be compared to the first phrase based on the semantic vector. Semantic vectors are mathematical models used to represent text documents (e.g., phrases) as vectors. In general, the difference between the semantic vectors of two phrases may be represented by the cosine distance between the two vectors. In some embodiments, the first set of derived phrases may include synonyms of the first phrase. As shown in fig. 1, communication interface 102 may be further used to retrieve synonyms corresponding to the phrases from phrase database 122. For example, the first phrase "there is" may have several synonyms, such as "there are", "there's", "exist", "have", "has", and the like. In some embodiments, the processor 104 may also determine a second set of derived phrases that are semantically related to the second phrase. The second set of derived phrases associated with the second phrase "over there" may contain synonyms such as "there", "over here", and the like.
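As a rough illustration of ranking candidate phrases by semantic vector, the following Python sketch computes cosine similarity and keeps the closest candidates. The three-dimensional vectors are invented toy values, and `derived_phrases` and `top_k` are hypothetical names; a real system would take its vectors from a trained embedding model, which the text does not specify.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def derived_phrases(phrase_vector, candidates, top_k=2):
    """Return the top_k candidate phrases most similar to phrase_vector."""
    ranked = sorted(
        candidates,
        key=lambda name: cosine_similarity(phrase_vector, candidates[name]),
        reverse=True,
    )
    return ranked[:top_k]


# Toy vectors (assumed values, for illustration only).
toy_database = {
    "there are": [0.9, 0.1, 0.0],
    "exist": [0.7, 0.3, 0.1],
    "watching": [0.0, 0.2, 0.9],
}
there_is_vector = [1.0, 0.0, 0.0]
closest = derived_phrases(there_is_vector, toy_database)
```

With these toy values, "there are" and "exist" rank above "watching". Cosine distance, as mentioned in the text, is simply one minus this similarity.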
Given the first set of derived phrases, score determination unit 112 may replace the first phrase with each derived phrase in the first set and determine a first evaluation score based on the modified sentences. For example, the sentence "The man/over/there is/watching TV" may be modified by replacing the first phrase "there is" with the synonyms "there are", "there's", "exist", "have", "has", and the like. Accordingly, a plurality of modified sentences may be generated, including "The man/over/there are/watching TV", "The man/over/there's/watching TV", "The man/over/exist/watching TV", "The man/over/have/watching TV", "The man/over/has/watching TV", and the like. In some embodiments, a language model score may be determined for each modified sentence using the language model, and the language model scores may be averaged to derive the first evaluation score. The language model may evaluate the segmentation path according to natural language rules. In some embodiments, the modified sentences "The man/over/there are/watching TV", "The man/over/there's/watching TV", "The man/over/exist/watching TV", "The man/over/have/watching TV", and "The man/over/has/watching TV" may be scored 47, 68, 35, 33, and 42, respectively. Thus, the first evaluation score may be the average of these scores, i.e., (47 + 68 + 35 + 33 + 42) / 5 = 45 points. As an example, for the modified sentence "The man/over/there are/watching TV", the language model may determine that a singular subject with a plural verb does not comply with natural language rules, and the modified sentence therefore receives a low score. It will be appreciated that a low score for a single modified sentence does not by itself determine whether a segmentation path is appropriate.
The original sentence comprising the first phrase may also be evaluated by the language model to generate a first base score. For example, the original sentence "The man/over/there is/watching TV" may be scored 68 points, which can be taken as the first base score.
Similarly, the score determination unit 112 may replace the second phrase (e.g., "over there") with each derived phrase in the second set and determine a second evaluation score based on the modified sentences. For example, the second evaluation score may be determined to be 67 points and the second base score may be determined to be 69 points.
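The substitution and averaging step can be illustrated with a short Python sketch. It is a minimal sketch under stated assumptions: no real language model is called, the scores are the illustrative figures quoted in the text, the synonym spellings follow example I as read here, and `modified_sentences` is a hypothetical helper name.

```python
from statistics import mean


def modified_sentences(path, phrase, derived):
    """Build one segmented sentence per derived phrase by swapping out `phrase`."""
    index = path.index(phrase)
    return ["/".join(path[:index] + [d] + path[index + 1:]) for d in derived]


first_path = ["The man", "over", "there is", "watching TV"]

# Illustrative language model scores from the text, keyed by synonym.
# A real system would obtain each score from a trained language model.
example_scores = {
    "there are": 47,
    "there's": 68,
    "exist": 35,
    "have": 33,
    "has": 42,
}

sentences = modified_sentences(first_path, "there is", list(example_scores))
first_evaluation_score = mean(example_scores.values())
```

The average of the five illustrative scores is 45, matching the first evaluation score in the text.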
The language model may be trained for a specified language such as English, Chinese, Japanese, and the like. For the sentence in example I above, the english language model may be used. The language model may be a generic model pre-stored in the memory 116 or a model trained for a particular domain (e.g., novel, legal, navigational, etc.) through, for example, machine learning.
It will be appreciated that when synonyms of a phrase are used to replace that phrase in a sentence segmented along an appropriate segmentation path, the modified sentences should still convey a similar meaning and follow natural language rules. That is, when the language model evaluates the segmentation path based on the original sentence and the modified sentences, the resulting language model scores should be close. On the other hand, when a phrase in an inappropriate segmentation path is replaced by a synonym for that phrase, the modified sentence may have a distinctly different meaning or may not even conform to natural language rules. That is, the difference between the language model scores based on the original sentence and the modified sentences may be amplified so that the processor 104 may determine that the segmentation path is not appropriate. In some embodiments, a threshold may be predetermined, such as 5 points. When the difference is greater than or equal to the threshold, the corresponding segmentation path may be determined to be an unsuitable segmentation path. It will be appreciated that the predetermined threshold may be different for different language models.
The segmentation unit 114 may further segment the sentence based on the evaluation score (e.g., the first and/or second evaluation scores). For example, for the first segmentation path, a difference between the first base score (e.g., 68 points) and the first evaluation score (e.g., 45 points) may be determined. The difference is 68 - 45 = 23 points, which is greater than the threshold (e.g., 5 points). Accordingly, the processor 104 may determine that the first segmentation path is an unsuitable segmentation path. On the other hand, for the second segmentation path, a difference between the second base score (e.g., 69 points) and the second evaluation score (e.g., 67 points) may be determined. The difference is 69 - 67 = 2 points, which is less than the threshold (e.g., 5 points). Accordingly, the segmentation unit 114 may determine that the second segmentation path is a suitable segmentation path and segment the sentence according to the second segmentation path.
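The threshold test described above amounts to a one-line comparison, sketched here in Python. The function name `is_suitable_path` is hypothetical, and taking the absolute value of the difference is an assumption of this sketch; the text only speaks of "the difference" between the base score and the evaluation score.

```python
def is_suitable_path(base_score, evaluation_score, threshold=5):
    """A path is treated as suitable when base and evaluation scores stay close.

    Assumption: the absolute difference is used, so an evaluation score that
    is slightly higher than the base score also counts as "close".
    """
    return abs(base_score - evaluation_score) < threshold


# Illustrative figures from example I.
first_path_ok = is_suitable_path(base_score=68, evaluation_score=45)   # difference 23
second_path_ok = is_suitable_path(base_score=69, evaluation_score=67)  # difference 2
```

With the figures from example I, the first path fails the test and the second passes. A difference exactly equal to the threshold is rejected, consistent with the "greater than or equal to" wording in the text.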
As described above, a first base score (e.g., 68) corresponding to a first segmentation path and a second base score (e.g., 69) corresponding to a second segmentation path are close, and selecting one segmentation path instead of the other based on these base scores may sometimes lead to errors. That is, a base score of 69 does not necessarily indicate that the corresponding segmentation path is better than the segmentation path having a base score of 68. However, by replacing the identified phrases in the first and second segmentation paths with synonyms, the difference between the first and second segmentation paths may be amplified so that the segmentation unit 114 may more accurately select one path from the plurality of paths based on the evaluation score associated with the corresponding path.
According to an embodiment of the present application, the segmentation unit 114 may segment the sentence based on a difference between the first evaluation score and the second evaluation score. For example, as described above, the first evaluation score and the second evaluation score are 45 points and 67 points, respectively. Accordingly, the segmentation unit 114 may select one of the first and second segmentation paths (e.g., the second segmentation path) having a higher evaluation score and segment the sentence according to the selected segmentation path.
It will be appreciated that all of the scores and thresholds described above are merely illustrative and may be modified if desired.
In addition to example I in English, the system 100 can process a text sentence in another language, such as Chinese. Fig. 2 illustrates two exemplary segmentation paths of a Chinese sentence according to some embodiments of the present application. The Chinese sentence means roughly "Florence is powerful, and Ewing is in Serie A" (意甲, the top-level Italian football league). Pinyin is marked under each Chinese character of the sentence in the figure.
As shown in fig. 2, in example II, a Chinese sentence may be segmented according to first and second segmentation paths. The first and second segmentation paths may be compared by tokenizer 106 to identify that phrase 202 and phrase 204 are different. The phrase recognizer 108 may then identify the phrase 意甲 in the first segmentation path as the first phrase associated with the first segmentation path; 意甲 denotes the top-level Italian league (Serie A). The phrase recognizer 108 may also identify the phrase 在意 in the second segmentation path as the second phrase associated with the second segmentation path; 在意 means "care", "mind", and the like. It is noted that the phrases 意甲 and 在意 share the Chinese character 意 ("meaning").
The derived phrase generator 110 may determine a first set of derived phrases and a second set of derived phrases that are semantically related to the first phrase and the second phrase, respectively. For example, derived phrases semantically related to the first phrase 意甲 may include 西甲 (the top-level Spanish league), 英超 (the English Premier League), 德甲 (the top-level German league), and the like. Derived phrases semantically related to the second phrase 在意 may include "care", "care about", "worry", and the like.
Modified sentences may be generated by replacing the first and second phrases with the corresponding derived phrases, respectively. Then, based on the original sentence and the modified sentences, the score determination unit 112 may evaluate the first and second segmentation paths using a Chinese language model. For example, the first and second segmentation paths based on the original sentence may be scored 58.342 and 59.081, respectively, and the first and second segmentation paths based on the modified sentences may be scored 58.561 and 34.952, respectively.
Based on the evaluation scores, the segmentation unit 114 may select the first segmentation path, which has the higher evaluation score (58.561 rather than 34.952), and segment the sentence accordingly. Note that in this example, selecting a segmentation path based on the base scores would result in an error because the second segmentation path actually has a slightly higher base score.
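The figures from example II make the contrast between the two selection criteria easy to see. The sketch below is illustrative only; the dictionary layout and the helper name `pick_path` are assumptions, and the scores are the ones quoted in the text.

```python
# Scores quoted for example II: base scores come from the original sentence,
# evaluation scores from the sentences modified with derived phrases.
example_two = {
    "first": {"base": 58.342, "evaluation": 58.561},
    "second": {"base": 59.081, "evaluation": 34.952},
}


def pick_path(paths, criterion):
    """Return the name of the path with the highest score for the given criterion."""
    return max(paths, key=lambda name: paths[name][criterion])


chosen_by_evaluation = pick_path(example_two, "evaluation")
chosen_by_base = pick_path(example_two, "base")
```

Selecting on the evaluation score yields the first path, whereas selecting on the base score alone would yield the second path, which is the error the text warns about.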
The system 100 described above can amplify (and sometimes reverse) the difference between two or more segmentation paths by replacing recognized phrases with synonyms and select an appropriate path for segmenting a sentence when the language model cannot distinguish between them.
Another aspect of the present application relates to a method for segmenting a sentence. Fig. 3 is a flow diagram of an exemplary method 300 for segmenting a sentence according to some embodiments of the present application. For example, the method 300 may be implemented by a segmentation device and may include steps S302-S308.
In step S302, the segmentation apparatus may identify a first phrase in the sentence that is associated with the first segmentation path. In some embodiments, the segmentation device may generate at least two segmentation paths of the sentence. Each of the segmentation paths may include a plurality of phrases. The segmentation device may compare the plurality of phrases and identify phrases in the first segmentation path that are different from corresponding phrases in other segmentation paths. Similarly, the segmentation device may identify a phrase in the second segmentation path. Further, the segmentation device may identify a first phrase in the sentence that is associated with the first segmentation path and a second phrase in the sentence that is associated with the second segmentation path. Since the first and second segmentation paths are different segmentation paths of the same sentence or the same part of a sentence, the first and second phrases may have at least one overlapping word.
In step S304, the segmentation apparatus may determine a first set of derived phrases semantically related to the first phrase. The derived phrase may be determined based on a semantic vector between the first phrase and the candidate phrase. The candidate phrases may be pre-stored in a phrase database, and each phrase in the database may be compared to the first phrase based on the semantic vector. In some embodiments, the first set of derived phrases may include synonyms of the first phrase. Similarly, the segmentation device may determine a second set of derived phrases that are semantically related to the second phrase.
In step S306, the segmentation device may determine a first evaluation score based on the modified sentences generated by replacing the first phrase with each derived phrase in the first set. A language model score may be determined for each modified sentence using the language model, and the language model scores may be averaged to derive the first evaluation score. The language model may evaluate the segmentation path according to natural language rules. The segmentation device may further determine a second evaluation score based on the modified sentences generated by replacing the second phrase with each derived phrase in the second set. A first base score and a second base score may also be determined based on the sentence. For example, the sentence as segmented with the first phrase and with the second phrase may be scored by the language model to generate the first and second base scores. It is to be understood that the average score is merely one example of evaluating the segmentation path. The individual scores may be manipulated or combined in any suitable manner to arrive at an evaluation score. For example, instead of a direct average of the individual scores, the evaluation score may be a weighted average of the individual scores, and the weights may correspond to the semantic proximity of the individual synonyms to the phrase. As another example, the change in the language model score of a single modified sentence may be used to determine whether the language model score has changed significantly. If the language model score changes significantly, it may indicate that the corresponding segmentation path is not appropriate.
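The weighted-average variant mentioned in step S306 could look like the following Python sketch. The function name `weighted_evaluation_score` and the example weights are hypothetical; the text says only that the weights may correspond to how close each synonym is to the phrase.

```python
def weighted_evaluation_score(scores, weights):
    """Weighted average of per-sentence language model scores."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one weight must be non-zero")
    return sum(s * w for s, w in zip(scores, weights)) / total_weight


# Illustrative scores from example I.
scores = [47, 68, 35, 33, 42]

# Equal weights reproduce the plain average used elsewhere in the text.
plain_average = weighted_evaluation_score(scores, [1, 1, 1, 1, 1])

# Assumed similarity weights (invented values): closer synonyms count more.
similarity_weights = [0.9, 0.95, 0.6, 0.5, 0.5]
weighted_average = weighted_evaluation_score(scores, similarity_weights)
```

With equal weights the result is the plain average of 45. With unequal weights, the scores of the closest synonyms dominate, so a poor score for a distant synonym penalizes the path less.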
In step S308, the segmentation apparatus may segment the sentence based on the first evaluation score. In some embodiments, the segmentation device may compare the first evaluation score with the first base score and segment the sentence according to the first segmentation path when a difference between the first base score and the first evaluation score is less than a threshold. As described above, when a sentence is properly segmented, the evaluation score based on the modified sentence should be close to the base score based on the original sentence. In other embodiments, the segmentation device may segment the sentence based on a difference between the first evaluation score and the second evaluation score. For example, the segmentation apparatus may select one of the first and second segmentation paths having a higher evaluation score, and segment the sentence according to the selected segmentation path.
Yet another aspect of the application relates to a non-transitory computer-readable medium storing instructions that, when executed, cause one or more processors to perform the method, as described above. The computer-readable medium may include volatile or nonvolatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other types of computer-readable medium or computer-readable storage device. For example, as disclosed, the computer-readable medium may be a storage device or memory module having stored thereon computer instructions. In some embodiments, the computer readable medium may be a disk or flash drive having computer instructions stored thereon.
It will be apparent to those skilled in the art that various modifications and variations can be made to the disclosed segmentation system and associated methods. Other embodiments will be apparent to those skilled in the art from consideration of the specification and practice of the disclosed system and associated method. Although the embodiments are described with respect to two segmentation paths as an example, the described segmentation systems and methods may be applied to more than two segmentation paths.
It is intended that the specification and examples be considered as exemplary only, with a true scope being indicated by the following claims and their equivalents.
Claims (20)
1. A computer-implemented method for segmenting a sentence, comprising:
identifying, by a processor, a first phrase in the sentence that is related to the first segmentation path;
determining, by the processor, a first set of derived phrases semantically related to the first phrase;
determining, by the processor, a first evaluation score based on a modified sentence generated by replacing the first phrase with the corresponding derived phrase in the first group; and
segmenting the sentence based on the first evaluation score.
2. The method of claim 1, wherein determining the first evaluation score comprises:
determining a language model score for each modified sentence; and
averaging the language model scores to derive the first evaluation score.
3. The method of claim 2, wherein the language model score is generated using a language model according to natural language rules.
4. The method of claim 1, further comprising:
determining a first base score based on the sentence comprising the first phrase; and
segmenting the sentence based on a difference between the first base score and the first evaluation score.
5. The method of claim 4, wherein the sentence is segmented according to the first segmentation path when a difference between the first base score and the first evaluation score is less than a threshold.
6. The method of claim 1, further comprising:
identifying a second phrase in the sentence that is related to the second segmentation path;
determining a second set of derived phrases semantically related to the second phrase;
determining a second evaluation score based on a modified sentence generated by replacing the second phrase with the corresponding derived phrase in the second group; and
segmenting the sentence based on a difference between the first evaluation score and the second evaluation score.
7. The method of claim 6, wherein the second phrase has at least one overlapping word with the first phrase.
8. The method of claim 6, further comprising:
identifying the first phrase and the second phrase based on a difference between the first segmentation path and the second segmentation path.
9. The method of claim 6, further comprising:
selecting one of the first segmentation path and the second segmentation path having a higher evaluation score; and
segmenting the sentence according to the selected segmentation path.
10. The method of claim 1, wherein the derived phrase is determined based on a semantic vector between the first phrase and a candidate phrase.
11. A system for segmenting sentences, comprising:
a communication interface for receiving the sentence;
a memory for storing sentences and language models; and
a processor for:
identifying a first phrase in the sentence that is related to a first segmentation path;
determining a first set of derived phrases semantically related to the first phrase;
determining a first evaluation score based on a modified sentence generated by replacing the first phrase with the corresponding derived phrase in the first group; and
segmenting the sentence based on the first evaluation score.
12. The system of claim 11, wherein the processor is further configured for:
determining a language model score for each modified sentence; and
averaging the language model scores to derive the first evaluation score.
13. The system of claim 12, wherein the language model score is generated using a natural language rules based language model.
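The scoring step of claims 11 and 12 can be sketched as below: each derived phrase replaces the first phrase in turn, every modified sentence receives a language model score, and the scores are averaged. The function `evaluation_score`, its parameters, and the stand-in scorer `toy_lm_score` are hypothetical; an actual system would plug in a trained language model.

```python
def evaluation_score(tokens, index, derived, lm_score):
    """Average the language model scores of the sentences obtained by
    replacing tokens[index] with each derived phrase."""
    if not derived:
        raise ValueError("at least one derived phrase is required")
    scores = []
    for phrase in derived:
        modified = tokens[:index] + [phrase] + tokens[index + 1:]
        scores.append(lm_score(modified))
    return sum(scores) / len(scores)


def toy_lm_score(tokens):
    """Stand-in scorer (assumption): total character count of the sentence."""
    return float(len("".join(tokens)))


sentence = ["a", "bc", "d"]
print(evaluation_score(sentence, 1, ["x", "yyy"], toy_lm_score))
```

Passing the scorer in as a parameter keeps the averaging logic independent of whichever language model is used.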
14. The system of claim 11, wherein the processor is further configured for:
determining a first base score based on the sentence comprising the first phrase; and
segmenting the sentence based on a difference between the first base score and the first evaluation score.
15. The system of claim 14, wherein the sentence is segmented according to the first segmentation path when a difference between the first base score and the first evaluation score is less than a threshold.
16. The system of claim 11, wherein the processor is further configured for:
identifying a second phrase in the sentence that is related to the second segmentation path;
determining a second set of derived phrases semantically related to the second phrase;
determining a second evaluation score based on a modified sentence generated by replacing the second phrase with the corresponding derived phrase in the second group; and
segmenting the sentence based on a difference between the first evaluation score and the second evaluation score.
17. The system of claim 16, wherein the second phrase has at least one overlapping word with the first phrase.
18. The system of claim 16, wherein the processor is further configured for:
identifying the first phrase and the second phrase based on a difference between the first segmentation path and the second segmentation path.
19. The system of claim 16, wherein the processor is further configured for:
selecting one of the first segmentation path and the second segmentation path having a higher evaluation score; and
segmenting the sentence according to the selected segmentation path.
20. A non-transitory computer-readable medium storing a set of instructions that, when executed by at least one processor of a segmentation device, cause the segmentation device to perform a method for segmenting a sentence, the method comprising:
identifying a first phrase in the sentence that is associated with the first segmentation path;
determining a first set of derived phrases semantically related to the first phrase;
determining, by the processor, a first evaluation score based on a modified sentence generated by replacing the first phrase with the respective derived phrase in the first group; and
segmenting the sentence based on the first evaluation score.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2017/095305 WO2019023893A1 (en) | 2017-07-31 | 2017-07-31 | System and method for segmenting a sentence |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110945514A (en) | 2020-03-31 |
CN110945514B (en) | 2023-08-25 |
Family
ID=65232340
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201780093452.0A Active CN110945514B (en) | 2017-07-31 | 2017-07-31 | System and method for segmenting sentences |
Country Status (5)
Country | Link |
---|---|
US (1) | US11132506B2 (en) |
EP (1) | EP3642733A4 (en) |
CN (1) | CN110945514B (en) |
TW (1) | TWI676167B (en) |
WO (1) | WO2019023893A1 (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3642733A4 (en) * | 2017-07-31 | 2020-07-22 | Beijing Didi Infinity Technology and Development Co., Ltd. | System and method for segmenting a sentence |
CN110134949B (en) * | 2019-04-26 | 2022-10-28 | 网宿科技股份有限公司 | Text labeling method and equipment based on teacher supervision |
US11544466B2 (en) | 2020-03-02 | 2023-01-03 | International Business Machines Corporation | Optimized document score system using sentence structure analysis function |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2006018354A (en) * | 2004-06-30 | 2006-01-19 | Advanced Telecommunication Research Institute International | Text division device and natural language processor |
JP2007286901A (en) * | 2006-04-17 | 2007-11-01 | Mitsuyoshi Tsukahara | Sentence analyzing device |
JP2008021093A (en) * | 2006-07-12 | 2008-01-31 | National Institute Of Information & Communication Technology | Sentence conversion processing system, translation processing system having sentence conversion function, voice recognition processing system having sentence conversion function, and speech synthesis processing system having sentence conversion function |
JP2012042991A (en) * | 2010-08-12 | 2012-03-01 | Fuji Xerox Co Ltd | Sentence generation program and sentence generation device |
US20120078612A1 (en) * | 2010-09-29 | 2012-03-29 | Rhonda Enterprises, Llc | Systems and methods for navigating electronic texts |
CN103035243A (en) * | 2012-12-18 | 2013-04-10 | 中国科学院自动化研究所 | Real-time feedback method and system of long voice continuous recognition and recognition result |
CN103544309A (en) * | 2013-11-04 | 2014-01-29 | 北京中搜网络技术股份有限公司 | Splitting method for search string of Chinese vertical search |
US20140156260A1 (en) * | 2012-11-30 | 2014-06-05 | Microsoft Corporation | Generating sentence completion questions |
CN104464757A (en) * | 2014-10-28 | 2015-03-25 | 科大讯飞股份有限公司 | Voice evaluation method and device |
JP2017004179A (en) * | 2015-06-08 | 2017-01-05 | 日本電信電話株式会社 | Information processing method, device, and program |
CN106407235A (en) * | 2015-08-03 | 2017-02-15 | 北京众荟信息技术有限公司 | A semantic dictionary establishing method based on comment data |
Family Cites Families (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7680648B2 (en) * | 2004-09-30 | 2010-03-16 | Google Inc. | Methods and systems for improving text segmentation |
JP4481972B2 (en) * | 2006-09-28 | 2010-06-16 | 株式会社東芝 | Speech translation device, speech translation method, and speech translation program |
CN101261623A (en) * | 2007-03-07 | 2008-09-10 | 国际商业机器公司 | Word splitting method and device for word border-free mark language based on search |
CN102479191B (en) * | 2010-11-22 | 2014-03-26 | 阿里巴巴集团控股有限公司 | Method and device for providing multi-granularity word segmentation result |
TWI501097B (en) * | 2012-12-22 | 2015-09-21 | Ind Tech Res Inst | System and method of analyzing text stream message |
US20140244240A1 (en) * | 2013-02-27 | 2014-08-28 | Hewlett-Packard Development Company, L.P. | Determining Explanatoriness of a Segment |
CN103793491B (en) * | 2014-01-20 | 2017-01-25 | 天津大学 | Chinese news story segmentation method based on flexible semantic similarity measurement |
CN103927358B (en) * | 2014-04-15 | 2017-02-15 | 清华大学 | text search method and system |
TW201715420A (en) * | 2015-10-30 | 2017-05-01 | 元智大學 | Method and system for analysing the weight of text |
US20190179316A1 (en) * | 2016-08-25 | 2019-06-13 | Purdue Research Foundation | System and method for controlling a self-guided vehicle |
US11087068B2 (en) * | 2016-10-31 | 2021-08-10 | Fujifilm Business Innovation Corp. | Systems and methods for bringing document interactions into the online conversation stream |
EP3642733A4 (en) * | 2017-07-31 | 2020-07-22 | Beijing Didi Infinity Technology and Development Co., Ltd. | System and method for segmenting a sentence |
CN110998589B (en) * | 2017-07-31 | 2023-06-27 | 北京嘀嘀无限科技发展有限公司 | System and method for segmenting text |
2017
- 2017-07-31 EP EP17920149.6A patent/EP3642733A4/en not_active Withdrawn
- 2017-07-31 CN CN201780093452.0A patent/CN110945514B/en active Active
- 2017-07-31 WO PCT/CN2017/095305 patent/WO2019023893A1/en unknown
2018
- 2018-07-25 TW TW107125631A patent/TWI676167B/en active
2020
- 2020-01-22 US US16/749,956 patent/US11132506B2/en active Active
Also Published As
Publication number | Publication date |
---|---|
US11132506B2 (en) | 2021-09-28 |
TW201911289A (en) | 2019-03-16 |
WO2019023893A1 (en) | 2019-02-07 |
TWI676167B (en) | 2019-11-01 |
US20200160000A1 (en) | 2020-05-21 |
EP3642733A1 (en) | 2020-04-29 |
CN110945514B (en) | 2023-08-25 |
EP3642733A4 (en) | 2020-07-22 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US9582489B2 (en) | Orthographic error correction using phonetic transcription | |
WO2018149209A1 (en) | Voice recognition method, electronic device, and computer storage medium | |
CN107301860B (en) | Voice recognition method and device based on Chinese-English mixed dictionary | |
US9672817B2 (en) | Method and apparatus for optimizing a speech recognition result | |
Schuster et al. | Japanese and korean voice search | |
EP2863300B1 (en) | Function execution instruction system, function execution instruction method, and function execution instruction program | |
US10140976B2 (en) | Discriminative training of automatic speech recognition models with natural language processing dictionary for spoken language processing | |
US11132506B2 (en) | System and method for segmenting a sentence | |
KR102441063B1 (en) | Apparatus for detecting adaptive end-point, system having the same and method thereof | |
US20110054901A1 (en) | Method and apparatus for aligning texts | |
CN106297797A (en) | Method for correcting error of voice identification result and device | |
US9779728B2 (en) | Systems and methods for adding punctuations by detecting silences in a voice using plurality of aggregate weights which obey a linear relationship | |
CN111369974B (en) | Dialect pronunciation marking method, language identification method and related device | |
US11282521B2 (en) | Dialog system and dialog method | |
US20180130465A1 (en) | Apparatus and method for correcting pronunciation by contextual recognition | |
CN108573707B (en) | Method, device, equipment and medium for processing voice recognition result | |
JP2007041319A (en) | Speech recognition device and speech recognition method | |
US20060241936A1 (en) | Pronunciation specifying apparatus, pronunciation specifying method and recording medium | |
CN109614623B (en) | Composition processing method and system based on syntactic analysis | |
KR20150086086A (en) | server for correcting error in voice recognition result and error correcting method thereof | |
KR101086550B1 (en) | System and method for recommendding japanese language automatically using tranformatiom of romaji | |
TWI713870B (en) | System and method for segmenting a text | |
KR102166446B1 (en) | Keyword extraction method and server using phonetic value | |
CN109002454B (en) | Method and electronic equipment for determining spelling partition of target word | |
US11361761B2 (en) | Pattern-based statement attribution | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||