CN110945514A - System and method for segmenting sentences - Google Patents

Publication number
CN110945514A
Authority
CN
China
Prior art keywords
phrase
sentence
evaluation score
segmentation
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201780093452.0A
Other languages
Chinese (zh)
Other versions
CN110945514B (en)
Inventor
白洁
李秀林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Didi Infinity Technology and Development Co Ltd
Original Assignee
Beijing Didi Infinity Technology and Development Co Ltd
Application filed by Beijing Didi Infinity Technology and Development Co Ltd
Publication of CN110945514A
Application granted
Publication of CN110945514B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30 Semantic analysis
    • G06F40/40 Processing or translation of natural language
    • G06F40/53 Processing of non-Latin text
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G10L13/10 Prosody rules derived from text; Stress or intonation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Machine Translation (AREA)

Abstract

Embodiments of the present application provide systems and methods for segmenting sentences. The method may include identifying a first phrase in a sentence that is related to a first segmentation path, determining a first set of derived phrases that are semantically related to the first phrase, determining a first evaluation score based on a modified sentence generated by replacing the first phrase with the corresponding derived phrases in the first set, and segmenting the sentence based on the first evaluation score.

Description

System and method for segmenting sentences
Technical Field
The present application relates to text-to-speech (TTS) techniques, and more particularly, to segmenting text sentences.
Background
Text-to-speech (TTS) technology can convert textual information into audio signals. For example, in a navigation application (e.g., a DiDi app), text information such as traffic conditions, addresses, etc. may be presented to the user by voice. Books and news can also be read to the user by TTS technology.
In order to read in a natural way, a piece of text (e.g., a sentence) must be correctly segmented before being converted into an audio signal. Typically, each phrase in a sentence contains one or more words. Consistent with the present application, a "word" may be a word in a Latin-script language such as English, French, or Spanish, or a character in an Asian language such as Chinese, Korean, or Japanese. These words or characters may be segmented into various possible combinations of phrases. In Example I, the sentence "The man over there is watching TV" can be segmented according to the following two segmentation paths:
First segmentation path: "The man/over/there is/watching TV".
Second segmentation path: "The man/over there/is/watching TV".
When generating an audio signal based on the first segmentation path, the transcription may not make sense to the user. However, conventional segmentation systems and methods cannot determine which segmentation path is better because each phrase in the two segmentation paths seems to be linguistically reasonable.
To address the above-mentioned problems, embodiments of the present application provide improved systems and methods for segmenting sentences.
Disclosure of Invention
One aspect of the present application relates to a method for segmenting a sentence. The method may include identifying, by a processor, a first phrase in the sentence that is related to a first segmentation path; determining, by the processor, a first set of derived phrases semantically related to the first phrase; determining, by the processor, a first evaluation score based on modified sentences generated by replacing the first phrase with the respective derived phrases in the first set; and segmenting the sentence based on the first evaluation score.
Another aspect of the present application relates to a system for segmenting a sentence. The system may include a communication interface for receiving the sentence; a memory for storing the sentence and a language model; and a processor for identifying a first phrase in the sentence that is related to a first segmentation path; determining a first set of derived phrases semantically related to the first phrase; determining a first evaluation score based on modified sentences generated by replacing the first phrase with the respective derived phrases in the first set; and segmenting the sentence based on the first evaluation score.
Yet another aspect of the present application relates to a non-transitory computer-readable medium storing a set of instructions that, when executed by at least one processor of a segmentation device, cause the segmentation device to perform a method for segmenting a sentence. The method may include: identifying a first phrase in the sentence that is related to a first segmentation path; determining a first set of derived phrases semantically related to the first phrase; determining a first evaluation score based on modified sentences generated by replacing the first phrase with the respective derived phrases in the first set; and segmenting the sentence based on the first evaluation score.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the invention as claimed.
Drawings
Fig. 1 is a block diagram of an exemplary system for segmenting sentences according to some embodiments of the present application.
Fig. 2 shows two exemplary segmentation paths of a Chinese sentence according to some embodiments of the present application.
Fig. 3 is a flow diagram of an exemplary method for segmenting a sentence according to some embodiments of the present application.
Detailed Description
Reference will now be made in detail to exemplary embodiments and the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.
Fig. 1 is a block diagram of an exemplary system 100 for segmenting sentences according to some embodiments of the present application.
The system 100 may be a general-purpose server or a proprietary device for processing textual information in sentences. As shown in Fig. 1, system 100 may include a communication interface 102, a processor 104, and a memory 116. The processor 104 may also include a number of functional modules, such as a tokenizer 106, a phrase identifier 108, a derived phrase generator 110, a score determination unit 112, and a segmentation unit 114. These modules (and any corresponding sub-modules or sub-units) may be functional hardware units (e.g., portions of an integrated circuit) of the processor 104 designed for use with other components, or portions of a program. The program may be stored in a computer-readable medium and, when executed by the processor 104, may perform one or more functions. Although Fig. 1 shows the units 106-114 as being entirely within one processor 104, it will be appreciated that these units may be distributed among multiple processors that are located proximate to each other or remote from each other. The system 100 may be implemented in the cloud or in a separate computer/server.
The communication interface 102 may be used to receive one or more sentences 120. The memory 116 may be used to store one or more sentences. The memory 116 may be implemented as any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
According to an embodiment of the present application, the processor 104 may identify a first phrase in the sentence received by the communication interface 102. The first phrase is associated with a first segmentation path.
For example, tokenizer 106 may segment a sentence along one or more segmentation paths. As described in Example I, "The man over there is watching TV" may be segmented along a first segmentation path, "The man/over/there is/watching TV", or a second segmentation path, "The man/over there/is/watching TV". The first phrase "there is" is associated with the first segmentation path, and the second phrase "over there" is associated with the second segmentation path. The phrases "there is" and "over there" may be identified by comparing the first and second segmentation paths and locating the differences between them. Referring to Example I, the phrase identifier 108 may determine, by comparing the two paths, that the segmentations "over/there is" and "over there/is" differ. The phrase identifier 108 may then identify "there is" as the first phrase associated with the first segmentation path and "over there" as the second phrase associated with the second segmentation path. It is contemplated that the first and second segmentation paths segment the same portion of the sentence differently. Thus, the first and second phrases have at least one overlapping word (e.g., "there").
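The path comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: words are taken to be whitespace-delimited (which would not hold for Chinese text), the helper names `spans` and `differing_phrases` are hypothetical, and picking the longest differing phrase as "the" phrase of each path is a simplification not spelled out in the description.

```python
def spans(path):
    """Attach a (start, end) word span to each phrase of a segmentation path."""
    out, pos = [], 0
    for phrase in path:
        n = len(phrase.split())  # assumption: words are whitespace-delimited
        out.append((pos, pos + n, phrase))
        pos += n
    return out


def differing_phrases(path_a, path_b):
    """Return, for each path, the phrases whose word spans are absent from the other path."""
    a, b = spans(path_a), spans(path_b)
    keys_a = {(s, e) for s, e, _ in a}
    keys_b = {(s, e) for s, e, _ in b}
    only_a = [p for s, e, p in a if (s, e) not in keys_b]
    only_b = [p for s, e, p in b if (s, e) not in keys_a]
    return only_a, only_b


first_path = ["The man", "over", "there is", "watching TV"]
second_path = ["The man", "over there", "is", "watching TV"]

only_first, only_second = differing_phrases(first_path, second_path)
print(only_first)   # ['over', 'there is']
print(only_second)  # ['over there', 'is']

# Assumption: use the longest differing phrase of each path for substitution.
first_phrase = max(only_first, key=len)    # 'there is'
second_phrase = max(only_second, key=len)  # 'over there'
```

Both selected phrases cover the word "there", which matches the observation that the first and second phrases overlap in at least one word.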
The derived phrase generator 110 may then determine a first set of derived phrases that are semantically related to the first phrase. The derived phrases may be determined based on the semantic vectors of the first phrase and of candidate phrases. The candidate phrases may be pre-stored in a phrase database, and each phrase in the database may be compared to the first phrase based on its semantic vector. Semantic vectors are mathematical models used to represent text (e.g., phrases) as vectors. In general, the difference between the semantic vectors of two phrases may be represented by the cosine distance between the two vectors. In some embodiments, the first set of derived phrases may include synonyms of the first phrase. As shown in Fig. 1, communication interface 102 may be further used to retrieve synonyms corresponding to the phrases from phrase database 122. For example, the first phrase "there is" may have several synonyms, such as "there are", "there's", "exist", "have", "has", and the like. In some embodiments, the processor 104 may also determine a second set of derived phrases that are semantically related to the second phrase. The second set of derived phrases associated with the second phrase "over there" may contain synonyms such as "there", "over here", and the like.
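As a rough illustration of selecting derived phrases by semantic-vector similarity, the sketch below ranks candidate phrases by cosine similarity to a target phrase. The three-dimensional vectors, the `min_similarity` cutoff, and the function names are all invented for the example; the description does not specify how the vectors are obtained or what cutoff is used.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def derived_phrases(phrase, vectors, min_similarity=0.8):
    """Return candidates at least `min_similarity` close to `phrase`, most similar first."""
    target = vectors[phrase]
    scored = [
        (cosine_similarity(target, vec), candidate)
        for candidate, vec in vectors.items()
        if candidate != phrase
    ]
    return [c for sim, c in sorted(scored, reverse=True) if sim >= min_similarity]


# Toy vectors, purely illustrative; a real phrase database would supply these.
TOY_VECTORS = {
    "there is": [1.0, 0.0, 0.2],
    "there are": [0.9, 0.1, 0.2],
    "exist": [0.8, 0.3, 0.1],
    "watching TV": [0.0, 1.0, 0.0],
}

print(derived_phrases("there is", TOY_VECTORS))  # ['there are', 'exist']
```

Cosine distance, as mentioned in the text, is simply one minus this similarity, so ranking by highest similarity is equivalent to ranking by lowest distance.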
Given the first set of derived phrases, score determination unit 112 may replace the first phrase with each derived phrase in the first set and determine a first evaluation score based on the modified sentences. For example, the sentence "The man/over/there is/watching TV" may be modified by replacing the first phrase "there is" with the synonyms "there are", "there's", "exist", "have", "has", and the like. Accordingly, a plurality of modified sentences may be generated, including "The man/over/there are/watching TV", "The man/over/there's/watching TV", "The man/over/exist/watching TV", "The man/over/have/watching TV", "The man/over/has/watching TV", and the like. In some embodiments, a language model score may be determined for each modified sentence using the language model, and the language model scores may be averaged to derive the first evaluation score. The language model may evaluate the segmentation path according to natural language rules. In some embodiments, the five modified sentences above may be scored 47, 68, 35, 33, and 42, respectively. Thus, the first evaluation score may be the average of these scores, i.e., 45 points. As an example, for the modified sentence "The man/over/there are/watching TV", the language model may determine that a singular subject with a plural verb does not comply with natural language rules, and the modified sentence therefore receives a low score. It will be appreciated that a low score for a single modified sentence does not by itself determine whether a segmentation path is appropriate.
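The substitution-and-averaging step can be illustrated as follows. The helper `modified_paths` is a hypothetical name, the synonym list is an assumed reading of the example, and the five scores are simply the figures quoted in the text rather than the output of any real language model.

```python
from statistics import mean


def modified_paths(path, phrase, derived):
    """Return copies of `path` in which `phrase` is replaced by each derived phrase."""
    index = path.index(phrase)
    return [path[:index] + [d] + path[index + 1:] for d in derived]


first_path = ["The man", "over", "there is", "watching TV"]
derived = ["there are", "there's", "exist", "have", "has"]  # assumed synonym list

variants = modified_paths(first_path, "there is", derived)
print("/".join(variants[0]))  # The man/over/there are/watching TV

# Language model scores quoted in the description for the five modified sentences.
lm_scores = [47, 68, 35, 33, 42]
first_evaluation_score = mean(lm_scores)
print(first_evaluation_score)  # 45
```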
The original sentence comprising the first phrase may also be evaluated by the language model to generate a first base score. For example, the original sentence "The man/over/there is/watching TV" may be scored 68 points, which can be taken as the first base score.
Similarly, the score determination unit 112 may replace the second phrase (e.g., "over there") with each derived phrase in the second set and determine a second evaluation score based on the modified sentences. For example, the second evaluation score may be determined to be 67 points and the second base score may be determined to be 69 points.
The language model may be trained for a specified language such as English, Chinese, Japanese, and the like. For the sentence in Example I above, an English language model may be used. The language model may be a generic model pre-stored in the memory 116 or a model trained for a particular domain (e.g., novels, law, navigation, etc.) through, for example, machine learning.
It will be appreciated that when a phrase in an appropriate segmentation path is replaced by its synonyms, the modified sentences should still convey a similar meaning and follow natural language rules. That is, when the language model evaluates the segmentation path based on the original sentence and the modified sentences, the resulting language model scores should be close. On the other hand, when a phrase in an inappropriate segmentation path is replaced by a synonym for that phrase, the modified sentence may have a different meaning or may not even conform to natural language rules. That is, the difference between the language model scores of the original sentence and the modified sentences may be amplified, so that the processor 104 may determine that the segmentation path is not appropriate. In some embodiments, a threshold may be predetermined, such as 5 points. When the difference is greater than or equal to the threshold, the corresponding segmentation path may be determined to be an unsuitable segmentation path. It will be appreciated that the predetermined threshold may be different for different language models.
The segmentation unit 114 may further segment the sentence based on the evaluation score (e.g., the first and/or second evaluation scores). For example, for the first segmentation path, a difference between the first base score (e.g., 68 points) and the first evaluation score (e.g., 45 points) may be determined. The difference is 68 - 45 = 23 points, which is greater than the threshold (e.g., 5 points). Accordingly, the processor 104 may determine that the first segmentation path is an unsuitable segmentation path. On the other hand, for the second segmentation path, a difference between the second base score (e.g., 69 points) and the second evaluation score (e.g., 67 points) may be determined. The difference is 69 - 67 = 2 points, which is less than the threshold (e.g., 5 points). Accordingly, the segmentation unit 114 may determine that the second segmentation path is a suitable segmentation path and segment the sentence according to the second segmentation path.
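The threshold test on the two paths of Example I can be reproduced with a small sketch. The function name `is_suitable` is hypothetical, and using the absolute difference (so that a modified sentence scoring higher than the original is treated the same way as one scoring lower) is an assumption; the description only speaks of "the difference" between the base score and the evaluation score.

```python
def is_suitable(base_score, evaluation_score, threshold=5.0):
    """Treat a path as suitable when synonym substitution moves the score by less than the threshold."""
    # Assumption: the absolute difference is used; a difference >= threshold means unsuitable.
    return abs(base_score - evaluation_score) < threshold


# (base score, evaluation score) pairs taken from Example I.
example_scores = {
    "first": (68, 45),
    "second": (69, 67),
}

for name, (base, evaluation) in example_scores.items():
    verdict = "suitable" if is_suitable(base, evaluation) else "unsuitable"
    print(f"{name} path: difference {base - evaluation}, {verdict}")
# first path: difference 23, unsuitable
# second path: difference 2, suitable
```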
As described above, a first base score (e.g., 68) corresponding to a first segmentation path and a second base score (e.g., 69) corresponding to a second segmentation path are close, and selecting one segmentation path instead of the other based on these base scores may sometimes lead to errors. That is, a base score of 69 does not necessarily indicate that the corresponding segmentation path is better than the segmentation path having a base score of 68. However, by replacing the identified phrases in the first and second segmentation paths with synonyms, the difference between the first and second segmentation paths may be amplified so that the segmentation unit 114 may more accurately select one path from the plurality of paths based on the evaluation score associated with the corresponding path.
According to an embodiment of the present application, the segmentation unit 114 may segment the sentence based on a difference between the first evaluation score and the second evaluation score. For example, as described above, the first evaluation score and the second evaluation score are 45 points and 67 points, respectively. Accordingly, the segmentation unit 114 may select one of the first and second segmentation paths (e.g., the second segmentation path) having a higher evaluation score and segment the sentence according to the selected segmentation path.
It will be appreciated that all of the scores and thresholds described above are merely illustrative and may be modified if desired.
In addition to Example I in English, the system 100 can process a text sentence in another language, such as Chinese. Fig. 2 further illustrates two exemplary segmentation paths of a Chinese sentence according to some embodiments of the present application. The Chinese sentence means that Florence is powerful and Ewing is in the mean of nail. The pinyin is marked under each Chinese character of the sentence in the original text.
As shown in Fig. 2, in Example II, a Chinese sentence may be segmented according to first and second segmentation paths. The first and second segmentation paths may be compared to identify that phrase 202 and phrase 204 are different. The phrase identifier 108 may then identify the phrase 意甲 in the first segmentation path as the first phrase associated with the first segmentation path; the phrase 意甲 means "Serie A", the Italian top-division football league. The phrase identifier 108 may also identify the phrase 在意 in the second segmentation path as the second phrase associated with the second segmentation path; the phrase 在意 means "care", "mind", and the like. It is noted that the phrases 意甲 and 在意 share the same Chinese character 意.
The derived phrase generator 110 may determine a first set of derived phrases and a second set of derived phrases that are semantically related to the first phrase and the second phrase, respectively. For example, derived phrases semantically related to the first phrase 意甲 ("Serie A") may include 西甲 ("La Liga"), 英超 ("Premier League"), 德甲 ("Bundesliga"), and the like. Derived phrases semantically related to the second phrase 在意 ("care") may include phrases meaning "care", "care about", "worry", and the like.
Modified sentences may be generated by replacing the first and second phrases with the corresponding derived phrases, respectively. Then, based on the original sentence and the modified sentences, the score determination unit 112 may evaluate the first and second segmentation paths using a Chinese language model. For example, the first and second segmentation paths based on the original sentence may be scored 58.342 and 59.081, respectively, and the first and second segmentation paths based on the modified sentences may be scored 58.561 and 34.952, respectively.
Based on the evaluation scores, the segmentation unit 114 may select the first segmentation path, which has the higher evaluation score (58.561 as opposed to 34.952), and segment the sentence accordingly. Note that in this example, selecting a segmentation path based on the base scores would result in an error, because the second segmentation path actually has a slightly higher base score.
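To make the contrast concrete, the short sketch below selects a path once by base score and once by evaluation score, using the four figures quoted for Example II. The dictionary layout and variable names are illustrative only.

```python
# Scores quoted in the description for the two segmentation paths of Example II.
paths = {
    "first": {"base": 58.342, "evaluation": 58.561},
    "second": {"base": 59.081, "evaluation": 34.952},
}

best_by_base = max(paths, key=lambda name: paths[name]["base"])
best_by_evaluation = max(paths, key=lambda name: paths[name]["evaluation"])

print(best_by_base)        # second
print(best_by_evaluation)  # first
```

The base scores differ by less than one point, while the evaluation scores differ by more than twenty, which is the amplification (and here, reversal) effect the description relies on.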
The system 100 described above can amplify (and sometimes reverse) the difference between two or more segmentation paths by replacing recognized phrases with synonyms and select an appropriate path for segmenting a sentence when the language model cannot distinguish between them.
Another aspect of the present application relates to a method for segmenting a sentence. Fig. 3 is a flow diagram of an exemplary method 300 for segmenting a sentence according to some embodiments of the present application. For example, the method 300 may be implemented by a segmentation device and may include steps S302-S308.
In step S302, the segmentation device may identify a first phrase in the sentence that is associated with the first segmentation path. In some embodiments, the segmentation device may generate at least two segmentation paths of the sentence. Each of the segmentation paths may include a plurality of phrases. The segmentation device may compare the plurality of phrases and identify phrases in the first segmentation path that are different from corresponding phrases in other segmentation paths. Similarly, the segmentation device may identify a phrase in the second segmentation path. In this way, the segmentation device may identify a first phrase in the sentence that is associated with the first segmentation path and a second phrase in the sentence that is associated with the second segmentation path. Since the first and second segmentation paths are different segmentation paths of the same sentence or the same part of a sentence, the first and second phrases may have at least one overlapping word.
In step S304, the segmentation device may determine a first set of derived phrases semantically related to the first phrase. The derived phrases may be determined based on the semantic vectors of the first phrase and of candidate phrases. The candidate phrases may be pre-stored in a phrase database, and each phrase in the database may be compared to the first phrase based on its semantic vector. In some embodiments, the first set of derived phrases may include synonyms of the first phrase. Similarly, the segmentation device may determine a second set of derived phrases that are semantically related to the second phrase.
In step S306, the segmentation device may determine a first evaluation score based on the modified sentences generated by replacing the first phrase with each derived phrase in the first set. A language model score may be determined for each modified sentence using the language model, and the language model scores may be averaged to derive the first evaluation score. The language model may evaluate the segmentation path according to natural language rules. The segmentation device may further determine the second evaluation score based on modified sentences generated by replacing the second phrase with each derived phrase in the second set. A first base score and a second base score may also be determined based on the sentence. For example, the original sentence containing the first phrase and the original sentence containing the second phrase may be scored by the language model to generate the first and second base scores, respectively. It is to be understood that averaging is merely one example of evaluating the segmentation path. The individual scores may be manipulated or combined in any suitable manner to arrive at an evaluation score. For example, instead of a direct average of the individual scores, the evaluation score may be a weighted average of the individual scores, and the weights may correspond to the semantic proximity of the individual synonyms to the phrase. As another example, the change in the language model score of a single modified sentence may be used to determine whether the language model score has changed significantly. If the language model score changes significantly, it may indicate that the corresponding segmentation path is not appropriate.
In step S308, the segmentation device may segment the sentence based on the first evaluation score. In some embodiments, the segmentation device may compare the first evaluation score with the first base score and segment the sentence according to the first segmentation path when the difference between the first base score and the first evaluation score is less than a threshold. As described above, when a sentence is properly segmented, the evaluation score based on the modified sentences should be close to the base score based on the original sentence. In other embodiments, the segmentation device may segment the sentence based on a difference between the first evaluation score and the second evaluation score. For example, the segmentation device may select whichever of the first and second segmentation paths has the higher evaluation score, and segment the sentence according to the selected segmentation path.
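The weighted-average variant mentioned in step S306 might look like the sketch below. The function name and the weights are hypothetical; the weights stand in for how semantically close each synonym is to the original phrase (for instance, a cosine similarity), which the description leaves open.

```python
def weighted_evaluation_score(lm_scores, weights):
    """Weighted average of language model scores; the weights need not sum to one."""
    if len(lm_scores) != len(weights):
        raise ValueError("lm_scores and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(s * w for s, w in zip(lm_scores, weights)) / total_weight


lm_scores = [47, 68, 35, 33, 42]  # scores quoted for the five modified sentences of Example I

# With equal weights the result reduces to the plain average.
print(weighted_evaluation_score(lm_scores, [1.0] * 5))  # 45.0

# Hypothetical proximity weights: closer synonyms influence the score more.
proximity = [0.9, 1.0, 0.5, 0.4, 0.6]
weighted = weighted_evaluation_score(lm_scores, proximity)
```

With non-uniform weights, sentences built from close synonyms dominate the evaluation score, so a loosely related synonym that happens to score badly has less influence on the decision.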
Yet another aspect of the application relates to a non-transitory computer-readable medium storing instructions that, when executed, cause one or more processors to perform the method, as described above. The computer-readable medium may include volatile or nonvolatile, magnetic, semiconductor, tape, optical, removable, non-removable, or other types of computer-readable medium or computer-readable storage device. For example, as disclosed, the computer-readable medium may be a storage device or memory module having stored thereon computer instructions. In some embodiments, the computer readable medium may be a disk or flash drive having computer instructions stored thereon.
It will be apparent to those skilled in the art that various modifications and variations can be made to the disclosed segmentation system and associated methods. Other embodiments will be apparent to those skilled in the art from consideration of the specification and practice of the disclosed system and associated method. Although the embodiments are described with respect to two segmentation paths as an example, the described segmentation systems and methods may be applied to more than two segmentation paths.
It is intended that the specification and examples be considered as exemplary only, with a true scope being indicated by the following claims and their equivalents.

Claims (20)

1. A computer-implemented method for segmenting a sentence, comprising:
identifying, by a processor, a first phrase in the sentence that is related to a first segmentation path;
determining, by the processor, a first set of derived phrases semantically related to the first phrase;
determining, by the processor, a first evaluation score based on modified sentences generated by replacing the first phrase with the respective derived phrases in the first set; and
segmenting the sentence based on the first evaluation score.
2. The method of claim 1, wherein determining the first evaluation score comprises:
determining a language model score for each modified sentence; and
averaging the language model scores to derive the first evaluation score.
3. The method of claim 2, wherein the language model score is generated using a language model according to natural language rules.
4. The method of claim 1, further comprising:
determining a first base score based on the sentence comprising the first phrase; and
segmenting the sentence based on a difference between the first base score and the first evaluation score.
5. The method of claim 4, wherein the sentence is segmented according to the first segmentation path when a difference between the first base score and the first evaluation score is less than a threshold.
6. The method of claim 1, further comprising:
identifying a second phrase in the sentence that is related to a second segmentation path;
determining a second set of derived phrases semantically related to the second phrase;
determining a second evaluation score based on modified sentences generated by replacing the second phrase with the respective derived phrases in the second set; and
segmenting the sentence based on a difference between the first evaluation score and the second evaluation score.
7. The method of claim 6, wherein the second phrase has at least one overlapping word with the first phrase.
8. The method of claim 6, further comprising:
identifying the first phrase and the second phrase based on a difference between the first segmentation path and the second segmentation path.
9. The method of claim 6, further comprising:
selecting whichever of the first segmentation path and the second segmentation path has the higher evaluation score; and
segmenting the sentence according to the selected segmentation path.
10. The method of claim 1, wherein the derived phrase is determined based on a semantic vector between the first phrase and a candidate phrase.
11. A system for segmenting sentences, comprising:
a communication interface for receiving the sentence;
a memory for storing sentences and language models; and
a processor for:
identifying a first phrase in the sentence that is related to a first segmentation path;
determining a first set of derived phrases semantically related to the first phrase;
determining a first evaluation score based on modified sentences generated by replacing the first phrase with the respective derived phrases in the first set; and
segmenting the sentence based on the first evaluation score.
12. The system of claim 11, wherein the processor is further configured to:
determine a language model score for each modified sentence; and
average the language model scores to derive the first evaluation score.
13. The system of claim 12, wherein the language model score is generated using a natural language rules based language model.
14. The system of claim 11, wherein the processor is further configured to:
determine a first base score based on the sentence comprising the first phrase; and
segment the sentence based on a difference between the first base score and the first evaluation score.
15. The system of claim 14, wherein the sentence is segmented according to the first segmentation path when a difference between the first base score and the first evaluation score is less than a threshold.
16. The system of claim 11, wherein the processor is further configured to:
identify a second phrase in the sentence that is related to a second segmentation path;
determine a second set of derived phrases semantically related to the second phrase;
determine a second evaluation score based on a modified sentence generated by replacing the second phrase with the corresponding derived phrase in the second set; and
segment the sentence based on a difference between the first evaluation score and the second evaluation score.
17. The system of claim 16, wherein the second phrase has at least one overlapping word with the first phrase.
18. The system of claim 16, wherein the processor is further configured to:
identify the first phrase and the second phrase based on a difference between the first segmentation path and the second segmentation path.
19. The system of claim 16, wherein the processor is further configured to:
select one of the first segmentation path and the second segmentation path having a higher evaluation score; and
segment the sentence according to the selected segmentation path.
20. A non-transitory computer-readable medium storing a set of instructions that, when executed by at least one processor of a segmentation device, cause the segmentation device to perform a method for segmenting a sentence, the method comprising:
identifying a first phrase in the sentence that is associated with a first segmentation path;
determining a first set of derived phrases semantically related to the first phrase;
determining, by the processor, a first evaluation score based on a modified sentence generated by replacing the first phrase with the respective derived phrase in the first set; and
segmenting the sentence based on the first evaluation score.
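
The scoring steps recited in the claims above can be illustrated with a minimal sketch. It is not the patented implementation: the function names, the toy unigram scorer, the token strings, and the use of an absolute difference for the threshold test are all assumptions made for illustration. It shows an evaluation score computed as the average language model score over modified sentences, a threshold check against a base score, and selection of the path with the higher evaluation score.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

Path = List[str]                           # one candidate segmentation of a sentence
Scorer = Callable[[Sequence[str]], float]  # any language model returning a score


def evaluation_score(path: Path, phrase: str, derived: Sequence[str],
                     lm_score: Scorer) -> float:
    """Average the language model scores of the modified sentences,
    one modified sentence per derived phrase."""
    modified = [[alt if tok == phrase else tok for tok in path] for alt in derived]
    return mean([lm_score(sentence) for sentence in modified])


def accept_path(base: float, evaluation: float, threshold: float) -> bool:
    """Keep the path when substitution changes the score by less than the
    threshold (assumption: the difference is taken as an absolute value)."""
    return abs(base - evaluation) < threshold


def select_path(scores: Dict[str, float]) -> str:
    """Return the name of the path with the higher evaluation score."""
    return max(scores, key=scores.get)


# Hypothetical scorer: sum of made-up per-token log-probabilities.
# Tokens missing from the table get a fixed penalty of -10.0.
TOY_LOGPROB = {"ab": -1.0, "c": -1.0, "xy": -1.5, "a": -2.0, "bc": -6.0, "bd": -7.0}


def toy_lm_score(tokens: Sequence[str]) -> float:
    return sum(TOY_LOGPROB.get(tok, -10.0) for tok in tokens)


# Two segmentation paths for the same toy sentence "abc".
path_1, path_2 = ["ab", "c"], ["a", "bc"]
score_1 = evaluation_score(path_1, "ab", ["xy"], toy_lm_score)
score_2 = evaluation_score(path_2, "bc", ["bd"], toy_lm_score)
best = select_path({"path_1": score_1, "path_2": score_2})
keep_first = accept_path(toy_lm_score(path_1), score_1, threshold=1.0)
```

In this toy setup the first path keeps a plausible score after substitution, so it is both selected and accepted. A real system would replace `toy_lm_score` with an actual language model and obtain the derived phrases from semantic similarity, neither of which is shown here.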
CN201780093452.0A 2017-07-31 2017-07-31 System and method for segmenting sentences Active CN110945514B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2017/095305 WO2019023893A1 (en) 2017-07-31 2017-07-31 System and method for segmenting a sentence

Publications (2)

Publication Number Publication Date
CN110945514A (en) 2020-03-31
CN110945514B (en) 2023-08-25

Family

ID=65232340

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201780093452.0A Active CN110945514B (en) 2017-07-31 2017-07-31 System and method for segmenting sentences

Country Status (5)

Country Link
US (1) US11132506B2 (en)
EP (1) EP3642733A4 (en)
CN (1) CN110945514B (en)
TW (1) TWI676167B (en)
WO (1) WO2019023893A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3642733A4 (en) * 2017-07-31 2020-07-22 Beijing Didi Infinity Technology and Development Co., Ltd. System and method for segmenting a sentence
CN110134949B (en) * 2019-04-26 2022-10-28 网宿科技股份有限公司 Text labeling method and equipment based on teacher supervision
US11544466B2 (en) 2020-03-02 2023-01-03 International Business Machines Corporation Optimized document score system using sentence structure analysis function

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006018354A (en) * 2004-06-30 2006-01-19 Advanced Telecommunication Research Institute International Text division device and natural language processor
JP2007286901A (en) * 2006-04-17 2007-11-01 Mitsuyoshi Tsukahara Sentence analyzing device
JP2008021093A (en) * 2006-07-12 2008-01-31 National Institute Of Information & Communication Technology Sentence conversion processing system, translation processing system having sentence conversion function, voice recognition processing system having sentence conversion function, and speech synthesis processing system having sentence conversion function
JP2012042991A (en) * 2010-08-12 2012-03-01 Fuji Xerox Co Ltd Sentence generation program and sentence generation device
US20120078612A1 (en) * 2010-09-29 2012-03-29 Rhonda Enterprises, Llc Systems and methods for navigating electronic texts
CN103035243A (en) * 2012-12-18 2013-04-10 中国科学院自动化研究所 Real-time feedback method and system of long voice continuous recognition and recognition result
CN103544309A (en) * 2013-11-04 2014-01-29 北京中搜网络技术股份有限公司 Splitting method for search string of Chinese vertical search
US20140156260A1 (en) * 2012-11-30 2014-06-05 Microsoft Corporation Generating sentence completion questions
CN104464757A (en) * 2014-10-28 2015-03-25 科大讯飞股份有限公司 Voice evaluation method and device
JP2017004179A (en) * 2015-06-08 2017-01-05 日本電信電話株式会社 Information processing method, device, and program
CN106407235A (en) * 2015-08-03 2017-02-15 北京众荟信息技术有限公司 A semantic dictionary establishing method based on comment data

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7680648B2 (en) * 2004-09-30 2010-03-16 Google Inc. Methods and systems for improving text segmentation
JP4481972B2 (en) * 2006-09-28 2010-06-16 株式会社東芝 Speech translation device, speech translation method, and speech translation program
CN101261623A (en) * 2007-03-07 2008-09-10 国际商业机器公司 Word splitting method and device for word border-free mark language based on search
CN102479191B (en) * 2010-11-22 2014-03-26 阿里巴巴集团控股有限公司 Method and device for providing multi-granularity word segmentation result
TWI501097B (en) * 2012-12-22 2015-09-21 Ind Tech Res Inst System and method of analyzing text stream message
US20140244240A1 (en) * 2013-02-27 2014-08-28 Hewlett-Packard Development Company, L.P. Determining Explanatoriness of a Segment
CN103793491B (en) * 2014-01-20 2017-01-25 天津大学 Chinese news story segmentation method based on flexible semantic similarity measurement
CN103927358B (en) * 2014-04-15 2017-02-15 清华大学 text search method and system
TW201715420A (en) * 2015-10-30 2017-05-01 元智大學 Method and system for analysing the weight of text
US20190179316A1 (en) * 2016-08-25 2019-06-13 Purdue Research Foundation System and method for controlling a self-guided vehicle
US11087068B2 (en) * 2016-10-31 2021-08-10 Fujifilm Business Innovation Corp. Systems and methods for bringing document interactions into the online conversation stream
EP3642733A4 (en) * 2017-07-31 2020-07-22 Beijing Didi Infinity Technology and Development Co., Ltd. System and method for segmenting a sentence
CN110998589B (en) * 2017-07-31 2023-06-27 北京嘀嘀无限科技发展有限公司 System and method for segmenting text

Also Published As

Publication number Publication date
US11132506B2 (en) 2021-09-28
TW201911289A (en) 2019-03-16
WO2019023893A1 (en) 2019-02-07
TWI676167B (en) 2019-11-01
US20200160000A1 (en) 2020-05-21
EP3642733A1 (en) 2020-04-29
CN110945514B (en) 2023-08-25
EP3642733A4 (en) 2020-07-22

Similar Documents

Publication Publication Date Title
US9582489B2 (en) Orthographic error correction using phonetic transcription
WO2018149209A1 (en) Voice recognition method, electronic device, and computer storage medium
CN107301860B (en) Voice recognition method and device based on Chinese-English mixed dictionary
US9672817B2 (en) Method and apparatus for optimizing a speech recognition result
Schuster et al. Japanese and korean voice search
EP2863300B1 (en) Function execution instruction system, function execution instruction method, and function execution instruction program
US10140976B2 (en) Discriminative training of automatic speech recognition models with natural language processing dictionary for spoken language processing
US11132506B2 (en) System and method for segmenting a sentence
KR102441063B1 (en) Apparatus for detecting adaptive end-point, system having the same and method thereof
US20110054901A1 (en) Method and apparatus for aligning texts
CN106297797A (en) Method for correcting error of voice identification result and device
US9779728B2 (en) Systems and methods for adding punctuations by detecting silences in a voice using plurality of aggregate weights which obey a linear relationship
CN111369974B (en) Dialect pronunciation marking method, language identification method and related device
US11282521B2 (en) Dialog system and dialog method
US20180130465A1 (en) Apparatus and method for correcting pronunciation by contextual recognition
CN108573707B (en) Method, device, equipment and medium for processing voice recognition result
JP2007041319A (en) Speech recognition device and speech recognition method
US20060241936A1 (en) Pronunciation specifying apparatus, pronunciation specifying method and recording medium
CN109614623B (en) Composition processing method and system based on syntactic analysis
KR20150086086A (en) server for correcting error in voice recognition result and error correcting method thereof
KR101086550B1 (en) System and method for recommendding japanese language automatically using tranformatiom of romaji
TWI713870B (en) System and method for segmenting a text
KR102166446B1 (en) Keyword extraction method and server using phonetic value
CN109002454B (en) Method and electronic equipment for determining spelling partition of target word
US11361761B2 (en) Pattern-based statement attribution

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant