WO2018184510A1 - Word segmentation method, apparatus, and storage medium - Google Patents

Word segmentation method, apparatus, and storage medium

Info

Publication number
WO2018184510A1
Authority
WO
WIPO (PCT)
Prior art keywords
input
text
target
string
character string
Prior art date
Application number
PCT/CN2018/081536
Other languages
English (en)
French (fr)
Inventor
樊林
Original Assignee
腾讯科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Publication of WO2018184510A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates

Definitions

  • The present application relates to the field of data processing technologies, and in particular to a word segmentation method and apparatus.
  • In Chinese text, the character is the basic writing unit. There are no spaces between words, and a word may consist of a single character or of several characters; that is, the number of characters in a word is not fixed. Therefore, the first step in understanding text entered in Chinese is word segmentation, that is, correctly splitting the text into words.
  • The commonly used word segmentation methods fall mainly into three types: methods based on string matching, methods based on understanding, and methods based on statistics.
  • A method based on string matching matches the Chinese character string to be analyzed against the entries of a machine dictionary according to a certain strategy. If a string is found in the dictionary, the match succeeds (a word is recognized).
  • A method based on understanding recognizes words by having the computer simulate human understanding of the sentence, but it requires a large amount of linguistic knowledge and information.
  • A method based on statistics counts how often combinations of adjacent characters co-occur in a corpus, calculates their mutual information, and uses the adjacent co-occurrence probability of two Chinese characters to decide whether to split between them.
  • The embodiments of the present application propose a word segmentation method that improves the accuracy of segmenting text input by a user.
  • A word segmentation method, including:
  • detecting a text input operation in a text input component, the text input operation comprising one or more string input operations in sequence;
  • each target string being the string entered by one string input operation;
  • the delay input duration being the interval between the string input operation corresponding to the target string and the string input operation corresponding to the adjacent target string;
  • segmenting the input text by using the delay input duration of each target string as a word segmentation condition.
  • The embodiments of the present application also propose a word segmentation device.
  • A word segmentation device, comprising:
  • a machine readable instruction module stored in a memory;
  • the machine readable instruction module including:
  • a text input operation detecting module configured to detect a text input operation in the text input component, the text input operation comprising one or more string input operations in sequence;
  • an input text display module configured to sequentially splice the one or more target strings entered by the one or more string input operations, and to display the spliced input text in the text input component,
  • each target string being the string entered by one string input operation;
  • a delay input duration calculation module configured to obtain the delay input duration of each target string, where the delay input duration is the interval between the string input operation corresponding to the target string and the string input operation corresponding to the adjacent target string;
  • a word segmentation module configured to segment the input text by using the delay input duration of each target string as a word segmentation condition.
  • The embodiments of the present application also provide a non-transitory computer readable storage medium storing machine readable instructions that are executable by a processor to perform the above method.
  • FIG. 1 is a schematic flow chart of a word segmentation method in an embodiment
  • FIG. 2 is a schematic diagram of input through a text input component in one embodiment
  • FIG. 3 is a schematic diagram of input through a text input component in one embodiment
  • FIG. 4 is a schematic diagram of input through a text input component in one embodiment
  • FIG. 5 is a schematic structural diagram of a word segmentation device in an embodiment
  • FIG. 6 is a schematic structural diagram of a computer device running the foregoing word segmentation method in an embodiment.
  • Some word segmentation methods segment text according to a fixed pattern, yet some texts admit several segmentations, each with a different meaning, which makes the segmentation result inaccurate. Moreover, text input reflects the user's habits, for example typing habits or typos.
  • Traditional word segmentation methods do not take into account the real circumstances under which the user inputs the text and ignore the user's real needs, which also leads to insufficient segmentation accuracy.
  • a word segmentation method is proposed, and the implementation of the method may depend on a computer program, which can be run on the basis of Von Neumann
  • the computer program can be a word segmentation application on a terminal or server, or a text word segmentation application integrated in an application that provides a text input component to receive a string or text entered by the user.
  • the computer system may be a server or terminal such as a smartphone, tablet, personal computer or the like that runs the above computer program.
  • the above word segmentation method includes the following steps S102-S108:
  • Step S102: detect a text input operation in the text input component, the text input operation including one or more string input operations in sequence.
  • a text input component refers to a component that allows a user to input text, such as a text input box, a document input interface, a search input box, and the like, which allow a user to input a single or multiple lines of text.
  • the user can input through the displayed text input box 201, and after completing the input, click the "send" button to send the input text.
  • The text input component is not limited to the text input box 201 in the chat interface of the instant messaging software shown in FIG. 2. It may be another text input box, for example a search input box, or the text editing window of text editing software such as Word; all of these are text input components.
  • The scene in which the text input component is displayed is not limited. It may be the search input box of a search interface, the text input box of a dialog window, or the text editing window of a document editing page, as long as the user can enter text through the displayed text input component.
  • the user input operation can be detected through the text input component, and the specific content input by the user can be obtained, and the operation of inputting the specific content by the user through the text input component is the text input operation.
  • A single input action by the user is a string input operation.
  • The user's text input operation includes at least one string input operation. What the user enters in one operation may be a single character or several characters; for example, a word containing several characters can be entered at once through an input method (in the application scenario shown in FIG. 2, the user types "zhuanli" in a pinyin input method and enters the string "专利" ("patent") in one operation).
  • Most users enter text through an input method (for example, a pinyin input method), such as the pinyin input method included in the Sogou input method installed on the terminal.
  • When the user inputs through the installed pinyin input method, candidate word display boxes 202a-202d corresponding to the typed pinyin are displayed, and the user can select the character or string to enter from candidate word display boxes 202a-202d.
  • Because the input method automatically displays matching candidate words as the user types, the user needs fewer operations to select the target characters among the displayed candidates. Most users therefore enter the text they need through string input operations.
  • For example, when the user types "zhuan", candidate word display box 202b shown in FIG. 3 is displayed, and the character "专" is entered through the selection operation corresponding to the number 3;
  • when the user then types "li", candidate word display box 202c shown in FIG. 4 is displayed, and the user must perform a page-turning operation and the selection operation corresponding to the number "2" to enter "利", completing the string "专利" ("patent").
  • Alternatively, the user can type "zhuanli" and enter the string "专利" ("patent") in one operation: a single selection operation corresponding to the number "1" selects it in candidate word display box 202a.
  • The target string entered by the user through one string input operation may be a single character or a string composed of several characters. That is, one string input operation corresponds to the string the user enters into the text input component at one time.
  • When the string contains several characters, those characters are entered into the text input component at the same time; that is, all characters in the string have the same input time.
  • The text input operation includes at least one string input operation in sequence.
  • Step S104: sequentially splice the one or more target strings entered by the string input operations, and display the spliced input text in the text input component.
  • There may be one or more string input operations and, correspondingly, one or more target strings; the target strings are spliced in input order.
  • each character string input operation corresponds to an input character string, that is, a target character string corresponding to the string input operation.
  • The input time of a string input operation is the input time of the target string corresponding to that operation. The input times of all target strings in the user's text input operation can therefore be determined, and any two target strings have different input times.
  • The target string of each detected string input operation is acquired together with its input time (that is, the input time of the string input operation). The input order of the target strings is then determined from their input times, and the target strings are spliced into the input text and displayed in the text input component.
  • The input text displayed in the text input component is spliced from the target strings of the string input operations performed so far, and the user can continue to perform string input operations through the text input component to enter further target strings.
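  • The splicing in step S104 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `TargetString` record, the `splice` helper, and the sample strings and timestamps are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetString:
    text: str         # string entered by one string input operation
    timestamp: float  # input time of that operation, in seconds (assumed unit)


def splice(targets):
    """Order target strings by input time and join them into the input text."""
    ordered = sorted(targets, key=lambda t: t.timestamp)
    return "".join(t.text for t in ordered), ordered


# Hypothetical operations, deliberately listed out of order.
ops = [TargetString("专利", 2.4), TargetString("我", 0.0), TargetString("申请", 1.1)]
input_text, ordered = splice(ops)
print(input_text)  # 我申请专利
```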
  • Step S106: acquire the delay input duration of each target string, where the delay input duration is the interval between the string input operation corresponding to the target string and the string input operation corresponding to the adjacent target string.
  • The delay input duration of a target string is the time difference between the input time of that target string and the input time of an adjacent target string, that is, the interval between the string input operation corresponding to the target string and the string input operation corresponding to the adjacent target string.
  • The adjacent target string may be the preceding adjacent target string (the target string whose input time is immediately before that of the target string) or the following adjacent target string (the target string whose input time is immediately after it); this embodiment does not limit the choice.
  • However, one of the two must be chosen, and the same choice must be used when calculating the delay input duration of every target string.
  • In one case, the following adjacent target string is used, and acquiring the delay input duration of the target string specifically includes: obtaining a first timestamp of the string input operation corresponding to the target string and a second timestamp of the string input operation corresponding to the following adjacent target string; and determining the delay input duration of the target string from the first timestamp and the second timestamp.
  • The input time of the string input operation of the target string is the timestamp of that operation, that is, the first timestamp; the input time of the string input operation of the following adjacent target string is the timestamp of that operation, that is, the second timestamp.
  • The interval between the first timestamp and the second timestamp is the interval between the input times of the target string and the following adjacent target string, and this interval is the delay input duration of the target string.
  • In the other case, the interval between the input times of the current target string and the preceding adjacent target string is used as the delay input duration, and acquiring the delay input duration of the target string specifically includes: obtaining a third timestamp of the string input operation corresponding to the target string and a fourth timestamp of the string input operation corresponding to the preceding adjacent target string; and determining the delay input duration of the target string from the third timestamp and the fourth timestamp.
  • The input time of the string input operation of the target string is the timestamp of that operation, that is, the third timestamp; the input time of the string input operation of the preceding adjacent target string is the timestamp of that operation, that is, the fourth timestamp.
  • The interval between the third timestamp and the fourth timestamp is the interval between the input times of the target string and the preceding adjacent target string, and this interval is the delay input duration of the target string.
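  • Both variants of step S106 can be sketched in one helper. The function name, the `neighbor` parameter, and the use of `None` for a target string that has no neighbor on the chosen side are assumptions made for illustration; the source does not say how the first or last string is handled.

```python
def delay_durations(timestamps, neighbor="next"):
    """Return one delay input duration per target string.

    timestamps: input times in seconds, in input order.
    neighbor: "next" uses the following adjacent string (first/second
    timestamp case); "prev" uses the preceding one (third/fourth case).
    A string with no neighbor on that side gets None (an assumption).
    """
    if neighbor not in ("next", "prev"):
        raise ValueError("neighbor must be 'next' or 'prev'")
    n = len(timestamps)
    delays = []
    for i in range(n):
        j = i + 1 if neighbor == "next" else i - 1
        if 0 <= j < n:
            delays.append(abs(timestamps[j] - timestamps[i]))
        else:
            delays.append(None)
    return delays


times = [0.0, 0.5, 2.5, 3.0]  # hypothetical input times
print(delay_durations(times, "next"))  # [0.5, 2.0, 0.5, None]
print(delay_durations(times, "prev"))  # [None, 0.5, 2.0, 0.5]
```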
  • Step S108: segment the input text by using the delay input duration of each target string as a word segmentation condition.
  • For example, the word "火锅" ("hot pot") is generally entered in one operation by typing the pinyin "huoguo", rather than character by character, which would require selecting each single character among the candidate options.
  • The operation by which the user enters "火锅" is a single string input operation, and the timestamps of the characters "火" ("fire") and "锅" ("pot") are identical.
  • While typing, the intervals between entered characters or strings vary because of thinking, pauses, and so on: some intervals are short and others are long.
  • Between characters or strings that are not semantically connected, the interval between input times is generally greater than between characters that are semantically connected or inseparable. That is, a large interval between the input times of two characters or strings means that the user paused for a relatively long time, whether out of habit, because of the flow of expression, or in order to think; in that case, the two characters or strings are semantically discontinuous or should be split.
  • The input text may therefore be segmented according to the delay input durations of the target strings; that is, the delay input duration of each target string is used as a word segmentation condition for segmenting the input text.
  • In other words, the delay input duration of a target string is an influencing factor in the segmentation decision: whether a target string is split from its adjacent target string depends on its delay input duration.
  • The following takes as an example the case where the delay input duration of a target string is the interval between its input time and that of the following adjacent target string.
  • The process of segmenting the input text by using the delay input duration of the target string as a word segmentation condition is specifically: acquiring, in the input text, the end position of each target string whose delay input duration is greater than or equal to a first threshold; dividing the input text into at least one text segment at those end positions; and performing word segmentation processing on each text segment separately.
  • The longer the delay input duration of a target string, the longer the pause between it and the following adjacent target string, and the lower the probability that the two are semantically connected or inseparable. Therefore, when the delay input duration of a target string is greater than or equal to a preset time threshold (the first threshold), the text is split between the target string and the following adjacent target string, that is, at the end position of the target string.
  • The input text may include more than one target string, and more than one of them may have a delay input duration that satisfies the split condition. Therefore, after the input text is split at the end positions of the target strings whose delay input duration is greater than or equal to the preset time threshold, it may be divided into several text segments, each containing one or more target strings.
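  • A minimal sketch of this splitting step, assuming the delay of each target string is measured to the following adjacent string and that the last string has no delay (`None`). The function name and the sample strings, delays, and threshold are hypothetical.

```python
def split_by_delay(strings, delays, first_threshold):
    """Split the spliced text after every target string whose delay
    input duration is greater than or equal to the first threshold.

    strings: target strings in input order.
    delays: delay to the following adjacent string (None for the last).
    """
    segments, current = [], []
    for text, delay in zip(strings, delays):
        current.append(text)
        if delay is not None and delay >= first_threshold:
            segments.append("".join(current))  # cut at this string's end position
            current = []
    if current:
        segments.append("".join(current))
    return segments


strings = ["我", "昨晚", "给你", "发了", "邮件"]  # hypothetical target strings
delays = [0.4, 2.5, 0.3, 2.2, None]               # hypothetical delays in seconds
print(split_by_delay(strings, delays, first_threshold=2.0))
# ['我昨晚', '给你发了', '邮件']
```

  • Each returned text segment would then be passed to a further word segmentation step.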
  • When the user types manually, if the user types quickly or does not pause, the interval between entering two characters or strings that should be split may also be small, so no split is made between them when the input text is split according to delay input duration.
  • The resulting text segments can therefore be segmented further, for example with other word segmentation methods or word segmentation components, to improve the accuracy and validity of the segmentation.
  • For example, splitting the input text by delay input duration yields "I just bought one ticket / tomorrow morning at 7 o'clock / fly / Beijing ticket"; obviously, this is not the final word segmentation result.
  • After further segmentation, the final word segmentation result is "I / just / buy / / / / / / tomorrow / morning / seven o'clock / fly / Beijing / / ticket".
  • The word segmentation method adopted may be a method based on string matching, that is, a mechanical word segmentation method.
  • The Chinese character string to be analyzed is matched against the entries of a machine dictionary. If a string is found in the dictionary, the match succeeds (a word is recognized).
  • By scanning direction, string matching methods can be divided into forward matching and reverse matching; by which length is matched first, they can be divided into maximum (longest) matching and minimum (shortest) matching.
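  • As an illustration of one such variant, forward maximum matching can be sketched as below. The dictionary contents, the `max_len` limit, and the fallback of emitting an unmatched character as a one-character word are assumptions for the example, not details from the source.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Scan left to right, always taking the longest dictionary entry
    that starts at the current position; an unmatched character is
    emitted on its own."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


vocab = {"昨晚", "邮件", "发了"}  # hypothetical machine dictionary
print(forward_max_match("我昨晚发了邮件", vocab))
# ['我', '昨晚', '发了', '邮件']
```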
  • A word segmentation method based on understanding can also be used: the computer simulates human understanding of the sentence to recognize words, that is, it performs syntactic and semantic analysis while segmenting and uses syntactic and semantic information to resolve ambiguity.
  • A word segmentation method based on statistics can also be employed.
  • A word is a stable combination of characters, so the more often adjacent characters co-occur in context, the more likely they are to form a word. The frequency or probability with which adjacent characters co-occur therefore reflects well how credible it is that they form a word.
  • The frequency of each combination of adjacent characters co-occurring in the corpus can be counted and their mutual information calculated.
  • The mutual information reflects how closely the Chinese characters are bound together.
  • When the closeness exceeds a certain threshold, the character group may be considered to constitute a word. This method only needs to count the frequencies of character groups in the corpus and does not need a segmentation dictionary, so it is also called dictionary-free word segmentation or statistical word extraction.
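  • The source does not give a formula for mutual information. One common choice is pointwise mutual information, PMI(a, b) = log2(P(ab) / (P(a) P(b))), which the sketch below computes for adjacent character pairs. The function name and the tiny corpus are hypothetical, and a realistic corpus would be far larger.

```python
import math
from collections import Counter


def pmi_scores(corpus):
    """Pointwise mutual information of each adjacent character pair."""
    chars, pairs = Counter(), Counter()
    for sentence in corpus:
        chars.update(sentence)                     # single-character counts
        pairs.update(zip(sentence, sentence[1:]))  # adjacent-pair counts
    n_chars = sum(chars.values())
    n_pairs = sum(pairs.values())
    return {
        a + b: math.log2((count / n_pairs)
                         / ((chars[a] / n_chars) * (chars[b] / n_chars)))
        for (a, b), count in pairs.items()
    }


corpus = ["火锅好吃", "吃火锅", "火锅店"]  # hypothetical corpus
scores = pmi_scores(corpus)
print(scores["火锅"] > scores["吃火"])  # True
```

  • A pair whose score exceeds a chosen threshold would be treated as a candidate word. Note that this measure tends to favor rare pairs, so practical systems usually combine it with frequency cutoffs.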
  • A word is composed of several Chinese characters or of several English characters, not of a combination of Chinese and non-Chinese characters. That is, the position between a Chinese character and a non-Chinese character is a pause or split point.
  • The text can therefore be split directly between Chinese characters and such non-Chinese characters.
  • Specifically, the positions of the separators contained in the input text are obtained, and the input text is divided into at least one text segment at those positions; each text segment is then segmented with reference to the delay input durations of the target strings it contains.
  • The separators are punctuation marks, English symbols, English letters, and other non-Chinese characters.
  • The input text can first be split at the positions of the separators it contains, separating the separators from the other characters, so that the input text is divided into several text segments; each text segment is then segmented.
  • For example, the input text is "I sent you a message on QQ at 7:00 last night, did you see it?"
  • Splitting this input text at the separators gives "I was last night / 7 / point I sent you a message on /QQ/, / Did you see it?", which contains four text segments: "I last night", "pointing on", "sending a message to you", and "Do you see it?". Word segmentation is then performed on each of the four text segments separately.
  • Splitting the input text at the separators in advance reduces the amount of calculation in the subsequent word segmentation processing and improves its efficiency.
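  • A minimal sketch of separator-based splitting. Two things here are assumptions: "Chinese character" is approximated by the CJK Unified Ideographs block U+4E00 to U+9FFF, and the Chinese sentence is a reconstruction of the translated example above, not text from the source.

```python
import re

# Assumption: a "Chinese character" is any code point in U+4E00..U+9FFF.
# Everything else (digits, Latin letters, punctuation) acts as a separator.
CHINESE_RUN = re.compile(r"[\u4e00-\u9fff]+")


def split_on_separators(text):
    """Return the runs of Chinese characters between separators."""
    return CHINESE_RUN.findall(text)


# Reconstructed (assumed) original of the translated example sentence.
segments = split_on_separators("我昨晚7点在QQ上给你发了消息，你看到了吗？")
print(segments)
# ['我昨晚', '点在', '上给你发了消息', '你看到了吗']
```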
  • A character or string can be entered manually by handwriting or with a soft keyboard, or several strings can be entered at once by copying and pasting. In the latter case, all the copied and pasted characters or strings have the same input time.
  • When the input text contains at least one run of adjacent strings entered by copying and pasting, all the copied and pasted characters are placed directly in one text segment according to their delay input durations, and that text segment is then subjected to word segmentation processing.
  • Alternatively, the process of segmenting the input text by using the delay input duration of the target string as a word segmentation condition is specifically: segmenting the input text with a word segmentation component to obtain at least one target word; acquiring, within each target word, the end position of each target string whose delay input duration is greater than or equal to the first threshold; and splitting the target word at that end position.
  • The input text is first segmented by another word segmentation algorithm, for example a minimum segmentation algorithm, to obtain at least one target word. Because the target words obtained this way may be incompletely segmented, the delay input durations of all characters or strings contained in each target word are acquired, and the target word is then split according to those delay input durations.
  • The word segmentation component may be based on any word segmentation algorithm (including a user-defined one) or on several algorithms, for example a forward maximum matching method, a reverse maximum matching method, a minimum segmentation method, a bidirectional matching method, or a full segmentation algorithm.
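  • The refinement of component output by pauses can be sketched as follows, assuming that the target words and the target strings both cover the same input text in order, so that character offsets line up. The function name, sample data, and threshold are hypothetical.

```python
def split_words_at_pauses(words, strings, delays, first_threshold):
    """Split each target word wherever a long pause falls inside it.

    words: target words from a word segmentation component.
    strings, delays: target strings in input order with the delay to
    the following adjacent string (None for the last one).
    """
    # Character offsets that end a target string followed by a long pause.
    cuts, offset = set(), 0
    for text, delay in zip(strings, delays):
        offset += len(text)
        if delay is not None and delay >= first_threshold:
            cuts.add(offset)
    result, start = [], 0
    for word in words:
        piece_start = start
        for pos in range(start + 1, start + len(word)):
            if pos in cuts:  # pause strictly inside this word
                result.append(word[piece_start - start:pos - start])
                piece_start = pos
        result.append(word[piece_start - start:])
        start += len(word)
    return result


words = ["我", "故宫博物院"]         # hypothetical component output
strings = ["我", "故宫", "博物院"]   # hypothetical target strings
delays = [0.3, 2.5, None]
print(split_words_at_pauses(words, strings, delays, first_threshold=2.0))
# ['我', '故宫', '博物院']
```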
  • The target words obtained by segmenting the input text with the word segmentation component may also be over-segmented; that is, a multi-character word that should not be split has been split. In this case, it is determined whether a target string lies across the junction of two adjacent target words (a target string that was split apart by the adjacent target words); if so, the two adjacent target words are merged.
  • Specifically, after the input text is segmented by the word segmentation component, the method further includes: acquiring the target string at the junction between adjacent target words in the input text and merging those adjacent target words. If no target string lies at the junction between adjacent target words, the merging step is skipped, and the method proceeds directly to acquiring, within the target words, the end position of each target string whose delay input duration is greater than or equal to the first threshold and splitting the target words at that end position.
  • For example, the input text is "I sent you an email last night", and the target words obtained by segmenting it with the word segmentation component are "I/yester/night/give/you/sent/one/seal/mail".
  • However, "yester" and "night" were entered by a single string input operation; that is, "last night" is a string that cannot be divided. "Last night" is thus the target string at the junction between the target words "yester" and "night".
  • The target words "yester" and "night" are therefore merged to obtain "I/last night/give/you/sent/one/seal/mail".
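  • This merge can be sketched by checking whether a word boundary falls strictly inside a target string. The Chinese tokens below are an assumed rendering of the translated example, and the function name is hypothetical; the sketch also assumes words and target strings cover the same text in order.

```python
def merge_split_strings(words, strings):
    """Merge adjacent target words whose boundary falls inside a single
    target string (a string entered by one string input operation)."""
    # Character offsets lying strictly inside some target string.
    inside, offset = set(), 0
    for text in strings:
        inside.update(range(offset + 1, offset + len(text)))
        offset += len(text)
    merged, pos = [], 0
    for word in words:
        if merged and pos in inside:
            merged[-1] += word  # boundary split a target string: undo it
        else:
            merged.append(word)
        pos += len(word)
    return merged


# Assumed Chinese rendering of the example; "昨晚" was entered in one operation.
words = ["我", "昨", "晚", "给", "你", "发了", "一", "封", "邮件"]
strings = ["我", "昨晚", "给", "你", "发了", "一", "封", "邮件"]
print(merge_split_strings(words, strings))
# ['我', '昨晚', '给', '你', '发了', '一', '封', '邮件']
```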
  • Adjacent target words need to be merged not only when a target string at their junction has been split, but also when two adjacent target strings lie at the junction and the interval between their input times is sufficiently small.
  • Specifically, after the input text is segmented by the word segmentation component, the method further includes: acquiring the target strings at the junction between adjacent target words in the input text (the target string in the preceding target word that adjoins the following target word, and the target string in the following target word that adjoins the preceding target word); determining the delay input duration of that target string; and, when the delay input duration is less than a preset second threshold, merging the adjacent target words.
  • For example, the input text entered by the user is "I visited the Forbidden City Museum today", and the target words obtained by segmenting it with the word segmentation component are "I/Today/Visit/Yu/Forbidden City/Museum".
  • The target strings at the junction between the target words "Forbidden City" and "Museum" are the two target strings "Forbidden City" and "Museum". The interval between the string input operation of "Forbidden City" and that of the following adjacent target string "Museum" is calculated to be 0.5 s, and the preset second threshold is 1 s, so the delay input duration of the target string "Forbidden City" is less than the second threshold. In this case, no split is considered necessary between the two target strings "Forbidden City" and "Museum", so the two target words are merged to obtain "I/Today/Visit/Yu/Forbidden City Museum".
  • In a specific implementation, the timestamp of every character in the input text is determined. After at least one target word has been determined, the timestamp of the last character of a target word and the timestamp of the first character of the following adjacent target word are acquired, and the interval between the two timestamps is calculated. If the interval is less than a preset time threshold (the second threshold, which is less than the first threshold), it is determined that there should be no split between the target word and the following adjacent target word, and the split between them can be cancelled.
  • For example, the second threshold may be 0.3 s. If the interval between the input times of the two characters is less than 0.3 s (for example, 0 s), the split between the target word and the following adjacent target word is cancelled, that is, the target word is merged with the following adjacent target word.
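  • The second-threshold merge can be sketched as below. The 0.5 s gap and the 1 s threshold come from the Forbidden City example; the other gaps, the Chinese tokens, and the function name are hypothetical.

```python
def merge_by_short_gap(words, boundary_gaps, second_threshold):
    """Merge adjacent target words separated by a very short pause.

    boundary_gaps[i]: interval in seconds between the last character of
    words[i] and the first character of words[i + 1].
    """
    merged = [words[0]] if words else []
    for word, gap in zip(words[1:], boundary_gaps):
        if gap < second_threshold:
            merged[-1] += word  # cancel the split at this boundary
        else:
            merged.append(word)
    return merged


words = ["我", "今天", "参观", "了", "故宫", "博物院"]  # assumed rendering
gaps = [1.2, 1.5, 1.1, 1.3, 0.5]                        # only 0.5 is from the text
print(merge_by_short_gap(words, gaps, second_threshold=1.0))
# ['我', '今天', '参观', '了', '故宫博物院']
```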
  • In another embodiment, the input speeds of different users also need to be considered. For example, a user who types on a computer frequently types noticeably faster than an elderly user who seldom uses a computer. In this case, if the same time threshold is used for all users when segmenting according to the delay input duration, segmentation may be incomplete or inaccurate.
  • Therefore, in a specific embodiment, the word segmentation method further includes: detecting a text input speed in the text input component, and determining the first threshold according to the text input speed. That is, while the user enters text through the text input component, the user's text input speed (for example, the average number of characters entered per unit time) is also detected, and the specific value of the first threshold is then determined according to that speed.
  • For example, the system presets a correspondence between text input speeds and values of the first threshold. After text input through the text input component is detected, the first threshold corresponding to the detected text input speed is looked up in the preset correspondence and used as the first threshold in the segmentation process.
  • In another embodiment, the text input speed may be not only the speed detected while text is entered through the text input component, but also a speed determined from the user's historical string input operations. For example, historical data on the text input speed of the user's string input operations or text input operations is obtained, the user's text input speed is determined from the historical data, and the corresponding first threshold is then determined from that speed.
  • It should be noted that, in this embodiment, the historical data may be the historical data of text input operations associated with the currently logged-in account (for example, an account logged in to the input method or to the word segmentation application), or the historical data of text input operations performed on the current terminal.
  • In this embodiment, if the text input component is displayed in the conversation page of a chat window, the user clicks the send button after entering the text to be sent, and the input text is segmented at that point. That is, while the user is typing, the text input component only detects the user's text input operations, the strings or text they enter, and the corresponding timestamps; the input text is segmented only when the user clicks the send button.
  • In other application scenarios, the user can also trigger segmentation through other operations, such as submitting, importing, or saving the input text.
  • In a specific embodiment, while the user performs text input operations through the text input component, the target string and timestamp of each string input operation are detected; after the user completes all text input, the target strings and their timestamps are sent to the word segmentation module for processing.
  • a word segmentation device is also proposed.
  • the word segmentation device includes a text input operation detecting module 102, an input text display module 104, a delay input duration calculating module 106, and a word segmentation module 108, wherein:
  • the text input operation detecting module 102 is configured to detect a text input operation in the text input component, where the text input operation includes one or more string input operations in sequence;
  • the input text display module 104 is configured to concatenate, in order, the one or more target strings entered by the one or more string input operations, and to display the resulting input text in the text input component, where each target string is the string entered by one string input operation;
  • the delay input duration calculation module 106 is configured to obtain a delay input duration of each target string, where the delay input duration is the interval between the string input operation corresponding to the target string and the string input operation corresponding to an adjacent target string;
  • the word segmentation module 108 is configured to segment the input text by using a delay input duration of each target character string as a word segmentation condition.
  • the word segmentation module 108 is further configured to segment the input text through a word segmentation component to obtain at least one target word, the target word including a plurality of target strings; obtain, within the target word, the end position of each target string whose delay input duration is greater than or equal to the first threshold; and segment the target word according to the end position.
  • the word segmentation module 108 is further configured to obtain a first target string and a second target string at the junction between two adjacent target words in the input text, where the first target string and the second target string belong to the two target words respectively; and, if the interval between the string input operations corresponding to the first target string and the second target string is less than a predetermined second threshold, merge the two adjacent target words.
  • the word segmentation module 108 is further configured to obtain, in the input text, the end position of each target string whose delay input duration is greater than or equal to the first threshold; split the input text into at least one text segment according to the end position; and segment each text segment separately.
  • the delay input duration calculation module 106 is further configured to obtain a first timestamp of the string input operation corresponding to each target string and a second timestamp of the string input operation corresponding to its adjacent target string, and to determine the delay input duration of the target string according to the first timestamp and the second timestamp.
  • the word segmentation module 108 is further configured to obtain the positions of the separators contained in the input text, and split the input text into at least one text segment according to those positions; and segment each text segment according to the delay input durations of the target strings in that text segment.
  • the apparatus further includes a threshold determining module 110, configured to detect a text input speed in the text input component, and determine the first threshold according to the text input speed, where The text input speed is the average number of characters input per unit time.
  • With the above word segmentation method and apparatus, when the user enters text through the text input component, the input time of each character or string is recorded and the interval between two adjacent characters or strings is calculated. When that interval is greater than a preset value, a split is made between the two adjacent characters or strings during segmentation. That is, segmentation of the input text takes into account the time at which the user entered each character or string: if the interval between two characters or strings is long, they are considered not to be semantically connected and are separated. The embodiments of the present application thus take the user's actual input behavior into account during segmentation, which improves segmentation accuracy.
  • The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • When implemented by a software program, they may be implemented in whole or in part in the form of a computer program product.
  • The computer program product includes one or more computer instructions.
  • When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part.
  • The computer can be a general purpose computer, a special purpose computer, a computer network, or another programmable device.
  • The computer instructions can be stored in a computer readable storage medium or transferred from one computer readable storage medium to another. For example, the computer instructions can be transferred from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (for example, infrared, radio, or microwave).
  • The computer readable storage medium can be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media.
  • The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
  • In one embodiment, FIG. 6 shows a terminal with a von Neumann architecture computer system that runs the above word segmentation method.
  • The computer system can be a smartphone, a tablet computer, a palmtop computer, a notebook computer, a personal computer, a head-mounted device, a wearable device, or a smart home device such as a smart speaker.
  • Specifically, it may include an external input interface 1001, a processor 1002, a memory 1003, and an output interface 1004 connected through a system bus.
  • The external input interface 1001 can include at least a network interface 10012.
  • The memory 1003 may include an external memory 10032 (for example, a hard disk, an optical disk, or a floppy disk) and an internal memory 10034.
  • The output interface 1004 can connect at least to devices such as the display 10042.
  • In this embodiment, the method runs as a computer program whose program file is stored in the external memory 10032 of the aforementioned von Neumann architecture computer system. At run time the program is loaded into the internal memory 10034, compiled into machine code, and passed to the processor 1002 for execution, so that a logical text input operation detecting module 102, input text display module 104, delay input duration calculation module 106, word segmentation module 108, and threshold determination module 110 are formed in the computer system.
  • During execution of the word segmentation method, input parameters are received through the external input interface 1001, buffered in the memory 1003, and then passed to the processor 1002 for processing. The resulting data is either cached in the memory 1003 for subsequent processing or passed to the output interface 1004 for output.
  • processor 1002 is configured to perform the following operations:
  • detecting a text input operation in a text input component, the text input operation including one or more sequential string input operations;
  • concatenating, in order, the one or more target strings entered by the one or more string input operations, and displaying the resulting input text in the text input component, where each target string is the string entered by one string input operation;
  • obtaining a delay input duration of each target string, where the delay input duration is the interval between the string input operation corresponding to the target string and the string input operation corresponding to an adjacent target string;
  • segmenting the input text using the delay input duration of each target string as a word segmentation condition.
  • the processor 1002 is further configured to segment the input text through a word segmentation component to obtain at least one target word, the target word including a plurality of target strings; obtain, within the target word, the end position of each target string whose delay input duration is greater than or equal to the first threshold; and segment the target word according to the end position.
  • the processor 1002 is further configured to obtain a first target string and a second target string at the junction between two adjacent target words in the input text, where the first target string and the second target string belong to the two target words respectively; and, if the interval between the string input operations corresponding to the first target string and the second target string is less than a preset second threshold, merge the two adjacent target words.
  • the processor 1002 is further configured to obtain, in the input text, the end position of each target string whose delay input duration is greater than or equal to the first threshold; split the input text into at least one text segment according to the end position; and segment each text segment separately.
  • the processor 1002 is further configured to obtain a first timestamp of the string input operation corresponding to the target string and a second timestamp of the string input operation corresponding to its adjacent target string, and to determine the delay input duration of the target string according to the first timestamp and the second timestamp.
  • the processor 1002 is further configured to obtain the positions of the separators contained in the input text, and split the input text into at least one text segment according to those positions; and segment each text segment according to the delay input durations of the target strings in that text segment.
  • the processor 1002 is further configured to detect a text input speed in the text input component and determine the first threshold according to the text input speed, where the text input speed is the average number of characters entered per unit time.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • User Interface Of Digital Computer (AREA)
  • Machine Translation (AREA)

Abstract

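Embodiments of the present application disclose a word segmentation method, including: detecting a text input operation in a text input component, the text input operation including one or more sequential string input operations; concatenating, in order, the one or more target strings entered by the one or more string input operations, and displaying the resulting input text in the text input component, where each target string is the string entered by one string input operation; obtaining a delay input duration of each target string, the delay input duration being the interval between the string input operation corresponding to the target string and the string input operation corresponding to an adjacent target string; and segmenting the input text using the delay input duration of each target string as a word segmentation condition.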

Description

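Word segmentation method, apparatus, and storage medium
This application claims priority to Chinese Patent Application No. 201710224889.5, entitled "Word segmentation method and apparatus" and filed with the Chinese Patent Office on April 7, 2017, which is incorporated herein by reference in its entirety.
Technical Field
The present application relates to the field of data processing technologies, and in particular to a word segmentation method and apparatus.
Background
Artificial intelligence (AI) is the simulation of the information processes of human consciousness and thinking. Understanding natural language entered by a user is an important topic in AI, especially for natural language entered in Chinese. Unlike English, where spaces conveniently mark word boundaries, Chinese text uses the character as its basic writing unit and has no spaces between words. A word may consist of a single character or of several characters, so the number of characters in a word is not fixed. The first step in understanding Chinese input text is therefore word segmentation, that is, correctly splitting the text into words.
Three kinds of word segmentation methods are in common use: methods based on string matching, methods based on understanding, and methods based on statistics. A string-matching method matches the Chinese character string to be analyzed against the entries of a machine dictionary according to a certain strategy; if a word is found in the dictionary, the match succeeds (a word is recognized). An understanding-based method has the computer simulate human comprehension of a sentence in order to recognize words, which requires a large amount of linguistic knowledge and information. A statistics-based method counts the frequency of combinations of characters that co-occur adjacently in a corpus, computes their mutual information and the adjacent co-occurrence probability of two Chinese characters, and uses these to decide whether a split is needed between the two.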
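Summary
Embodiments of the present application provide a word segmentation method to improve the accuracy of segmenting text entered by a user.
A word segmentation method, including:
detecting a text input operation in a text input component, the text input operation including one or more sequential string input operations;
concatenating, in order, the one or more target strings entered by the one or more string input operations, and displaying the resulting input text in the text input component, where each target string is the string entered by one string input operation;
obtaining a delay input duration of each target string, the delay input duration being the interval between the string input operation corresponding to the target string and the string input operation corresponding to an adjacent target string;
segmenting the input text using the delay input duration of each target string as a word segmentation condition.
In addition, embodiments of the present application provide a word segmentation apparatus.
A word segmentation apparatus, including:
a processor;
a memory connected to the processor, the memory storing machine-readable instruction modules, the machine-readable instruction modules including:
a text input operation detecting module, configured to detect a text input operation in a text input component, the text input operation including one or more sequential string input operations;
an input text display module, configured to concatenate, in order, the one or more target strings entered by the one or more string input operations, and to display the resulting input text in the text input component, where each target string is the string entered by one string input operation;
a delay input duration calculation module, configured to obtain a delay input duration of each target string, the delay input duration being the interval between the string input operation corresponding to the target string and the string input operation corresponding to an adjacent target string;
a word segmentation module, configured to segment the input text using the delay input duration of each target string as a word segmentation condition.
Embodiments of the present application further provide a non-volatile computer readable storage medium storing machine-readable instructions that can be executed by a processor to perform the above method.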
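Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some embodiments of the present application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
In the drawings:
FIG. 1 is a schematic flowchart of a word segmentation method in an embodiment;
FIG. 2 is a schematic diagram of input through a text input component in an embodiment;
FIG. 3 is a schematic diagram of input through a text input component in an embodiment;
FIG. 4 is a schematic diagram of input through a text input component in an embodiment;
FIG. 5 is a schematic structural diagram of a word segmentation apparatus in an embodiment;
FIG. 6 is a schematic structural diagram of a computer device that runs the foregoing word segmentation method in an embodiment.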
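Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
Some word segmentation methods segment according to one fixed pattern, whereas some text admits several segmentations, each of which may carry a different meaning, so the segmentation result is not accurate enough. In addition, users bring their own habits when entering text, such as typing habits or typos. Conventional word segmentation methods do not take into account the actual circumstances under which the user enters the text and ignore the user's real needs, which also leads to insufficient segmentation accuracy.
To improve the accuracy of segmenting Chinese text entered by a user, this embodiment provides a word segmentation method. The method can be implemented by a computer program running on a von Neumann architecture computer system. The computer program may be a word segmentation application on a terminal or a server, or a text segmentation program integrated into an application that provides a text input component for receiving strings or text entered by the user. The computer system may be a server or a terminal, such as a smartphone, a tablet computer, or a personal computer, that runs the computer program.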
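Specifically, as shown in FIG. 1, the word segmentation method includes the following steps S102 to S108:
Step S102: detect a text input operation in a text input component, the text input operation including one or more sequential string input operations.
A text input component is a component in which the user can enter text, for example a text input box, a document input interface, or a search input box, in which the user can enter one or more lines of text. For example, in the chat interface of the instant messaging software shown in FIG. 2, the user can type in the displayed text input box 201 and, after finishing, click the "Send" button to send the entered text. It should be noted that, in this embodiment, the text input component is not limited to the text input box 201 in the chat interface of FIG. 2. Other text input boxes, such as a search input box or the editing window of text editing software such as Word, are also text input components.
This embodiment does not restrict the scenario in which the text input component is displayed. It may be the search input box of a search interface, the text input box of a dialog window, or the editing window of a document editing page, as long as the user can enter text through the displayed text input component.
When the user enters text through the text input component, the component can detect the user's input operations and obtain the specific content entered. The operation by which the user enters specific content through the text input component is the text input operation.
It should be noted that, in this embodiment, a single input action by the user is a string input operation, and the user's text input operation includes at least one string input operation. What the user enters in one string input operation may be a single character or several characters. For example, with an input method the user can enter a multi-character word at once: in the scenario of FIG. 2, the user types "zhuanli" in a pinyin input method to enter the string "专利" ("patent") in one operation.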
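In some text input modes, most users type through an installed input method (for example, a pinyin input method), such as the pinyin input method included in the Sogou input method installed on the terminal. In the scenarios of FIGS. 2 to 4, the user types through an installed pinyin input method: after the user enters the pinyin for the target string, candidate boxes 202a to 202d corresponding to the entered pinyin are displayed, and the user selects the character or string to be entered from the displayed candidate boxes 202a to 202d.
Further, because the input method automatically displays words that match the user's input when showing candidates, it reduces the number of operations needed to select the target characters from the displayed candidates. Most users therefore enter each word they need with a single string input operation.
In a specific embodiment, as shown in FIG. 3, when the user types "zhuan", the candidate box 202b shown in FIG. 3 is displayed, and the user enters "专" with the selection operation corresponding to number 3. When the user types "li", the candidate box 202c shown in FIG. 4 is displayed, and the user needs one page-turn operation and one selection operation corresponding to number "2" to enter "利", which completes the input of the string "专利".
In the scenario of FIG. 2, by contrast, the user can type "zhuanli" to enter the string "专利" at once, and only needs the selection operation corresponding to number "1" to choose "专利" in the displayed candidate box 202a and complete the input.
Therefore, in this embodiment, the target string entered by one string input operation may be a single character or a string of several characters. That is, one string input operation corresponds to the string the user enters into the text input component at one time. When that string contains several characters, those characters are entered into the text input component together, so they all have the same input time.
In general, when the user enters text through the text input component, the strings that make up the text are entered one after another, and the input time of an earlier string in the text precedes that of a later string. That is, the text input operation includes at least one string input operation in sequence.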
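Step S104: concatenate, in order, the one or more target strings entered by the string input operations, and display the resulting input text in the text input component. In this embodiment, there are one or more string input operations and one or more target strings, and the target strings are concatenated in input order.
In this embodiment, each string input operation corresponds to one entered string, namely the target string of that string input operation. Accordingly, the input time of a string input operation is the input time of its target string. The input times of all target strings of the user's text input operation can therefore be determined, and any two target strings have different input times.
Based on the detected string input operations, the corresponding target strings and the input time of each target string (that is, the input time of the string input operation) are obtained. The input order of the target strings is then determined from their input times, and the target strings are concatenated in that order into the input text, which is displayed in the text input component.
It should be noted that, in this embodiment, the input text displayed in the text input component is the concatenation of the target strings of the string input operations completed so far. The user can continue to perform string input operations through the text input component to enter further target strings.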
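As an illustration of step S104, the ordering and concatenation can be sketched as follows. This is only a minimal sketch, assuming each string input operation is recorded as a (target string, timestamp) pair and that text is always appended at the end; the names `InputOp` and `splice` are hypothetical and do not come from the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InputOp:
    """One string input operation (hypothetical record type)."""
    text: str         # target string committed by the operation
    timestamp: float  # input time of the operation, in seconds


def splice(ops):
    """Order target strings by input time and concatenate them."""
    ordered = sorted(ops, key=lambda op: op.timestamp)
    return "".join(op.text for op in ordered), ordered


# Operations may arrive in any order; the timestamps decide the input order.
ops = [InputOp("专利", 2.0), InputOp("我", 0.0), InputOp("申请", 1.1)]
text, ordered = splice(ops)
print(text)
```

Keeping the ordered operation list alongside the text preserves the timestamps that the later steps need.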
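Step S106: obtain a delay input duration of each target string, the delay input duration being the interval between the string input operation corresponding to the target string and the string input operation corresponding to an adjacent target string.
The delay input duration of a target string is the time difference, or interval, between the input time of that target string and the input time of an adjacent target string, that is, the interval between the string input operation for the target string and the string input operation for the adjacent target string.
It should be noted that, in this embodiment, the adjacent target string used to determine the delay input duration may be the preceding adjacent target string (the target string whose input time is immediately before that of the target string) or the following adjacent target string (the target string whose input time is immediately after that of the target string). This embodiment does not limit the choice. However, one of the two must be chosen, and the same choice must be used when calculating the delay input duration of every target string.
In a specific embodiment, the interval between the input times of the current target string and the following adjacent target string is used as the delay input duration of the target string. In this case, obtaining the delay input duration of a target string includes: obtaining a first timestamp of the string input operation corresponding to the target string and a second timestamp of the string input operation corresponding to the following adjacent target string, and determining the delay input duration of the target string from the first timestamp and the second timestamp.
The input time of the string input operation that entered the target string is the timestamp of that operation, namely the first timestamp. The input time of the string input operation corresponding to the following adjacent target string is the timestamp of that operation, namely the second timestamp. The interval between the first and second timestamps is the interval between the input times of the target string and the following adjacent target string; in other words, it is the delay input duration of the target string.
In another specific embodiment, the interval between the input times of the current target string and the preceding adjacent target string is used as the delay input duration of the target string. In this case, obtaining the delay input duration of a target string includes: obtaining a third timestamp of the string input operation corresponding to the target string and a fourth timestamp of the string input operation corresponding to the preceding adjacent target string, and determining the delay input duration of the target string from the third timestamp and the fourth timestamp.
The input time of the string input operation that entered the target string is the timestamp of that operation, namely the third timestamp. The input time of the string input operation corresponding to the preceding adjacent target string is the timestamp of that operation, namely the fourth timestamp. The interval between the third and fourth timestamps is the interval between the input times of the target string and the preceding adjacent target string; in other words, it is the delay input duration of the target string.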
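The two variants of step S106 can be sketched in a few lines. This is an illustrative sketch only: the function name `delay_durations` and the `neighbor` parameter are assumptions, and the timestamps are taken to be already sorted in input order.

```python
def delay_durations(timestamps, neighbor="next"):
    """Return one delay input duration per target string.

    timestamps: input times of the target strings, in input order.
    neighbor:   "next" uses the following adjacent target string,
                "prev" uses the preceding one; the same choice is
                applied to every target string.
    The entry is None where no such neighbor exists.
    """
    if neighbor not in ("next", "prev"):
        raise ValueError("neighbor must be 'next' or 'prev'")
    step = 1 if neighbor == "next" else -1
    delays = []
    for i, t in enumerate(timestamps):
        j = i + step
        if 0 <= j < len(timestamps):
            delays.append(abs(timestamps[j] - t))
        else:
            delays.append(None)
    return delays


print(delay_durations([0.0, 1.5, 2.0]))          # following neighbor
print(delay_durations([0.0, 1.5, 2.0], "prev"))  # preceding neighbor
```

Passing the neighbor choice once for the whole list reflects the requirement that the same choice be used for every target string.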
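Step S108: segment the input text using the delay input duration of each target string as a word segmentation condition.
In general, when a user types text manually, some multi-character words are entered in one go, especially by users of pinyin input methods. The input method automatically matches words to the pinyin the user types, which spares the user from searching or paging through the candidates shown for each single character and so speeds up typing.
For example, when the user enters "我要吃火锅" ("I want to eat hot pot"), the two characters "火锅" are usually entered at once with the pinyin "huoguo", which avoids choosing among candidates for each single character. In this case, the user's input of "火锅" is one string input operation, and the characters "火" and "锅" have the same timestamp.
It should be noted that even if the user does not enter "火" and "锅" at once when typing "火锅" in "我要吃火锅", but enters the two characters separately, once the user has entered "火" the next character is certain to be "锅", and the user does not need to think or pause. The interval between the input times of "火" and "锅" is therefore also small. For example, following a typical user's input habit, "我...要...吃......火..锅", the interval between the input times of "火" and "锅" is clearly smaller than that between "吃" and "火".
Further, when a user types text manually, the intervals between the entered characters or strings vary because of thinking, pauses, and so on: some are short, and some are long.
For example, when the user enters "我明天早上七点的飞机去重庆" ("I have a flight to Chongqing at seven tomorrow morning"), the entered strings are "我", "明天", "早上", "七点", "的", "飞机", "去", and "重庆". Following a typical user's input habit, "我....明天...早上...七点...的...飞机...........去....重庆", the interval between "明天" and "早上" is clearly smaller than that between "飞机" and "去".
That is, when most users enter text, characters that are semantically connected or should not be split are entered at once or with a short interval between them, while the interval between the input times of characters that are not semantically connected or need to be split is generally larger. If there is a large interval between the input times of two characters or strings, the user paused for a relatively long time, whether because of expression habits, a break in continuity, or thinking. In this case, the two characters or strings are semantically discontinuous or should be split.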
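In this embodiment, once the delay input durations of all target strings have been determined, the input text can be split according to them; that is, the delay input duration of each target string is used as a word segmentation condition for segmenting the input text. Specifically, the delay input duration of a target string is treated as one factor in deciding where to place word boundaries: whether to split between a target string and its adjacent target string depends on the size of that target string's delay input duration.
A specific embodiment is described below, taking as an example the case where the delay input duration of a target string is determined by the interval between the input times of the target string and the following adjacent target string.
Segmenting the input text using the delay input duration of the target strings as a word segmentation condition includes: obtaining, in the input text, the end position of each target string whose delay input duration is greater than or equal to a first threshold; splitting the input text into at least one text segment according to the end positions; and segmenting each text segment separately.
For all target strings in the input text, the delay input duration of each is determined. Because the delay input duration of a target string represents the interval between its input time and that of the following adjacent target string, the larger the delay input duration, the longer the pause between the two, and the less likely they are to be semantically connected or inseparable. Therefore, when the delay input duration of a target string is greater than or equal to the preset time threshold (the first threshold), a split is made between the target string and the following adjacent target string, that is, at the end position of the target string.
It should be noted that, in this embodiment, the input text contains more than one target string, and more than one target string may have a delay input duration that satisfies the splitting condition. Therefore, after the input text is split at the end position of every target string whose delay input duration is greater than or equal to the preset time threshold, the input text may be divided into several text segments, each containing one or more target strings.
It should also be noted that, in this embodiment, if the user types quickly or without pausing, the interval between the input times of two characters or strings that should be split may also be small, so that no split is made there when splitting according to the delay input duration. In this case, the resulting text segments can be segmented further, for example with another word segmentation method or a word segmentation component, to improve the accuracy and effectiveness of segmentation.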
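For example, for the input text "我刚刚买了一张明天早上七点飞北京的机票" ("I just bought a ticket to fly to Beijing at seven tomorrow morning"), the delay input duration of each character or string is determined as follows (the delay input duration of a target string is the interval between its input time and that of the following adjacent target string):
Table 1
Character or string    Delay input duration (seconds)
我                     1.5
刚刚                   1.3
买                     0.5
了                     1.6
一张                   2.0
明天                   1.4
早上                   1.5
七点                   2.6
飞                     1.8
北京                   0.5
的                     1.6
机票                   (none; last target string)
With a preset first threshold of 1.8, segmenting the above input text yields "我刚刚买了一张/明天早上七点/飞/北京的机票". This is clearly not the final segmentation result. Continuing to segment this result with another segmentation algorithm yields the final result "我/刚刚/买/了/一/张/明天/早上/七点/飞/北京/的/机票".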
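The first-threshold split in this example can be reproduced with a short sketch. The function name `split_by_delay` is hypothetical, and the pairing of single characters with delay values follows the order of the example sentence, which is an inference from the surrounding text.

```python
def split_by_delay(strings, delays, first_threshold):
    """Split after every target string whose delay >= first_threshold."""
    segments, current = [], []
    for s, d in zip(strings, delays):
        current.append(s)
        if d is not None and d >= first_threshold:
            segments.append("".join(current))
            current = []
    if current:  # trailing strings after the last split point
        segments.append("".join(current))
    return segments


strings = ["我", "刚刚", "买", "了", "一张", "明天",
           "早上", "七点", "飞", "北京", "的", "机票"]
delays = [1.5, 1.3, 0.5, 1.6, 2.0, 1.4, 1.5, 2.6, 1.8, 0.5, 1.6, None]
print("/".join(split_by_delay(strings, delays, 1.8)))
# prints: 我刚刚买了一张/明天早上七点/飞/北京的机票
```

Each resulting text segment would then be passed to a conventional segmenter, as the description notes.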
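It should be noted that, in this embodiment, each text segment can be segmented with one known or user-defined word segmentation method, or with a combination of several. This embodiment does not limit which word segmentation method is used.
For example, the method may be a word segmentation method based on string matching, also called mechanical segmentation. The Chinese character string to be analyzed is matched against the entries of a machine dictionary according to a certain strategy; if a word is found in the dictionary, the match succeeds (a word is recognized). By scanning direction, string matching methods are divided into forward matching and reverse matching; by which length is matched first, they are divided into maximum (longest) matching and minimum (shortest) matching.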
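Forward maximum matching, one of the string-matching strategies named above, can be sketched as follows. The dictionary contents and the `max_len` limit are illustrative assumptions, and characters not covered by any dictionary entry fall back to single-character words.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right longest-match segmentation."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


dictionary = {"明天", "早上", "七点", "北京", "机票"}  # toy dictionary
print("/".join(forward_max_match("明天早上七点", dictionary)))
# prints: 明天/早上/七点
```

A reverse maximum matching variant would scan from the end of the text instead.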
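In another embodiment, an understanding-based word segmentation method may be used, in which the computer simulates human comprehension of a sentence to recognize words. Syntactic and semantic analysis is performed during segmentation, and syntactic and semantic information is used to resolve ambiguity.
In another embodiment, a statistics-based word segmentation method may be used. Formally, a word is a stable combination of characters, so the more often adjacent characters appear together in context, the more likely they are to form a word. The frequency or probability with which characters co-occur adjacently therefore reflects well how plausible it is that they form a word. The frequency of each combination of adjacently co-occurring characters in a corpus can be counted and their mutual information computed: the mutual information of two characters is defined, and the adjacent co-occurrence probability of the two Chinese characters is calculated. Mutual information reflects how tightly Chinese characters are bound together. When the tightness exceeds a certain threshold, the character group can be considered a possible word. Because this method only needs frequency statistics over character groups in the corpus and no segmentation dictionary, it is also called dictionary-free segmentation or statistical word extraction.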
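One common way to put a number on this binding strength is pointwise mutual information (PMI) between adjacent characters. The sketch below assumes that formulation, which the application does not spell out, and uses a toy three-sentence corpus chosen purely for illustration.

```python
import math
from collections import Counter


def adjacent_pmi(corpus):
    """PMI of every adjacent character pair seen in the corpus."""
    char_counts, pair_counts = Counter(), Counter()
    for sentence in corpus:
        char_counts.update(sentence)
        pair_counts.update(zip(sentence, sentence[1:]))
    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())
    pmi = {}
    for (a, b), n in pair_counts.items():
        p_ab = n / total_pairs
        p_a = char_counts[a] / total_chars
        p_b = char_counts[b] / total_chars
        pmi[a + b] = math.log2(p_ab / (p_a * p_b))
    return pmi


scores = adjacent_pmi(["我吃火锅", "火锅好吃", "我买火锅"])
# "火锅" recurs as a unit, so it scores higher than the accidental pair "吃火".
print(scores["火锅"] > scores["吃火"])
```

With a realistic corpus, pairs whose score exceeds a chosen threshold would be treated as candidate words.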
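When entering text through the text input component, the user may also enter punctuation marks, digits, English abbreviations or letters, or other non-Chinese symbols as needed. For example, in "我昨晚7点在QQ上给你发了消息,你看见了吗" ("I sent you a message on QQ at 7 last night, did you see it?"), "7", "QQ", and "," are all non-Chinese characters.
In general, when input text is segmented, a word consists of several Chinese characters or of several English characters, not of a mixture of Chinese and non-Chinese characters. That is, the boundary between a Chinese character and a non-Chinese character is a pause or a split position.
Therefore, when the input text contains punctuation marks, digits, English abbreviations or letters, or other non-Chinese symbols, a split can be made directly between the Chinese characters and these non-Chinese characters. Specifically, in segmenting the input text using the delay input duration of the target strings as a word segmentation condition, the positions of the separators contained in the input text are obtained, and the input text is split into at least one text segment according to those positions; each text segment is then segmented with reference to the delay input durations of the target strings in that text segment.
Here, separators are punctuation marks, English symbols, English letters, and other non-Chinese symbols. As the first step of segmenting the input text, the input text can be split at the positions of the separators it contains, separating the separators from the other characters and dividing the input text into several text segments, which are then segmented.
For example, when the input text is "我昨晚7点在QQ上给你发了消息,你看见了吗", splitting it at the separators yields "我昨晚/7/点在/QQ/上给你发了消息/,/你看见了吗", which contains the four text segments "我昨晚", "点在", "上给你发了消息", and "你看见了吗". These four text segments are then segmented separately. Splitting the input text at separators in advance reduces the amount of computation in the subsequent segmentation and improves segmentation efficiency.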
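The separator pre-split can be sketched with a regular expression. This sketch assumes that "Chinese character" means the CJK Unified Ideographs block U+4E00 to U+9FFF and that runs of digits and runs of Latin letters are each kept together; both choices, and the name `split_on_separators`, are illustrative assumptions.

```python
import re

# Runs of Chinese characters, runs of digits, runs of Latin letters,
# or any other single character (punctuation, symbols, whitespace).
_RUN = re.compile(r"[\u4e00-\u9fff]+|[0-9]+|[A-Za-z]+|[^\u4e00-\u9fff0-9A-Za-z]")


def split_on_separators(text):
    """Return (all pieces, Chinese-only text segments)."""
    pieces = _RUN.findall(text)
    segments = [p for p in pieces if "\u4e00" <= p[0] <= "\u9fff"]
    return pieces, segments


pieces, segments = split_on_separators("我昨晚7点在QQ上给你发了消息,你看见了吗")
print("/".join(pieces))
# prints: 我昨晚/7/点在/QQ/上给你发了消息/,/你看见了吗
print(segments)
# prints: ['我昨晚', '点在', '上给你发了消息', '你看见了吗']
```

Only the Chinese-only segments need further delay-based segmentation; the separator pieces are already final.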
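It should be noted that, when entering text manually, the user may enter a character or string by handwriting or with a soft keyboard, or may enter several strings at once by copy and paste. In the latter case, all pasted characters or strings have the same input time. When the input text contains at least one adjacent string entered by copy and paste, all pasted characters are placed directly into one text segment according to the delay input duration, and that text segment is then segmented.
In another embodiment, segmenting the input text using the delay input duration of the target strings as a word segmentation condition includes: segmenting the input text through a word segmentation component to obtain at least one target word; obtaining, within the target word, the end position of each target string whose delay input duration is greater than or equal to the first threshold; and segmenting the target word according to the end position.
That is, the input text to be segmented is first segmented with another segmentation algorithm, for example the minimum-split algorithm, to obtain at least one target word. Because the resulting target words may be incompletely segmented, the delay input durations of all characters or strings contained in each target word are obtained, and the target word is then segmented according to those delay input durations.
It should be noted that, in this embodiment, the word segmentation component may be based on any segmentation algorithm (including user-defined ones) or on several algorithms, for example one or a combination of forward maximum matching, reverse maximum matching, minimum splitting, bidirectional matching, or full segmentation.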
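It should be noted that, in this embodiment, the target words obtained by segmenting the input text with the word segmentation component may be incompletely split, so further segmentation is needed. In another embodiment, however, the target words obtained from the component may be over-split; that is, a word of more than one character that should not be split has been split apart. In this case, for all target words obtained from the input text, it can be checked whether a target string lies across the junction of adjacent target words (a target string that has been split apart by the two adjacent target words); if so, the two adjacent target words are merged.
Specifically, after segmenting the input text through the word segmentation component to obtain at least one target word, the method further includes: obtaining the target string at the junction between adjacent target words in the input text, and merging the adjacent target words. If no target string lies across the junction between adjacent target words in the input text, the merging step is skipped, and the method proceeds directly to obtaining, within the target word, the end position of each target string whose delay input duration is greater than or equal to the first threshold, and segmenting the target word according to the end position.
For example, in a specific embodiment, the input text is "我昨晚给你发了一封邮件" ("I sent you an email last night"), and segmenting it through the word segmentation component yields the target words "我/昨/晚/给/你/发了/一/封/邮件". According to the user's string input operations, however, "昨" and "晚" were entered in a single string input operation, so "昨晚" is a string that should not be split. In other words, "昨晚" is the target string at the junction between the target words "昨" and "晚". In this case, the target words "昨" and "晚" are merged to obtain "我/昨晚/给/你/发了/一/封/邮件".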
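The merge rule in this example amounts to keeping only those word boundaries that coincide with the end of a target string. A minimal sketch, in which the function name and the assumed list of target strings are hypothetical (the application only states that "昨晚" was entered in one operation):

```python
def merge_split_target_strings(tokens, target_strings):
    """Merge adjacent words whose junction falls inside one target string."""
    # Character offsets at which some target string ends.
    allowed, pos = set(), 0
    for s in target_strings:
        pos += len(s)
        allowed.add(pos)

    merged, current, pos = [], "", 0
    for tok in tokens:
        current += tok
        pos += len(tok)
        if pos in allowed:  # boundary does not cut through a target string
            merged.append(current)
            current = ""
    if current:
        merged.append(current)
    return merged


tokens = ["我", "昨", "晚", "给", "你", "发了", "一", "封", "邮件"]
# Assumed input operations: only "昨晚" spans two segmenter words.
target_strings = ["我", "昨晚", "给", "你", "发了", "一", "封", "邮件"]
print("/".join(merge_split_target_strings(tokens, target_strings)))
# prints: 我/昨晚/给/你/发了/一/封/邮件
```

A word that spans several whole target strings is left untouched, since its boundaries already coincide with target string ends.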
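In another embodiment, the cases in which adjacent target words need to be merged include not only a target string split at the junction of target words, as above, but also the case where two adjacent target strings lie on either side of the junction and the interval between their input times is sufficiently small.
That is, after segmenting the input text through the word segmentation component to obtain at least one target word, the method further includes: obtaining the target strings at the junction between adjacent target words in the input text (the target string in the preceding target word that is adjacent to the following target word, and the target string in the following target word that is adjacent to the preceding target word), determining the delay input duration of the target string, and merging the adjacent target words if that delay input duration is less than a preset second threshold.
For example, in a specific embodiment, the input text entered by the user is "我今天参观了故宫博物馆" ("I visited the Palace Museum today"), and segmenting it through the word segmentation component yields the target words "我/今天/参观/了/故宫/博物馆". The junction between the target words "故宫" and "博物馆" involves the two target strings "故宫" and "博物馆". The interval between the string input operation for "故宫" and that for the following adjacent target string "博物馆" is calculated to be 0.5 s, while the preset second threshold is 1 s, so the delay input duration of the target string "故宫" is less than the second threshold. In this case no split is considered necessary between the two target strings "故宫" and "博物馆", so the target words "故宫" and "博物馆" are merged to obtain "我/今天/参观/了/故宫博物馆".
Specifically, the timestamp of each character in the input text is determined, and at least one target word is determined. The timestamp of the last character of a target word and the timestamp of the first character of the following adjacent target word are obtained, and the time interval between the two timestamps is calculated. If the interval is less than a preset time threshold (that is, the second threshold, which is less than the first threshold), it is determined that the target word and the following adjacent target word should not be split, so the split between them can be canceled.
For example, the second threshold may be 0.3 s. If the interval between the input times of the two characters is less than 0.3 s (for example, 0 s), the split between the target word and the following adjacent target word is canceled, or the two target words are merged.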
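The second-threshold merge can be sketched on per-character timestamps, comparing the last character of each word with the first character of the next. The function name and every timestamp other than the 0.5 s gap between "故宫" and "博物馆" are assumptions made for illustration.

```python
def merge_by_gap(tokens, char_times, second_threshold):
    """Merge adjacent words whose junction gap is below second_threshold.

    char_times[i] is the input time of the i-th character of the text;
    characters entered in one string input operation share a timestamp.
    """
    if not tokens:
        return []
    merged = [tokens[0]]
    pos = len(tokens[0])  # index of the first character of the next word
    for tok in tokens[1:]:
        gap = char_times[pos] - char_times[pos - 1]
        if gap < second_threshold:
            merged[-1] += tok  # cancel the split
        else:
            merged.append(tok)
        pos += len(tok)
    return merged


# Hypothetical input operations: (target string, input time in seconds).
ops = [("我", 0.0), ("今天", 2.0), ("参观", 4.0),
       ("了", 6.0), ("故宫", 8.0), ("博物馆", 8.5)]
char_times = [t for s, t in ops for _ in s]
tokens = ["我", "今天", "参观", "了", "故宫", "博物馆"]
print("/".join(merge_by_gap(tokens, char_times, 1.0)))
# prints: 我/今天/参观/了/故宫博物馆
```

A gap of 0 s, as when a segmenter cuts through a single target string, is also below any positive threshold, so this rule covers that case as well.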
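In another embodiment, the input speeds of different users also need to be considered. For example, a user who types on a computer frequently types noticeably faster than an elderly user who seldom uses a computer. In this case, if the same time threshold is used for all users when segmenting according to the delay input duration, segmentation may be incomplete or inaccurate.
Therefore, in a specific embodiment, the word segmentation method further includes: detecting a text input speed in the text input component, and determining the first threshold according to the text input speed. That is, while the user enters text through the text input component, the user's text input speed (for example, the average number of characters entered per unit time) is also detected, and the specific value of the first threshold is then determined according to that speed. For example, the system presets a correspondence between text input speeds and values of the first threshold. After text input through the text input component is detected, the first threshold corresponding to the detected text input speed is looked up in the preset correspondence and used as the first threshold in the segmentation process.
In another embodiment, the text input speed may be not only the speed detected while text is entered through the text input component, but also a speed determined from the user's historical string input operations. For example, historical data on the text input speed of the user's string input operations or text input operations is obtained, the user's text input speed is determined from the historical data, and the corresponding first threshold is then determined from that speed.
It should be noted that, in this embodiment, the historical data may be the historical data of text input operations associated with the currently logged-in account (for example, an account logged in to the input method or to the word segmentation application), or the historical data of text input operations performed on the current terminal.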
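The preset correspondence between input speed and first threshold can be sketched as a simple lookup table. All numeric values below are invented for illustration; the application does not give concrete speed ranges or thresholds beyond the examples above.

```python
import bisect

# Hypothetical correspondence: a speed of at least SPEED_LOWER_BOUNDS[i]
# characters per minute maps to FIRST_THRESHOLDS[i] seconds.
SPEED_LOWER_BOUNDS = [0, 30, 60, 120]
FIRST_THRESHOLDS = [3.0, 2.4, 1.8, 1.2]


def typing_speed(char_count, elapsed_seconds):
    """Average number of characters entered per minute."""
    return 60.0 * char_count / elapsed_seconds


def first_threshold_for(speed):
    """Look up the first threshold for a detected typing speed."""
    idx = bisect.bisect_right(SPEED_LOWER_BOUNDS, speed) - 1
    return FIRST_THRESHOLDS[max(idx, 0)]


speed = typing_speed(20, 30)  # 20 characters in 30 seconds
print(speed, first_threshold_for(speed))
```

Faster typists get a smaller threshold, so that their shorter pauses still produce splits. The speed passed in could equally be derived from historical data for the logged-in account.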
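In this embodiment, if the text input component is displayed in the conversation page of a chat window, the user clicks the send button after entering the text to be sent, and the input text is segmented at that point. That is, while the user is typing, the text input component only detects the user's text input operations, the strings or text they enter, and the corresponding timestamps; the input text is segmented only when the user clicks the send button. In other application scenarios, the user can also trigger segmentation through other operations, such as submitting, importing, or saving the input text.
In a specific embodiment, while the user performs text input operations through the text input component, the target string and timestamp of each string input operation are detected; after the user completes all text input, the target strings and their timestamps are sent to the word segmentation module for processing.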
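The record-then-segment flow described here can be sketched as a small recorder object. The class and method names are hypothetical, and the segmenter is passed in as a plain callable so the sketch stays independent of any particular segmentation strategy.

```python
import time


class TimedInputRecorder:
    """Collects (target string, timestamp) pairs until the text is submitted."""

    def __init__(self, segmenter, clock=time.monotonic):
        self._segmenter = segmenter  # callable taking a list of (text, time)
        self._clock = clock          # injectable clock, useful for testing
        self._ops = []

    def on_commit(self, text):
        """Call once per string input operation (e.g. a candidate selection)."""
        self._ops.append((text, self._clock()))

    def on_submit(self):
        """Call when the user clicks send; segmentation happens only now."""
        ops, self._ops = self._ops, []
        return self._segmenter(ops)


# Usage with a fake clock and a pass-through "segmenter".
fake_times = iter([0.0, 0.2, 3.0])
recorder = TimedInputRecorder(lambda ops: ops, clock=lambda: next(fake_times))
for s in ["我", "要", "吃"]:
    recorder.on_commit(s)
print(recorder.on_submit())
```

Deferring segmentation to submission means nothing is computed while the user is still typing, which matches the chat-window scenario.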
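In addition, to improve the accuracy of segmenting text entered by a user, this embodiment further provides a word segmentation apparatus.
Specifically, as shown in FIG. 5, the word segmentation apparatus includes a text input operation detecting module 102, an input text display module 104, a delay input duration calculation module 106, and a word segmentation module 108, where:
the text input operation detecting module 102 is configured to detect a text input operation in a text input component, the text input operation including one or more sequential string input operations;
the input text display module 104 is configured to concatenate, in order, the one or more target strings entered by the one or more string input operations, and to display the resulting input text in the text input component, where each target string is the string entered by one string input operation;
the delay input duration calculation module 106 is configured to obtain a delay input duration of each target string, the delay input duration being the interval between the string input operation corresponding to the target string and the string input operation corresponding to an adjacent target string;
the word segmentation module 108 is configured to segment the input text using the delay input duration of each target string as a word segmentation condition.
In one embodiment, the word segmentation module 108 is further configured to segment the input text through a word segmentation component to obtain at least one target word, the target word including a plurality of target strings; obtain, within the target word, the end position of each target string whose delay input duration is greater than or equal to the first threshold; and segment the target word according to the end position.
In one embodiment, the word segmentation module 108 is further configured to obtain a first target string and a second target string at the junction between two adjacent target words in the input text, where the first target string and the second target string belong to the two target words respectively; and, if the interval between the string input operations corresponding to the first target string and the second target string is less than a predetermined second threshold, merge the adjacent target words.
In one embodiment, the word segmentation module 108 is further configured to obtain, in the input text, the end position of each target string whose delay input duration is greater than or equal to the first threshold; split the input text into at least one text segment according to the end position; and segment each text segment separately.
In one embodiment, the delay input duration calculation module 106 is further configured to obtain a first timestamp of the string input operation corresponding to each target string and a second timestamp of the string input operation corresponding to its adjacent target string, and to determine the delay input duration of the target string according to the first timestamp and the second timestamp.
In one embodiment, the word segmentation module 108 is further configured to obtain the positions of the separators contained in the input text, and split the input text into at least one text segment according to those positions; and segment each text segment according to the delay input durations of the target strings in that text segment.
In one embodiment, as shown in FIG. 5, the apparatus further includes a threshold determination module 110, configured to detect a text input speed in the text input component and determine the first threshold according to the text input speed, where the text input speed is the average number of characters entered per unit time.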
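Implementing the embodiments of the present application has the following beneficial effects:
With the above word segmentation method and apparatus, when the user enters text through the text input component, the input time of each character or string is recorded and the interval between two adjacent characters or strings is calculated. When that interval is greater than a preset value, a split is made between the two adjacent characters or strings during segmentation. That is, segmentation of the input text takes into account the time at which the user entered each character or string: if the interval between two characters or strings is long, they are considered not to be semantically connected and are separated. The embodiments of the present application thus take the user's actual input behavior into account during segmentation, which improves segmentation accuracy.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by a software program, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer readable storage medium or transferred from one computer readable storage medium to another. For example, the computer instructions may be transferred from one website, computer, server, or data center to another website, computer, server, or data center by wired means (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or wireless means (for example, infrared, radio, or microwave). The computer readable storage medium may be any available medium that a computer can access, or a data storage device, such as a server or data center, that integrates one or more available media. The available medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).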
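In one embodiment, FIG. 6 shows a terminal with a von Neumann architecture computer system that runs the above word segmentation method. The computer system may be a smartphone, a tablet computer, a palmtop computer, a notebook computer, a personal computer, a head-mounted device, a wearable device, or a smart home device such as a smart speaker. Specifically, it may include an external input interface 1001, a processor 1002, a memory 1003, and an output interface 1004 connected through a system bus. The external input interface 1001 may include at least a network interface 10012. The memory 1003 may include an external memory 10032 (for example, a hard disk, an optical disk, or a floppy disk) and an internal memory 10034. The output interface 1004 may connect at least to devices such as a display 10042.
In this embodiment, the method runs as a computer program whose program file is stored in the external memory 10032 of the aforementioned von Neumann architecture computer system. At run time the program is loaded into the internal memory 10034, compiled into machine code, and passed to the processor 1002 for execution, so that a logical text input operation detecting module 102, input text display module 104, delay input duration calculation module 106, word segmentation module 108, and threshold determination module 110 are formed in the computer system. During execution of the word segmentation method, input parameters are received through the external input interface 1001, buffered in the memory 1003, and then passed to the processor 1002 for processing. The resulting data is either cached in the memory 1003 for subsequent processing or passed to the output interface 1004 for output.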
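Specifically, the processor 1002 is configured to perform the following operations:
detecting a text input operation in a text input component, the text input operation including one or more sequential string input operations;
concatenating, in order, the one or more target strings entered by the one or more string input operations, and displaying the resulting input text in the text input component, where each target string is the string entered by one string input operation;
obtaining a delay input duration of each target string, the delay input duration being the interval between the string input operation corresponding to the target string and the string input operation corresponding to an adjacent target string;
segmenting the input text using the delay input duration of each target string as a word segmentation condition.
In one embodiment, the processor 1002 is further configured to segment the input text through a word segmentation component to obtain at least one target word, the target word including a plurality of target strings; obtain, within the target word, the end position of each target string whose delay input duration is greater than or equal to the first threshold; and segment the target word according to the end position.
In one embodiment, the processor 1002 is further configured to obtain a first target string and a second target string at the junction between two adjacent target words in the input text, where the first target string and the second target string belong to the two target words respectively; and, if the interval between the string input operations corresponding to the first target string and the second target string is less than a preset second threshold, merge the two adjacent target words.
In one embodiment, the processor 1002 is further configured to obtain, in the input text, the end position of each target string whose delay input duration is greater than or equal to the first threshold; split the input text into at least one text segment according to the end position; and segment each text segment separately.
In one embodiment, the processor 1002 is further configured to obtain a first timestamp of the string input operation corresponding to the target string and a second timestamp of the string input operation corresponding to its adjacent target string, and to determine the delay input duration of the target string according to the first timestamp and the second timestamp.
In one embodiment, the processor 1002 is further configured to obtain the positions of the separators contained in the input text, and split the input text into at least one text segment according to those positions; and segment each text segment according to the delay input durations of the target strings in that text segment.
In one embodiment, the processor 1002 is further configured to detect a text input speed in the text input component and determine the first threshold according to the text input speed, where the text input speed is the average number of characters entered per unit time.
The foregoing discloses only some embodiments of the present application and certainly cannot be used to limit the scope of its claims. Equivalent changes made in accordance with the claims of the present application therefore remain within the scope covered by the present application.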

Claims (15)

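  1. A word segmentation method, applied to a computer device, comprising:
    detecting a text input operation in a text input component, the text input operation comprising one or more sequential string input operations;
    concatenating, in order, the one or more target strings entered by the one or more string input operations, and displaying the resulting input text in the text input component, wherein each target string is the string entered by one string input operation;
    obtaining a delay input duration of each target string, the delay input duration being the interval between the string input operation corresponding to the target string and the string input operation corresponding to an adjacent target string;
    segmenting the input text using the delay input duration of each target string as a word segmentation condition.
  2. The word segmentation method according to claim 1, wherein segmenting the input text using the delay input duration of each target string as a word segmentation condition is:
    segmenting the input text through a word segmentation component to obtain at least one target word, the target word comprising a plurality of target strings;
    obtaining, within the target word, the end position of a target string whose delay input duration is greater than or equal to a first threshold;
    segmenting the target word according to the end position.
  3. The word segmentation method according to claim 2, further comprising, after segmenting the input text through the word segmentation component to obtain at least one target word:
    obtaining a first target string and a second target string at the junction between two adjacent target words in the input text, wherein the first target string and the second target string belong to the two target words respectively;
    if the interval between the string input operations corresponding to the first target string and the second target string is less than a preset second threshold, merging the two adjacent target words.
  4. The word segmentation method according to claim 1, wherein segmenting the input text using the delay input duration of each target string as a word segmentation condition is:
    obtaining, in the input text, the end position of a target string whose delay input duration is greater than or equal to a first threshold;
    splitting the input text into at least one text segment according to the end position;
    segmenting each text segment separately.
  5. The word segmentation method according to claim 1, wherein obtaining the delay input duration of each target string is:
    obtaining a first timestamp of the string input operation corresponding to the target string and a second timestamp of the string input operation corresponding to its adjacent target string;
    determining the delay input duration of the target string according to the first timestamp and the second timestamp.
  6. The word segmentation method according to claim 1, wherein segmenting the input text using the delay input duration of each target string as a word segmentation condition comprises:
    obtaining the positions of the separators contained in the input text, and splitting the input text into at least one text segment according to those positions;
    segmenting each text segment according to the delay input durations of the target strings in that text segment.
  7. The word segmentation method according to claim 2 or 4, further comprising:
    detecting a text input speed in the text input component, and determining the first threshold according to the text input speed, wherein the text input speed is the average number of characters entered per unit time.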
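  8. A word segmentation apparatus, comprising:
    a processor;
    a memory connected to the processor, the memory storing machine-readable instruction modules, the machine-readable instruction modules comprising:
    a text input operation detection module, configured to detect a text input operation in a text input component, the text input operation comprising one or more sequential character string input operations;
    an input text display module, configured to sequentially concatenate the one or more target character strings input by the one or more character string input operations, and to display the concatenated input text in the text input component, wherein each target character string is the character string input by one character string input operation;
    a delay input duration calculation module, configured to obtain a delay input duration of each target character string, the delay input duration being the interval between the character string input operation corresponding to the target character string and the character string input operation corresponding to its adjacent target character string;
    a word segmentation module, configured to perform word segmentation on the input text using the delay input duration of each target character string as a word segmentation condition.
  9. The word segmentation apparatus according to claim 8, wherein the word segmentation module is further configured to: segment the input text by means of a word segmentation component to obtain at least one target word, the target word containing a plurality of target character strings; obtain, within the target word, the end position of each target character string whose corresponding delay input duration is greater than or equal to a first threshold; and segment the target word according to the end position.
  10. The word segmentation apparatus according to claim 9, wherein the word segmentation module is further configured to: obtain a first target character string and a second target character string at the junction between two adjacent target words in the input text, wherein the first target character string and the second target character string belong respectively to the two target words; and, if the interval between the character string input operations corresponding to the first target character string and the second target character string is less than a predetermined second threshold, merge the two adjacent target words.
  11. The word segmentation apparatus according to claim 8, wherein the word segmentation module is further configured to: obtain, in the input text, the end position of each target character string whose corresponding delay input duration is greater than or equal to a first threshold; split the input text into at least one text segment according to the end position; and perform word segmentation on each text segment separately.
  12. The word segmentation apparatus according to claim 8, wherein the delay input duration calculation module is further configured to: obtain a first timestamp of the character string input operation corresponding to each target character string, and a second timestamp of the character string input operation corresponding to its adjacent target character string; and determine the delay input duration of the target character string according to the first timestamp and the second timestamp.
  13. The word segmentation apparatus according to claim 8, wherein the word segmentation module is further configured to: obtain the separation positions of the delimiters contained in the input text, and split the input text into at least one text segment according to the separation positions; and, for each text segment, perform word segmentation according to the delay input durations of the target character strings in that text segment.
  14. The word segmentation apparatus according to claim 9 or 11, further comprising a threshold determination module, configured to detect the text input speed in the text input component and determine the first threshold according to the text input speed, wherein the text input speed is the average number of characters input per unit time.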
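  15. A non-volatile computer-readable storage medium, wherein the storage medium stores machine-readable instructions, and the machine-readable instructions are executable by a processor to perform the method according to claims 1 to 7.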
PCT/CN2018/081536 2017-04-07 2018-04-02 Word segmentation method, apparatus, and storage medium WO2018184510A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201710224889.5 2017-04-07
CN201710224889.5A CN108304367B (zh) 2017-04-07 2017-04-07 Word segmentation method and apparatus

Publications (1)

Publication Number Publication Date
WO2018184510A1 (zh) 2018-10-11

Family

ID=62872261

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/081536 WO2018184510A1 (zh) 2017-04-07 2018-04-02 Word segmentation method, apparatus, and storage medium

Country Status (2)

Country Link
CN (1) CN108304367B (zh)
WO (1) WO2018184510A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
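CN110334338A (zh) * 2019-04-29 2019-10-15 北京小米移动软件有限公司 Word segmentation method, apparatus, and device
CN113890756A (zh) * 2021-09-26 2022-01-04 网易(杭州)网络有限公司 Method, apparatus, medium, and computing device for detecting the degree of disorder of a user account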

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
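CN101071422A (zh) * 2006-06-15 2007-11-14 腾讯科技(深圳)有限公司 Music file search processing system and method
CN101119334A (zh) * 2007-09-21 2008-02-06 腾讯科技(深圳)有限公司 Method, system, and device for acquiring new words
CN101122900A (zh) * 2007-09-25 2008-02-13 中兴通讯股份有限公司 Word segmentation system and method
CN101382844A (zh) * 2008-10-24 2009-03-11 上海埃帕信息科技有限公司 Method for word segmentation by input interval
CN104951428A (zh) * 2014-03-26 2015-09-30 阿里巴巴集团控股有限公司 User intention recognition method and apparatus
CN105335415A (zh) * 2014-08-04 2016-02-17 北京搜狗科技发展有限公司 Search method based on input prediction, and input method system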

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
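CN101464855B (zh) * 2009-01-13 2010-08-25 吴长林 Word segmentation method for character strings containing Chinese, and method for retrieving words in a character string
CN101702100A (zh) * 2009-10-28 2010-05-05 卓望数码技术(深圳)有限公司 Text input method and text input apparatus
CN103092826B (zh) * 2012-12-31 2018-06-05 百度在线网络技术(北京)有限公司 Method and device for constructing input entries according to a user's input information
CN104462051B (zh) * 2013-09-12 2018-10-02 腾讯科技(深圳)有限公司 Word segmentation method and apparatus
CN103678684B (zh) * 2013-12-25 2017-05-31 沈阳美行科技有限公司 Chinese word segmentation method based on navigation information retrieval
CN105138514B (zh) * 2015-08-24 2018-11-09 昆明理工大学 Dictionary-based forward maximum-matching Chinese word segmentation method that adds one character at a time
US10402734B2 (en) * 2015-08-26 2019-09-03 Google Llc Temporal based word segmentation
CN105446955A (zh) * 2015-11-27 2016-03-30 贺惠新 Adaptive word segmentation method
CN106371624B (zh) * 2016-09-23 2019-03-19 百度在线网络技术(北京)有限公司 Method, apparatus, and input device for providing input candidates
CN106484677B (zh) * 2016-09-30 2019-02-12 北京林业大学 Fast Chinese word segmentation system and method based on minimum information content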

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
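CN110334338A (zh) * 2019-04-29 2019-10-15 北京小米移动软件有限公司 Word segmentation method, apparatus, and device
CN110334338B (zh) * 2019-04-29 2023-09-19 北京小米移动软件有限公司 Word segmentation method, apparatus, and device
CN113890756A (zh) * 2021-09-26 2022-01-04 网易(杭州)网络有限公司 Method, apparatus, medium, and computing device for detecting the degree of disorder of a user account
CN113890756B (zh) * 2021-09-26 2024-01-02 网易(杭州)网络有限公司 Method, apparatus, medium, and computing device for detecting the degree of disorder of a user account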

Also Published As

Publication number Publication date
CN108304367B (zh) 2021-11-26
CN108304367A (zh) 2018-07-20

Similar Documents

Publication Publication Date Title
WO2018205389A1 Speech recognition method, system, electronic device, and medium
US20210312139A1 (en) Method and apparatus of generating semantic feature, method and apparatus of training model, electronic device, and storage medium
US10789431B2 (en) Method and system of translating a source sentence in a first language into a target sentence in a second language
WO2018157789A1 Speech recognition method, computer, storage medium, and electronic device
US11308278B2 (en) Predicting style breaches within textual content
WO2016127677A1 Address structuring method and apparatus
US9818080B2 (en) Categorizing a use scenario of a product
US20170364495A1 (en) Propagation of changes in master content to variant content
US11436282B2 (en) Methods, devices and media for providing search suggestions
WO2018201600A1 Information mining method, system, electronic device, and readable storage medium
US20190079934A1 (en) Snippet Generation for Content Search on Online Social Networks
WO2018010579A1 Word segmentation method, apparatus, and device for character strings
WO2014117553A1 (en) Method and system of adding punctuation and establishing language model
WO2022135474A1 Information recommendation method, apparatus, and electronic device
US9811517B2 (en) Method and system of adding punctuation and establishing language model using a punctuation weighting applied to chinese speech recognized text
US20170344625A1 (en) Obtaining of candidates for a relationship type and its label
CN114595686B (zh) Knowledge extraction method, and training method and apparatus for a knowledge extraction model
US20150121200A1 (en) Text processing apparatus, text processing method, and computer program product
WO2014036827A1 Text correction method and user equipment
US20230214423A1 (en) Video generation
CN111860000A (zh) Text translation editing method, apparatus, electronic device, and storage medium
WO2018184510A1 Word segmentation method, apparatus, and storage medium
WO2023040230A1 Data evaluation method, training method, apparatus, electronic device, and storage medium
WO2020052060A1 Method and apparatus for generating a corrected sentence
CN108628911B (zh) Expression prediction for user input

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 18781086; Country of ref document: EP; Kind code of ref document: A1
NENP Non-entry into the national phase. Ref country code: DE
122 EP: PCT application non-entry in European phase. Ref document number: 18781086; Country of ref document: EP; Kind code of ref document: A1