CN108304367B

CN108304367B - Word segmentation method and device

Info

Publication number: CN108304367B
Application number: CN201710224889.5A
Authority: CN
Inventors: 樊林
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Technology Shenzhen Co Ltd
Priority date: 2017-04-07
Filing date: 2017-04-07
Publication date: 2021-11-26
Anticipated expiration: 2037-04-07
Also published as: CN108304367A; WO2018184510A1

Abstract

An embodiment of the invention discloses a word segmentation method comprising the following steps: detecting a text input operation in a text input component, wherein the text input operation comprises one or more character string input operations input in sequence; displaying, in the text input component, an input text formed by sequentially splicing the target character strings input by the character string input operations; acquiring the delay input duration of each target character string, wherein the delay input duration is the interval between the character string input operation corresponding to the target character string and the adjacent character string input operation; and performing word segmentation on the input text by taking the delay input duration of the target character string as a word segmentation condition. An embodiment of the invention also discloses a word segmentation device. The invention can improve word segmentation accuracy.

Description

Word segmentation method and device

Technical Field

The invention relates to the technical field of data processing, in particular to a word segmentation method and a word segmentation device.

Background

Artificial intelligence (AI) simulates the information processes of human consciousness and thinking. Natural language understanding is an important subject in artificial intelligence, especially the understanding of natural language input in Chinese. Unlike English, in which spaces automatically mark word boundaries, Chinese text uses the character as its basic writing unit: there are no spaces between words, and a word may consist of a single character or of several characters, so the number of characters in a word is not fixed. The first step in understanding Chinese input text is therefore word segmentation, that is, correctly dividing the text into words.

Three word segmentation methods are currently in common use: methods based on character string matching, methods based on understanding, and methods based on statistics. A method based on character string matching matches the Chinese character string to be analyzed against the entries of a machine dictionary according to a certain strategy; if a string is found in the dictionary, the match succeeds and a word is identified. A method based on understanding recognizes words by having the computer simulate human comprehension of the sentence, and therefore requires a large amount of linguistic knowledge and information. A method based on statistics counts how often adjacent characters co-occur in a corpus, computes their mutual information and the probability that two Chinese characters occur adjacently, and uses this to judge whether the two characters should be separated.
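As an illustration of the string-matching approach described above, the following minimal sketch uses forward maximum matching, which is one common matching strategy; the text does not name a specific strategy, so this choice is an assumption. The function name `forward_max_match`, the `max_len` parameter, and the toy dictionary are hypothetical.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary entry at each position."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the dictionary matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"我", "想", "吃", "火锅"}
print(forward_max_match("我想吃火锅", toy_dictionary))
```

Because the result depends only on the dictionary and the matching strategy, the same text is always cut the same way, regardless of how the user entered it.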

Each of these methods segments words according to a fixed scheme. Some texts, however, admit several segmentations, each of which may carry a different meaning, so the segmentation result is not accurate enough. Moreover, a user's input reflects personal habits, such as typing habits or mistyped characters. Traditional word segmentation methods do not consider the user's actual situation while entering the text and ignore the user's real intent, which reduces segmentation accuracy.

In summary, the existing word segmentation method for segmenting the Chinese text input by the user has the problem of insufficient word segmentation accuracy.

Disclosure of Invention

Accordingly, a word segmentation method is provided to solve the technical problem that conventional word segmentation methods are insufficiently accurate when segmenting Chinese text input by a user.

A method of word segmentation, comprising:

detecting a text input operation input in a text input component, wherein the text input operation comprises one or more than one character string input operation input in sequence;

displaying an input text formed by sequentially splicing target character strings input by the character string input operation in the text input component;

acquiring the delay input duration of the target character string, wherein the delay input duration is the interval duration between the character string input operation corresponding to the target character string and the adjacent character string input operation;

and performing word segmentation on the input text by taking the delayed input duration of the target character string as a word segmentation condition.

Optionally, in one embodiment, the segmenting the input text by using the delay input duration of the target character string as a segmentation condition includes: performing word segmentation on the input text through a word segmentation component to obtain at least one target word;

acquiring the end position of a target character string of which the corresponding delay input duration is greater than or equal to a first threshold in the target word; and segmenting the target word according to the end position.

Optionally, in one embodiment, after the segmenting the input text by the segmentation component to obtain at least one target word, the method further includes: and acquiring a target character string at a joint between adjacent target words in the input text, and merging the adjacent target words.
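The post-processing described in this embodiment can be sketched as follows, under stated assumptions: the output of the word segmentation component is a list of words that concatenate to the input text, each input operation is a `(target_string, delay_to_next)` pair, and "a target character string at a joint between adjacent target words" is read as a single input string that straddles the junction of two words. The function names and the sample data are hypothetical.

```python
from itertools import accumulate


def op_end_positions(ops):
    """Offset in the spliced text at which each input operation's string ends."""
    return list(accumulate(len(string) for string, _ in ops))


def merge_straddled_words(words, ops):
    """Merge adjacent words whose junction falls inside one input string."""
    boundaries = set(op_end_positions(ops))
    merged, pos = [], 0
    for word in words:
        # pos is the junction between the previous word and this one.
        if merged and pos not in boundaries:
            merged[-1] += word  # junction lies inside a single string: join
        else:
            merged.append(word)
        pos += len(word)
    return merged


def split_words_at_pauses(words, ops, threshold):
    """Cut each word at the end of any input string whose delay >= threshold."""
    cuts = {end for end, (_, delay) in zip(op_end_positions(ops), ops)
            if delay >= threshold}
    result, pos = [], 0
    for word in words:
        piece = ""
        for ch in word:
            piece += ch
            pos += 1
            if pos in cuts:  # a long pause followed this offset
                result.append(piece)
                piece = ""
        if piece:
            result.append(piece)
    return result


# Placeholder strings; the last operation has no successor, so its delay is 0.0 here.
print(merge_straddled_words(["a", "b", "cd"], [("ab", 0.2), ("cd", 0.0)]))
print(split_words_at_pauses(["abcd"], [("ab", 2.0), ("cd", 0.0)], threshold=1.5))
```

In the first call the component split "ab" although the user entered it in one operation, so the two pieces are merged; in the second, the component kept "abcd" together although the user paused for 2.0 seconds after "ab", so the word is cut there.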

Optionally, in one embodiment, the segmenting the input text by using the delay input duration of the target character string as a segmentation condition includes: acquiring the end position of a target character string in the input text, wherein the corresponding delay input duration is greater than or equal to a first threshold; segmenting the input text into at least one text segment according to the end position; and performing word segmentation processing on the text segments respectively.

Optionally, in one embodiment, acquiring the delay input duration of the target character string includes: acquiring a first time stamp of the character string input operation corresponding to the target character string and a second time stamp of the corresponding adjacent character string input operation; and determining the delay input duration of the target character string according to the first time stamp and the second time stamp.

Optionally, in one embodiment, the segmenting the input text by using the delay input duration of the target character string as a segmentation condition includes: acquiring the separation positions of separators contained in the input text, and cutting the input text into at least one text segment according to the separation positions; and segmenting the text segments by respectively referring to the delayed input time lengths of the target character strings in the text segments.
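A minimal sketch of this separator-based embodiment follows. It assumes, for simplicity, that each separator is entered as its own input operation and that each operation is a `(target_string, delay_to_next)` pair; the separator set, the function name, and the sample data are hypothetical.

```python
# Hypothetical separator set; the text does not enumerate the separators.
SEPARATORS = {"，", "。", "！", "？", ",", ".", "!", "?", " "}


def split_by_separators_then_pauses(ops, threshold):
    """Cut at separators first, then at long pauses inside each segment."""
    segments, current = [], ""
    for string, delay in ops:
        if string in SEPARATORS:
            # A separator closes the current segment and is not kept.
            if current:
                segments.append(current)
            current = ""
            continue
        current += string
        if delay >= threshold:  # long pause after this string
            segments.append(current)
            current = ""
    if current:
        segments.append(current)
    return segments


# Hypothetical input: "你好" (hello), "，", "我" (I), "明天" (tomorrow), "早上" (morning).
ops = [("你好", 0.3), ("，", 0.2), ("我", 1.8), ("明天", 0.4), ("早上", 0.0)]
print(split_by_separators_then_pauses(ops, threshold=1.5))
```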

Optionally, in one embodiment, the method further includes: detecting a text input speed in the text input component, and determining the first threshold value according to the text input speed.
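The text states only that the first threshold is determined according to the text input speed and gives no formula. The sketch below is therefore purely an assumption: it scales a base threshold inversely with the measured speed in characters per second and clamps the result. The function name and all parameter values are hypothetical.

```python
def first_threshold(timestamps, char_counts, base_threshold=1.5,
                    base_speed=2.0, lower=0.5, upper=3.0):
    """Scale a base threshold inversely with typing speed (characters per second)."""
    if len(timestamps) < 2:
        return base_threshold  # not enough data to estimate a speed
    elapsed = timestamps[-1] - timestamps[0]
    if elapsed <= 0:
        return base_threshold
    speed = sum(char_counts) / elapsed
    # Faster typists get a shorter threshold, slower typists a longer one.
    scaled = base_threshold * base_speed / speed
    return max(lower, min(upper, scaled))


print(first_threshold([0.0, 1.0, 2.0, 4.0], [1, 2, 2, 3]))  # 8 chars in 4 s
print(first_threshold([0.0, 0.5, 1.0, 2.0], [1, 2, 2, 3]))  # 8 chars in 2 s
```

With this scaling, a pause of a given length counts as a segmentation signal for a fast typist sooner than for a slow one.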

In addition, to solve the same technical problem that conventional word segmentation methods are insufficiently accurate when segmenting Chinese text input by a user, a word segmentation device is also provided.

A word segmentation device comprising:

the text input operation detection module is used for detecting text input operations input in the text input assembly, and the text input operations comprise one or more than one character string input operations input in sequence;

the input text display module is used for displaying an input text formed by splicing target character strings input by the character string input operation in sequence in the text input assembly;

the delay input duration calculation module is used for acquiring the delay input duration of the target character string, wherein the delay input duration is the interval duration between the character string input operation corresponding to the target character string and the adjacent character string input operation;

and the word segmentation module is used for performing word segmentation on the input text by taking the delayed input duration of the target character string as a word segmentation condition.

Optionally, in one embodiment, the word segmentation module is further configured to segment the input text by a word segmentation component to obtain at least one target word; acquiring the end position of a target character string of which the corresponding delay input duration is greater than or equal to a first threshold in the target word; and segmenting the target word according to the end position.

Optionally, in one embodiment, the word segmentation module is further configured to obtain a target character string at a junction between adjacent target words in the input text, and merge the adjacent target words.

Optionally, in one embodiment, the word segmentation module is further configured to obtain an end position of a target character string in the input text, where a corresponding delay input duration is greater than or equal to a first threshold; segmenting the input text into at least one text segment according to the end position; and performing word segmentation processing on the text segments respectively.

Optionally, in one embodiment, the delay input duration calculation module is further configured to obtain a first time stamp of a character string input operation corresponding to the target character string and a second time stamp of a corresponding adjacent character string input operation; and determining the delayed input time length of the target character string according to the first time stamp and the second time stamp.

Optionally, in one embodiment, the word segmentation module is further configured to obtain a separation position of a separator included in the input text, and segment the input text into at least one text segment according to the separation position; and segmenting the text segments by respectively referring to the delayed input time lengths of the target character strings in the text segments.

Optionally, in one embodiment, the word segmentation apparatus further includes a threshold determination module, configured to detect a text input speed in the text input component, and determine the first threshold according to the text input speed.

The embodiment of the invention has the following beneficial effects:

With the word segmentation method and device described above, when a user inputs text through the text input component, the input time of each character or character string is recorded and the interval between two adjacent characters or character strings is calculated. When that interval is greater than a preset value, the text is segmented between the two adjacent characters or character strings during word segmentation. In other words, word segmentation of the input text takes into account the time at which the user input each character or character string: if the interval between two characters or character strings is long, they are considered not to be semantically connected and are separated. The embodiment of the invention thus takes the user's actual situation during input into account and improves word segmentation accuracy.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

Wherein:

FIG. 1 is a flow diagram illustrating a method for word segmentation, according to one embodiment;

FIG. 2 is a diagram illustrating input via a text entry component in one embodiment;

FIG. 3 is a diagram illustrating input via a text entry component in one embodiment;

FIG. 4 is a diagram illustrating input via a text entry component in one embodiment;

FIG. 5 is a schematic diagram of an embodiment of a word segmentation apparatus;

FIG. 6 is a block diagram of a computer device for executing the word segmentation method in one embodiment.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

To solve the technical problem that conventional word segmentation methods are insufficiently accurate when segmenting Chinese text input by a user, this embodiment provides a word segmentation method. The method can be implemented by a computer program running on a computer system based on the von Neumann architecture. The computer program can be a word segmentation application on a terminal or a server, or a text segmentation program integrated into an application that provides a text input component for receiving character strings or text input by the user. The computer system running the program may be a server or a terminal, such as a smart phone, a tablet computer, or a personal computer.

Specifically, as shown in FIG. 1, the word segmentation method includes the following steps S102 to S108:

Step S102: detecting a text input operation in a text input component, wherein the text input operation comprises one or more character string input operations input in sequence.

A text input component is a component through which a user can enter one or more lines of text, for example a text input box, a document input interface, or a search input box. For example, in the chat interface of the instant messaging software shown in FIG. 2, the user can enter text in the displayed text input box and click the "send" button to send it once input is complete. In this embodiment, the text input component is not limited to the text input box in the chat interface shown in FIG. 2; it may also be another text input box, such as a search input box, or the text editing window of text editing software such as Word.

In this embodiment, the display scene of the text input component is not limited, and the text input component may be a search input box of a search interface, a text input box of a dialog window, or a text editing window of a document editing page, as long as a user can input text through the displayed text input component.

When a user inputs characters through the text input component, the component detects the user's input operation and obtains the specific content that was input. The user's operation of inputting that content through the text input component is the text input operation.

It should be noted that, in this embodiment, a single input action by the user is one character string input operation, and a text input operation includes at least one character string input operation. One character string input operation may input a single character or several characters. For example, an input method can input a multi-character word at one time: in the application scenario shown in FIG. 2, the user types "zhuanli" in a pinyin input method to input the character string "patent" (专利) in one operation.

At present, many users input text through an installed input method, most commonly a pinyin input method, for example the pinyin input method of the Sogou input method installed on a terminal. In the application scenarios shown in FIGS. 2-4, the user inputs text through an installed pinyin input method: after the user types the Latin letters of the pinyin for the target character string, a candidate word display box for the typed pinyin is displayed, and the user selects the character or character string to be input from that box.

Furthermore, when the input method displays candidate words, it can automatically display words that match the user's input, which reduces the number of operations the user needs in order to select the desired characters from the candidates. Therefore, most users input each word through a single character string input operation.

In a specific embodiment, as shown in FIG. 3, when the user types "zhuan", the candidate word display box shown in FIG. 3 is displayed, and the user inputs "special" (专) through the selection operation corresponding to the number "3". When the user then types "li", the candidate word display box shown in FIG. 4 is displayed, and the user must perform one page-turning operation and the selection operation corresponding to the number "2" to input "benefit" (利), which completes the input of the character string "patent" (专利).

In the application scenario shown in FIG. 2, by contrast, the user types "zhuanli" and inputs the character string "patent" in one operation: the selection operation corresponding to the number "1" selects "patent" in the displayed candidate word display box and completes the input.

Therefore, in this embodiment, the target character string input by one character string input operation may be a single character or a character string composed of several characters. That is, one character string input operation corresponds to the character string that the user inputs into the text input component at one time. When that character string contains several characters, they are all entered into the text input component together, so the input time of every character in the string is the same.

Generally, the text that a user inputs through a text input component consists of several character strings input in sequence, and the input time of a preceding character string is earlier than that of a following character string. That is, the text input operation includes at least one character string input operation, input in sequence.

Step S104: displaying, in the text input component, an input text formed by sequentially splicing the target character strings input by the character string input operations.

In the present embodiment, each character string input operation corresponds to an input character string, that is, a target character string corresponding to the character string input operation. Accordingly, the input time of the character string input operation is the input time of the target character string corresponding to the character string input operation, so that the input times of all the target character strings corresponding to the text input operation input by the user can be determined, and any two target character strings are different in input time.

From each detected character string input operation, the corresponding target character string and its input time (that is, the input time of the character string input operation) are acquired. The input order of the target character strings is then determined from their input times, and the target character strings are spliced in that order into the input text, which is displayed in the text input component.
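Steps S102 and S104 can be sketched as follows. The sketch assumes that each character string input operation is recorded as a target string plus a time stamp in seconds, and that strings are appended in input order (cursor movement and deletion are ignored). The names `StringInputOp` and `splice_input_text` and the sample time stamps are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class StringInputOp:
    text: str         # target string committed by one input operation
    timestamp: float  # input time of the operation, in seconds


def splice_input_text(ops):
    """Order the target strings by input time and concatenate them."""
    ordered = sorted(ops, key=lambda op: op.timestamp)
    return "".join(op.text for op in ordered)


# Hypothetical operations, deliberately out of order.
ops = [StringInputOp("吃", 2.1), StringInputOp("我", 0.0),
       StringInputOp("火锅", 3.0), StringInputOp("想", 1.2)]
print(splice_input_text(ops))
```

Both characters of "火锅" share the time stamp 3.0 because they were committed by one operation.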

It should be noted that, in this embodiment, the input text displayed in the text input component is spliced by the target character strings corresponding to the character string input operation that has been completed currently, and the user may also continue to input the target character strings by inputting the character string input operation through the text input component.

Step S106: acquiring the delay input duration of the target character string, wherein the delay input duration is the interval between the character string input operation corresponding to the target character string and the adjacent character string input operation.

The delay input duration of the target character string refers to a time difference or interval time between an input time of inputting the target character string and an input time of inputting an adjacent target character string, that is, an interval duration between a character string input operation corresponding to the target character string and an adjacent character string input operation.

It should be noted that, in this embodiment, the adjacent target character string used to determine the delay input duration may be either the previous adjacent target character string (the one input immediately before the target character string) or the next adjacent target character string (the one input immediately after it); this embodiment does not limit the choice. However, one of the two must be selected, and the same choice must be used when calculating the delay input durations of all target character strings.

In a specific embodiment, the delay input duration of a target character string is the interval between its input time and the input time of the next adjacent target character string. In this case, acquiring the delay input duration of the target character string specifically includes: acquiring a first time stamp of the character string input operation corresponding to the target character string and a second time stamp of the next adjacent character string input operation; and determining the delay input duration of the target character string according to the first time stamp and the second time stamp.

The input time of the character string input operation that inputs the target character string is the time stamp of that operation, namely the first time stamp. The input time of the character string input operation that inputs the next adjacent target character string is the time stamp of that operation, namely the second time stamp. The interval between the first time stamp and the second time stamp is the interval between the input time of the target character string and that of the next adjacent target character string, and this interval is the delay input duration of the target character string.

In another specific embodiment, the delay input duration of a target character string is the interval between its input time and the input time of the previous adjacent target character string. In this case, acquiring the delay input duration of the target character string specifically includes: acquiring a third time stamp of the character string input operation corresponding to the target character string and a fourth time stamp of the previous adjacent character string input operation; and determining the delay input duration of the target character string according to the third time stamp and the fourth time stamp.

The input time of the character string input operation that inputs the target character string is the time stamp of that operation, namely the third time stamp. The input time of the character string input operation that inputs the previous adjacent target character string is the time stamp of that operation, namely the fourth time stamp. The interval between the third time stamp and the fourth time stamp is the interval between the input time of the target character string and that of the previous adjacent target character string, and this interval is the delay input duration of the target character string.
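The two conventions described above differ only in which neighbor's time stamp is used. The following sketch computes the delay input durations under either convention from a list of operation time stamps in input order. The function name, the `reference` parameter, and the use of `None` for the string that has no neighbor on the chosen side are assumptions; the text does not say how that boundary case is handled.

```python
def delay_durations(timestamps, reference="next"):
    """Delay input duration of each target string from its operation time stamp.

    reference="next": interval to the following operation (last entry is None).
    reference="previous": interval from the preceding operation (first entry is None).
    """
    if reference not in ("next", "previous"):
        raise ValueError("reference must be 'next' or 'previous'")
    if not timestamps:
        return []
    # Interval between each pair of consecutive operations.
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]
    if reference == "next":
        return gaps + [None]
    return [None] + gaps


stamps = [0.0, 1.5, 2.75, 3.25]  # hypothetical time stamps, in seconds
print(delay_durations(stamps, "next"))      # [1.5, 1.25, 0.5, None]
print(delay_durations(stamps, "previous"))  # [None, 1.5, 1.25, 0.5]
```

The same gaps appear in both results, shifted by one position, which is why the choice must be applied consistently to every target character string.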

Step S108: performing word segmentation on the input text by taking the delay input duration of the target character string as a word segmentation condition.

Generally speaking, when a user types text manually, some multi-character words can be input at one time. This is particularly true for users of a pinyin input method: because the input method automatically matches words to the typed pinyin, the user avoids searching or paging through the displayed candidates for each single character, and typing speed improves.

For example, when a user inputs "I want to eat hot pot", the two-character word "hot pot" (火锅) is generally input at one time through the pinyin "huoguo", which avoids selecting each single character from the candidates. In this case, the user's operation of inputting "hot pot" is one character string input operation, and the time stamps of the characters "fire" (火) and "pot" (锅) are the same.

It should be noted that, even if the user does not input "hot pot" in one operation but inputs the two single characters "fire" and "pot" separately, the user already knows after inputting "fire" that the next character is "pot" and does not need to think or pause, so the interval between the input times of "fire" and "pot" is also small. For example, following a typical user's input habits, the input proceeds as "I … want … to eat … … fire" and then "pot", with almost no pause between "fire" and "pot".

Further, when a user types text manually, the intervals between input characters or character strings differ: some intervals are short, while others are long because the user is thinking or pausing.

For example, when the user inputs "I am taking a plane at seven tomorrow morning to go to celebration", the input character strings are "I", "tomorrow", "morning", "seven", "airplane", "go" and "celebration" respectively. Following a typical user's input habits, the input proceeds as "I … tomorrow … morning … seven … airplane … … go … celebration", and the time interval between "tomorrow" and "morning" is significantly smaller than the time interval between "airplane" and "go".

That is, most users input semantically connected or inseparable characters in one operation or with only a short interval between them, whereas the interval between the input times of characters that are semantically unconnected or need to be separated is generally larger. Accordingly, a large interval between the input times of two characters or character strings indicates that the user paused for a long time, whether out of habit or because of a break in expression or thought; in this case the two characters or character strings are semantically discontinuous or should be separated.

In this embodiment, after the delay input durations of all the target character strings are determined, the input text is segmented with the delay input durations as word segmentation conditions. Specifically, the delay input duration of a target character string is a factor in deciding where words are divided: whether to segment between a target character string and its adjacent target character string depends on that delay input duration.

The following specific embodiment takes as an example the case in which the delay input duration of a target character string is the interval between its input time and the input time of the next adjacent target character string.

The process of segmenting the input text by taking the delayed input duration of the target character string as the segmentation condition specifically comprises the following steps: acquiring the end position of a target character string in the input text, wherein the corresponding delay input duration is greater than or equal to a first threshold; segmenting the input text into at least one text segment according to the end position; and performing word segmentation processing on the text segments respectively.

A delay input duration is determined for every target character string contained in the input text. Since the delay input duration of a target character string represents the interval between its input time and that of the next adjacent target character string, a larger delay input duration means a longer pause between the two, and a lower probability that they are semantically connected or inseparable. Therefore, when the delay input duration of a target character string is greater than or equal to a preset time threshold (the first threshold), the text is segmented between the target character string and the next adjacent target character string, that is, at the end position of the target character string.

It should be noted that, in this embodiment, the input text includes more than one target character string, and the number of the delayed input times of the target character strings meeting the segmentation condition in the process of segmenting the input text according to the delayed input times of the target character strings is also more than one. Therefore, in the case where the delay input duration is greater than or equal to the preset time threshold, after the input text is segmented according to the end of the target character string, the input text may be segmented into a plurality of text segments, each of which contains one or more target character strings.

It should be noted that, in the present embodiment, in the process of inputting by manually typing by the user, if the speed of manually typing by the user is fast or there is no pause in the typing process, the time interval of the input time of the two characters or character strings that should be segmented in the process of word segmentation may also be small, so that no segmentation is performed when the input text is segmented according to the delayed input duration of the character strings. In this case, the segmented text segment may be further participled, for example, by using other participling methods or participling components, so as to improve the accuracy and effectiveness of the participling.

For example, for the input text "I just bought an air ticket for a flight to Beijing at seven o'clock tomorrow morning" entered by the user, the delay input duration of each character or character string is determined as shown in Table 1 (where the delay input duration of a target character string is the interval between its input time and the input time of the next adjacent target character string). The entries are English glosses of the input character strings, listed in the order in which they were input.

TABLE 1

Character or character string	Delay input duration (unit: second)
I	1.5
just	1.3
bought	0.5
(particle)	1.6
one	2.0
tomorrow	1.4
morning	1.5
seven o'clock	2.6
fly to	1.8
Beijing	0.5
(particle)	1.6
air ticket	-

When the preset first threshold is 1.8 seconds, the input text is cut after "one", "seven o'clock" and "fly to", whose delay input durations (2.0, 2.6 and 1.8 seconds) are greater than or equal to the threshold. The result is "I just bought one / tomorrow morning seven o'clock / fly to / Beijing's air ticket", which is obviously not the final word segmentation result. Other word segmentation algorithms are then applied to each of these text segments, and the final word segmentation result obtained is "I / just / bought / (particle) / one / tomorrow / morning / seven o'clock / fly to / Beijing / (particle) / air ticket".
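As a minimal sketch of the threshold-based segmentation described above, the following Python function cuts a sequence of input character strings after every string whose delay input duration is greater than or equal to the first threshold. The function name `split_by_delay` and the English labels standing in for the input character strings are assumptions made for illustration; the delay values and the threshold of 1.8 seconds are taken from Table 1.

```python
from typing import List, Optional, Sequence


def split_by_delay(
    strings: Sequence[str],
    delays: Sequence[Optional[float]],
    first_threshold: float,
) -> List[List[str]]:
    """Group input strings into text segments.

    A segment is closed after every string whose delay input duration
    (pause before the next string) is >= first_threshold. The last
    string has no following string, so its delay is None.
    """
    segments: List[List[str]] = []
    current: List[str] = []
    for string, delay in zip(strings, delays):
        current.append(string)
        if delay is not None and delay >= first_threshold:
            segments.append(current)
            current = []
    if current:  # trailing strings after the last cut
        segments.append(current)
    return segments


# Labels are illustrative glosses; delays are the values of Table 1.
strings = ["I", "just", "bought", "(particle)", "one", "tomorrow", "morning",
           "seven o'clock", "fly to", "Beijing", "(particle)", "air ticket"]
delays = [1.5, 1.3, 0.5, 1.6, 2.0, 1.4, 1.5, 2.6, 1.8, 0.5, 1.6, None]

segments = split_by_delay(strings, delays, first_threshold=1.8)
for segment in segments:
    print(segment)
```

Each resulting text segment would then be passed to a further word segmentation step, since pauses alone do not separate every word.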

It should be noted that, in this embodiment, one or more combinations of known word segmentation methods or user-defined word segmentation methods may be used in the process of performing word segmentation processing on text segments, and the specific word segmentation method used is not limited in this embodiment.

For example, the word segmentation method employed may be one based on character string matching, i.e., a mechanical word segmentation method, in which the Chinese character string to be analyzed is matched against the entries of a machine dictionary according to a certain strategy; if a character string is found in the dictionary, the match succeeds (a word is recognized). According to the scanning direction, string matching methods can be divided into forward matching and reverse matching; according to which length is matched preferentially, they can be further divided into maximum (longest) matching and minimum (shortest) matching.

In another alternative embodiment, an understanding-based word segmentation method may be used, in which a computer simulates a person's understanding of a sentence in order to recognize words; that is, syntactic and semantic analysis is performed while the words are segmented, and syntactic and semantic information is used to resolve ambiguity.

In another alternative embodiment, a statistics-based word segmentation method may also be employed. Formally, a word is a stable combination of characters, so the more often adjacent characters occur together in context, the more likely they are to constitute a word. The frequency or probability with which a character co-occurs with its adjacent character therefore reflects well how credible it is that they form a word. The frequency of each combination of adjacent co-occurring characters in a corpus can be counted to calculate their co-occurrence information: the mutual-occurrence information of two characters is defined, and the adjacent co-occurrence probability of the two Chinese characters is calculated. The mutual-occurrence information reflects how tightly the Chinese characters are bound together. When this tightness is above a certain threshold, the character group is considered likely to constitute a word. This method only needs to count the frequency of character groups in the corpus and does not need a segmentation dictionary, so it is also called a dictionary-free word segmentation method or a statistical word extraction method.
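The forward, maximum (longest) matching variant mentioned above can be sketched as follows. This is an illustrative implementation only: the function name, the toy dictionary and the sample Chinese text are assumptions, and characters not covered by the dictionary are simply emitted as single-character words.

```python
from typing import Iterable, List, Optional


def forward_maximum_matching(
    text: str,
    dictionary: Iterable[str],
    max_word_len: Optional[int] = None,
) -> List[str]:
    """Scan left to right, always taking the longest dictionary entry.

    If no entry matches at the current position, the single character
    at that position is emitted as a word of its own.
    """
    entries = set(dictionary)
    if max_word_len is None:
        max_word_len = max((len(word) for word in entries), default=1)
    words: List[str] = []
    i = 0
    while i < len(text):
        longest = min(max_word_len, len(text) - i)
        for length in range(longest, 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in entries:
                words.append(candidate)
                i += length
                break
    return words


# Toy dictionary (assumed): "fly to", "Beijing", "air ticket".
toy_dictionary = {"飞往", "北京", "机票"}
result = forward_maximum_matching("飞往北京的机票", toy_dictionary)
print(result)
```

Reverse matching would scan from the end of the text instead, and minimum matching would try the shortest candidate first.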
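One common way to quantify how tightly two adjacent characters are bound together is pointwise mutual information. The sketch below assumes that measure (the text above does not fix a specific formula) and computes it from a tiny made-up corpus; the function name and corpus are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def adjacent_pmi(corpus: Iterable[str]) -> Dict[str, float]:
    """Pointwise mutual information of every adjacent character pair.

    PMI(ab) = log( P(ab) / (P(a) * P(b)) ), with probabilities estimated
    from raw character and adjacent-pair counts in the corpus.
    """
    char_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    for sentence in corpus:
        char_counts.update(sentence)
        pair_counts.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
    total_chars = sum(char_counts.values())
    total_pairs = sum(pair_counts.values())
    pmi: Dict[str, float] = {}
    for pair, count in pair_counts.items():
        p_pair = count / total_pairs
        p_first = char_counts[pair[0]] / total_chars
        p_second = char_counts[pair[1]] / total_chars
        pmi[pair] = math.log(p_pair / (p_first * p_second))
    return pmi


# Made-up corpus: the pair for "Beijing" recurs, while the pair that spans
# the boundary between "Beijing" and "air ticket" occurs only once.
toy_corpus = ["北京机票", "北京天气", "飞往北京", "机票价格"]
scores = adjacent_pmi(toy_corpus)
print(scores["北京"], scores["京机"])
```

Pairs whose score exceeds a chosen threshold would be treated as word candidates. With so small a corpus, rare pairs can receive inflated scores, so the threshold and any frequency cut-off are tuning choices.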

When a user inputs text through the text input component, punctuation marks, numbers, English abbreviations, English letters or other non-Chinese characters may be input as required. For example, in "I sent you a message on QQ at 7 o'clock last night, did you see it", "7", "QQ" and "," are all non-Chinese characters.

Generally, when an input text is segmented, a word is composed of several Chinese characters or of several English characters, rather than of a combination of Chinese characters and non-Chinese characters. That is, there is a pause, or a position at which segmentation is performed, between a Chinese character and a non-Chinese character.

Therefore, when punctuation marks, numbers, English abbreviations, English letters or other non-Chinese characters are included in the input text, the Chinese characters and the non-Chinese characters can be separated directly. Specifically, the process of segmenting the input text by taking the delay input duration of the target character string as a word segmentation condition comprises: obtaining the separation position of each separator included in the input text, and segmenting the input text into at least one text segment according to the separation position; and segmenting each text segment with reference to the delay input durations of the target character strings in that text segment.

A separator is a punctuation mark, a number, an English symbol, an English letter or another non-Chinese character. In the first step of segmenting the input text, the input text may be segmented according to the positions of the separators it contains: each separator is separated from the other characters in the input text, so that the input text is divided into a plurality of text segments, and word segmentation is then performed on the text segments.

For example, when the input text is "I sent you a message on QQ at 7 o'clock last night, did you see it", segmenting the input text according to the separators "7", "QQ" and "," separates these separators from the Chinese characters around them and leaves 4 text segments of Chinese characters; word segmentation processing is then performed on each of the 4 text segments. By adopting this embodiment, the input text is segmented in advance according to the separators, which reduces the amount of computation in the subsequent word segmentation processing and improves its efficiency.

It should be noted that the user may input the text one character or character string at a time by handwriting or through a soft keyboard, or may input a plurality of character strings at once by copying and pasting; in the latter case, all the pasted characters or character strings have the same input time. When the input text includes adjacent character strings that were pasted together, all the pasted characters are placed directly into one text segment according to the delay input duration, and word segmentation processing is then performed on that text segment.

In another optional embodiment, the process of segmenting the input text by using the delay input duration of the target character string as the segmentation condition specifically includes: performing word segmentation on the input text through a word segmentation component to obtain at least one target word; acquiring the end position of a target character string of which the corresponding delay input duration is greater than or equal to a first threshold in the target word; and segmenting the target word according to the end position.

That is to say, an input text that needs to be segmented is first segmented by another word segmentation algorithm, for example the minimum segmentation algorithm, to obtain at least one target word corresponding to the input text. Because the target words obtained may be incompletely segmented, for each target word obtained by the word segmentation component, the delay input durations corresponding to all characters or character strings included in the target word are obtained, and the target word is then segmented according to those delay input durations.

It should be noted that, in this embodiment, the word segmentation component may be based on any word segmentation algorithm (including a user-defined word segmentation algorithm) or on multiple word segmentation algorithms; for example, it may be based on one of, or a combination of, the forward maximum matching method, the reverse maximum matching method, the minimum segmentation method, the bidirectional matching method, and the full segmentation algorithm.

It should be noted that, in this embodiment, the at least one target word obtained from the input text by the word segmentation component may be incompletely segmented, so that further segmentation is required. In another alternative embodiment, however, the at least one target word obtained by the word segmentation component may be over-segmented, that is, a word consisting of more than one character that should not have been split is split apart. In this case, it is determined whether a target character string exists at the junction of adjacent target words among the target words obtained from the input text (that is, a target character string that is cut apart by the adjacent target words); if so, the two adjacent target words are merged.

Specifically, after the input text is segmented by the word segmentation component to obtain at least one target word, the method further comprises: acquiring a target character string at the junction between adjacent target words in the input text, and merging the adjacent target words. If no target character string exists at the junction between adjacent target words in the input text, the merging step is skipped, the end position of each target character string in the target words whose corresponding delay input duration is greater than or equal to the first threshold is acquired directly, and the target words are segmented according to the end position.
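A minimal sketch of this second ordering (word segmentation component first, delay-based refinement afterwards) is given below. It assumes that the target words and the input character strings both concatenate to the same input text, so that positions can be compared as character offsets; the function names and the sample data are hypothetical.

```python
from typing import List, Optional, Sequence, Set


def cut_offsets(
    strings: Sequence[str],
    delays: Sequence[Optional[float]],
    first_threshold: float,
) -> Set[int]:
    """Character offsets of the end positions of long-pause strings."""
    offsets: Set[int] = set()
    position = 0
    for string, delay in zip(strings, delays):
        position += len(string)
        if delay is not None and delay >= first_threshold:
            offsets.add(position)
    return offsets


def split_words_by_delay(
    words: Sequence[str],
    strings: Sequence[str],
    delays: Sequence[Optional[float]],
    first_threshold: float,
) -> List[str]:
    """Split each target word at any cut offset that falls inside it."""
    cuts = cut_offsets(strings, delays, first_threshold)
    result: List[str] = []
    position = 0  # offset of the current word's first character
    for word in words:
        start = 0
        for k in range(1, len(word)):
            if position + k in cuts:
                result.append(word[start:k])
                start = k
        result.append(word[start:])
        position += len(word)
    return result


# Assumed data: a coarse segmenter returned "tomorrow morning" as one word,
# but the user paused 2.0 s after typing "tomorrow".
coarse_words = ["明天早上", "七点"]
typed_strings = ["明天", "早上", "七点"]
typed_delays = [2.0, 0.4, None]
refined = split_words_by_delay(coarse_words, typed_strings, typed_delays, 1.8)
print(refined)
```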

For example, in one specific embodiment, the input text is "I sent you an email last night", and the at least one target word obtained by the word segmentation component is "I / yesterday / night / to / you / sent / one / piece / email" (English glosses in the original word order). Among the character string input operations by which the user entered the input text, "yesterday" and "night" were entered by a single character string input operation, that is, "last night" is one unsegmented target character string, and it lies at the junction between the target words "yesterday" and "night". In this case, the target words "yesterday" and "night" are merged, giving "I / last night / to / you / sent / one / piece / email".

In another embodiment, adjacent target words are merged not only when a target character string is split at their junction, but also when two adjacent target character strings meet at the junction of the adjacent target words and the interval between the input times of the two target character strings is sufficiently small.

That is, after the input text is segmented by the word segmentation component to obtain at least one target word, the method further comprises: acquiring the target character strings at the junction between adjacent target words in the input text (namely, the target character string in the preceding target word that adjoins the following target word, and the target character string in the following target word that adjoins the preceding target word), determining the delay input duration of the target character string, and merging the adjacent target words when the delay input duration is less than a preset second threshold.

For example, in one specific embodiment, the input text entered by the user is "I am visiting the Palace Museum today", and the at least one target word obtained by the word segmentation component is "I / today / visit / Palace / Museum". The target character strings at the junction between the target words "Palace" and "Museum" are the two target character strings "Palace" and "Museum". The interval between the character string input operation corresponding to "Palace" and that of the adjacent target character string "Museum" is calculated to be 0.5 s, and the preset second threshold is 1 s, so the delay input duration of the target character string "Palace" is less than the second threshold. In this case, the two target character strings "Palace" and "Museum" do not need to be separated, so the target words "Palace" and "Museum" are merged, giving "I / today / visit / Palace Museum".

Specifically, the timestamp corresponding to each character included in the input text is determined, and at least one target word is determined. The timestamp corresponding to the last character of a target word and the timestamp corresponding to the first character of the next adjacent target word are obtained, and the time interval between the two timestamps is calculated. If the time interval is less than a preset time threshold (namely, the second threshold, which is smaller than the first threshold), it is determined that the target word and the next adjacent target word should not be separated, and the segmentation between them can therefore be cancelled.
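The merging of over-segmented target words can be sketched as follows: a word boundary that falls strictly inside a single input character string is removed. The sketch assumes that the target words and the input character strings concatenate to the same text; the function name and the shortened Chinese sample (glossed "I / last night / sent / email") are hypothetical.

```python
from typing import List, Sequence, Set


def merge_straddled_words(words: Sequence[str], strings: Sequence[str]) -> List[str]:
    """Merge adjacent words whose junction lies inside one input string."""
    # Offsets strictly inside an input string (not at its start or end).
    inner_offsets: Set[int] = set()
    position = 0
    for string in strings:
        inner_offsets.update(range(position + 1, position + len(string)))
        position += len(string)

    merged: List[str] = []
    position = 0  # offset of the junction before the current word
    for word in words:
        if merged and position in inner_offsets:
            merged[-1] += word  # junction splits an input string: undo it
        else:
            merged.append(word)
        position += len(word)
    return merged


# The segmenter split "last night" into two words, but the user entered it
# with a single character string input operation.
over_segmented = ["我", "昨", "晚", "发", "邮件"]
typed_strings = ["我", "昨晚", "发", "邮件"]
merged_words = merge_straddled_words(over_segmented, typed_strings)
print(merged_words)
```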

For example, the second threshold may be 0.3s, and if the time interval between the input times corresponding to the two characters is less than 0.3s (for example, the time interval is 0s), the segmentation between the target word and the next adjacent target word is cancelled or the target word and the next adjacent target word are merged.
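The second-threshold merge can be sketched in the same offset-based style: when a word junction coincides with the end of an input character string whose delay input duration is below the second threshold, the two words are joined. The function name and the delay values other than the 0.5 s pause between "Palace" and "Museum" are assumptions for illustration.

```python
from typing import Dict, List, Optional, Sequence


def merge_fast_junctions(
    words: Sequence[str],
    strings: Sequence[str],
    delays: Sequence[Optional[float]],
    second_threshold: float,
) -> List[str]:
    """Merge adjacent words separated by a pause shorter than the threshold."""
    # Map the end offset of each input string to its delay input duration.
    delay_at_end: Dict[int, float] = {}
    position = 0
    for string, delay in zip(strings, delays):
        position += len(string)
        if delay is not None:
            delay_at_end[position] = delay

    merged: List[str] = []
    position = 0  # offset of the junction before the current word
    for word in words:
        delay = delay_at_end.get(position)
        if merged and delay is not None and delay < second_threshold:
            merged[-1] += word
        else:
            merged.append(word)
        position += len(word)
    return merged


# Glosses: I / today / visit / Palace / Museum.
target_words = ["我", "今天", "参观", "故宫", "博物馆"]
typed_strings = ["我", "今天", "参观", "故宫", "博物馆"]
typed_delays = [1.2, 1.5, 1.1, 0.5, None]  # only the 0.5 s value is from the text
merged_words = merge_fast_junctions(target_words, typed_strings, typed_delays, 1.0)
print(merged_words)
```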

In another embodiment, different users input text at different speeds; for example, a user who frequently uses a computer may type significantly faster than an older user who seldom uses one. In this case, if the same time threshold is used for all users when word segmentation processing is performed according to the delay input duration, the word segmentation may be incomplete or inaccurate.

Therefore, in a specific embodiment, the word segmentation method further includes: detecting a text input speed in the text input component, and determining the first threshold according to the text input speed. That is, when a user inputs text through the text input component, the speed at which the user inputs text, for example the average number of characters input per unit time, is detected, and the specific value of the first threshold is then determined from the text input speed. For example, the system presets a correspondence between text input speeds and values of the first threshold; after text input through the text input component is detected, the first threshold corresponding to the detected text input speed is looked up in the preset correspondence and used as the first threshold in the word segmentation processing.

In another alternative embodiment, the text input speed may be not only the speed detected while text is being input through the text input component, but also a text input speed determined from the user's historical character string input operations. For example, historical data on the text input speed of the user's character string input operations or text input operations is obtained, the user's text input speed is determined from the historical data, and the first threshold corresponding to that text input speed is then determined.

In this embodiment, the historical data on the text input speed of the user's character string input operations or text input operations may be historical data for text input operations of the currently logged-in account (for example, an account logged in to an input method or to a word segmentation application), or historical data for text input operations input on the current terminal.
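A preset correspondence between text input speed and first threshold could be represented as a small lookup table. In the sketch below, the breakpoints, the threshold values and both function names are purely hypothetical; only the idea that faster typists receive a smaller first threshold follows from the text.

```python
import bisect

# Hypothetical correspondence: speeds in characters per second.
#   speed < 1.0        -> 3.0 s
#   1.0 <= speed < 2.0 -> 1.8 s
#   2.0 <= speed < 4.0 -> 1.0 s
#   speed >= 4.0       -> 0.6 s
SPEED_BREAKPOINTS = [1.0, 2.0, 4.0]
FIRST_THRESHOLDS = [3.0, 1.8, 1.0, 0.6]


def text_input_speed(char_count: int, elapsed_seconds: float) -> float:
    """Average number of characters input per second."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return char_count / elapsed_seconds


def first_threshold_for(speed: float) -> float:
    """Look up the first threshold for a detected text input speed."""
    return FIRST_THRESHOLDS[bisect.bisect_right(SPEED_BREAKPOINTS, speed)]


slow = first_threshold_for(text_input_speed(5, 10.0))    # 0.5 chars/s
fast = first_threshold_for(text_input_speed(50, 10.0))   # 5.0 chars/s
print(slow, fast)
```

The same lookup could be fed with a speed derived from historical input data instead of the speed measured in the current session.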

In this embodiment, if the application scene showing the text input component is a dialog page of a chat window, after the user inputs a text to be sent, the user clicks a send button to send the text input by the text input component, and in this case, the input text is subjected to word segmentation processing. That is, when a user inputs through the text input component, only the text input operation of the user is detected through the text input component, and a character string or a text input by the text input operation and a time stamp corresponding thereto are detected, and then when the user clicks a send button, the input text is subjected to word segmentation processing. In other application scenarios, the user may also trigger the execution of the word segmentation through other operations, for example, submission of the input text, or import of the input text, or saving of the input text, and the like.

In a specific embodiment, when a user inputs a text input operation through a text input component, a target character string and a time stamp corresponding to each character string input operation input by the user are detected, and after the user completes all text inputs, the target character string and the corresponding time stamp are sent to a segmentation module for processing.
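The recording step can be sketched as a small class that stores each target character string with its timestamp and derives the delay input durations only when the user finishes the input. The class and method names are hypothetical, and timestamps are assumed to be seconds supplied by the caller (for example from the text input component's event callback).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class InputRecorder:
    """Collects (target character string, timestamp) pairs during input."""

    events: List[Tuple[str, float]] = field(default_factory=list)

    def on_string_input(self, string: str, timestamp: float) -> None:
        """Record one character string input operation."""
        self.events.append((string, timestamp))

    def input_text(self) -> str:
        """The input text: target strings spliced in input order."""
        return "".join(string for string, _ in self.events)

    def delay_durations(self) -> List[Optional[float]]:
        """Interval between each string and the next; None for the last."""
        delays: List[Optional[float]] = [
            next_time - time
            for (_, time), (_, next_time) in zip(self.events, self.events[1:])
        ]
        if self.events:
            delays.append(None)
        return delays


recorder = InputRecorder()
recorder.on_string_input("参观", 10.0)    # "visit"
recorder.on_string_input("故宫", 11.5)    # "Palace"
recorder.on_string_input("博物馆", 12.0)  # "Museum"

# On "send", the strings and delays would be handed to the segmentation step.
text = recorder.input_text()
delays = recorder.delay_durations()
print(text, delays)
```

Strings pasted together would share one timestamp in this sketch and therefore yield a delay of 0 between them.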

In addition, in order to solve the technical problem that the word segmentation method for segmenting the Chinese text input by the user in the traditional technology has insufficient word segmentation accuracy, the embodiment further provides a word segmentation device.

Specifically, as shown in fig. 5, the word segmentation apparatus includes a text input operation detection module 102, an input text display module 104, a delay input duration calculation module 106, and a word segmentation module 108, where:

a text input operation detection module 102, configured to detect a text input operation input in a text input component, where the text input operation includes one or more character string input operations that are input sequentially;

an input text display module 104, configured to display an input text formed by sequentially splicing target character strings input by the character string input operation in the text input component;

a delay input duration calculation module 106, configured to obtain a delay input duration of the target character string, where the delay input duration is an interval duration between a character string input operation corresponding to the target character string and an adjacent character string input operation;

and the word segmentation module 108 is configured to segment the input text by using the delay input duration of the target character string as a word segmentation condition.

Optionally, in an embodiment, the word segmentation module 108 is further configured to segment the input text by a word segmentation component to obtain at least one target word; acquiring the end position of a target character string of which the corresponding delay input duration is greater than or equal to a first threshold in the target word; and segmenting the target word according to the end position.

Optionally, in an embodiment, the word segmentation module 108 is further configured to obtain a target character string at a junction between adjacent target words in the input text, and merge the adjacent target words.

Optionally, in an embodiment, the word segmentation module 108 is further configured to obtain an end position of a target character string in the input text, where a corresponding delay input duration is greater than or equal to a first threshold; segmenting the input text into at least one text segment according to the end position; and performing word segmentation processing on the text segments respectively.

Optionally, in an embodiment, the delayed input duration calculation module 106 is further configured to obtain a first time stamp of a character string input operation corresponding to the target character string and a second time stamp of a corresponding adjacent character string input operation; and determining the delayed input time length of the target character string according to the first time stamp and the second time stamp.

Optionally, in an embodiment, the word segmentation module 108 is further configured to obtain a separation position of a separator included in the input text, and segment the input text into at least one text segment according to the separation position; and segmenting the text segments by respectively referring to the delayed input time lengths of the target character strings in the text segments.

Optionally, in an embodiment, as shown in fig. 5, the apparatus further includes a threshold determining module 110, configured to detect a text input speed in the text input component, and determine the first threshold according to the text input speed.

The embodiment of the invention has the following beneficial effects:

In the above embodiments, the implementation may be realized in whole or in part by software, hardware, firmware, or any combination thereof. When implemented by a software program, it may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

In one embodiment, fig. 6 illustrates a terminal, namely a computer system based on the von Neumann architecture, that runs the above word segmentation method. The computer system may be a terminal device such as a smart phone, a tablet computer, a palmtop computer, a notebook computer or a personal computer. Specifically, it may include an external input interface 1001, a processor 1002, a memory 1003, and an output interface 1004 connected through a system bus. The external input interface 1001 may optionally include at least a network interface 10012. The memory 1003 may include an external memory 10032 (e.g., a hard disk, an optical disk, or a floppy disk) and an internal memory 10034. The output interface 1004 may include at least a display 10042 or the like.

In this embodiment, the method is executed based on a computer program whose program files are stored in the external memory 10032 of the computer system based on the von Neumann architecture, loaded into the internal memory 10034 at run time, compiled into machine code, and then transferred to the processor 1002 for execution, so that the text input operation detection module 102, the input text display module 104, the delay input duration calculation module 106, the word segmentation module 108, and the threshold determination module 110 are logically formed in the computer system. In the process of executing the word segmentation method, the input parameters are received through the external input interface 1001, transferred to the memory 1003 for caching, and then input to the processor 1002 for processing; the processed result data is cached in the memory 1003 for subsequent processing or transferred to the output interface 1004 for output.

Specifically, the processor 1002 is configured to perform the following operations:

Optionally, in one embodiment, the processor 1002 is further configured to perform word segmentation on the input text through a word segmentation component to obtain at least one target word; acquiring the end position of a target character string of which the corresponding delay input duration is greater than or equal to a first threshold in the target word; and segmenting the target word according to the end position.

Optionally, in one embodiment, the processor 1002 is further configured to obtain a target character string at a junction between adjacent target words in the input text, and merge the adjacent target words.

Optionally, in one embodiment, the processor 1002 is further configured to obtain an end position of a target character string in the input text, where a corresponding delay input duration is greater than or equal to a first threshold; segmenting the input text into at least one text segment according to the end position; and performing word segmentation processing on the text segments respectively.

Optionally, in one embodiment, the processor 1002 is further configured to obtain a first time stamp of a character string input operation corresponding to the target character string, and a second time stamp of a corresponding adjacent character string input operation; and determining the delayed input time length of the target character string according to the first time stamp and the second time stamp.

Optionally, in one embodiment, the processor 1002 is further configured to obtain a separation position of a separator included in the input text, and cut the input text into at least one text segment according to the separation position; and segmenting the text segments by respectively referring to the delayed input time lengths of the target character strings in the text segments.

Optionally, in one embodiment, the processor 1002 is further configured to detect a text input speed in the text input component, and determine the first threshold according to the text input speed.

The above disclosure describes only preferred embodiments of the present invention and is not intended to limit the scope of the claims of the invention.

Claims

1. A method of word segmentation, comprising:

detecting a text input operation input in a text input component, wherein the text input operation comprises one or more than one character string input operation which is input sequentially, and one character string input operation corresponds to a character string which is input in the text input component at one time;

performing word segmentation on the input text by taking the delayed input duration of the target character string and a first threshold as word segmentation conditions, wherein the word segmentation comprises the following steps: performing word segmentation processing on the input text according to the end position of the target character string with the delayed input duration being greater than or equal to a first threshold value, wherein the first threshold value is determined according to the text input speed in the text input component, and the text input component comprises an input box or a text editing window.

2. The word segmentation method according to claim 1, wherein the performing word segmentation on the input text according to the end position of the target character string with the delayed input duration being greater than or equal to the first threshold is:

performing word segmentation on the input text through a word segmentation component to obtain at least one target word;

acquiring the end position of a target character string of which the corresponding delay input duration is greater than or equal to a first threshold in the target word;

and segmenting the target word according to the end position.

3. The word segmentation method according to claim 2, wherein, after the input text is segmented by the word segmentation component to obtain at least one target word, the method further comprises:

and acquiring a target character string at a joint between adjacent target words in the input text, and merging the adjacent target words.

4. The word segmentation method according to claim 1, wherein the performing word segmentation on the input text according to the end position of the target character string with the delayed input duration being greater than or equal to the first threshold is:

acquiring the end position of a target character string in the input text, wherein the corresponding delay input duration is greater than or equal to a first threshold;

segmenting the input text into at least one text segment according to the end position;

and performing word segmentation processing on the text segments respectively.

5. The word segmentation method according to claim 1, wherein the obtaining of the delay input duration of the target character string is:

acquiring a first time stamp of a character string input operation corresponding to the target character string and a second time stamp of a corresponding adjacent character string input operation;

and determining the delayed input time length of the target character string according to the first time stamp and the second time stamp.

6. The word segmentation method according to claim 1, wherein segmenting the input text by using the delayed input duration of the target character string as a word segmentation condition comprises:

acquiring the separation positions of separators contained in the input text, and cutting the input text into at least one text segment according to the separation positions;

and segmenting each text segment with reference to the delayed input durations of the target character strings in that text segment.
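A minimal Python sketch of this two-stage cut follows. The separator set, the function names, and the sample data are assumptions. For simplicity, the second stage cuts each text segment purely at pause positions, that is, at the end offsets of target character strings whose delayed input duration reached the first threshold. The claim allows other ways of referring to the durations.

```python
import re
from typing import List, Tuple

# Hypothetical separator set: common Chinese and ASCII punctuation, whitespace.
SEPARATORS = re.compile(r"[,。!?;、,.!?;\s]+")


def cut_by_separators(text: str) -> List[Tuple[int, str]]:
    """Return (start_offset, segment) pairs for the runs between separators."""
    segments = []
    start = 0
    for match in SEPARATORS.finditer(text):
        if match.start() > start:
            segments.append((start, text[start:match.start()]))
        start = match.end()
    if start < len(text):
        segments.append((start, text[start:]))
    return segments


def segment_by_pauses(text: str, pause_positions: List[int]) -> List[str]:
    """Cut each separator-delimited segment at the pause positions inside it."""
    words = []
    for start, segment in cut_by_separators(text):
        end = start + len(segment)
        cuts = sorted(p for p in set(pause_positions) if start < p < end)
        previous = start
        for cut in cuts + [end]:
            words.append(text[previous:cut])
            previous = cut
    return words


# The pause position 5 is the end offset of "请把" in the full input text.
print(segment_by_pauses("你好,请把手拿开", pause_positions=[5]))
# ['你好', '请把', '手拿开']
```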

7. The word segmentation method according to claim 2 or 4, characterized in that the method further comprises:

detecting a text input speed in the text input component, and determining the first threshold value according to the text input speed.
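The claim does not state how the first threshold depends on the text input speed. One plausible rule, shown here purely as an assumption, scales the threshold with the user's average interval between input operations, so that a slower typist gets a larger threshold. The function name, the `factor` parameter, and the default value are hypothetical.

```python
from typing import List


def first_threshold_from_speed(timestamps_ms: List[int],
                               factor: float = 2.0,
                               default_ms: float = 500.0) -> float:
    """Derive the first threshold from the mean interval between operations.

    Assumption: a pause counts as significant when it is at least `factor`
    times the user's average interval. With fewer than two operations there
    is no interval to measure, so a default is returned.
    """
    if len(timestamps_ms) < 2:
        return default_ms
    intervals = [later - earlier
                 for earlier, later in zip(timestamps_ms, timestamps_ms[1:])]
    mean_interval = sum(intervals) / len(intervals)
    return factor * mean_interval


print(first_threshold_from_speed([0, 100, 300]))
# 300.0
```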

8. A word segmentation device, comprising:

a text input operation detection module, configured to detect a text input operation entered in a text input component, wherein the text input operation comprises one or more character string input operations entered in sequence, and each character string input operation corresponds to a character string entered in the text input component at one time;

a word segmentation module, configured to segment the input text by using the delayed input duration of the target character string and a first threshold as word segmentation conditions, including: performing word segmentation processing on the input text according to the end position of the target character string whose delayed input duration is greater than or equal to the first threshold, wherein the first threshold is determined according to the text input speed in the text input component, and the text input component comprises an input box or a text editing window.

9. The word segmentation device according to claim 8, wherein the word segmentation module is further configured to: perform word segmentation on the input text through a word segmentation component to obtain at least one target word; acquire, within the target word, the end position of a target character string whose corresponding delayed input duration is greater than or equal to the first threshold; and segment the target word according to the end position.

10. The word segmentation device according to claim 9, wherein the word segmentation module is further configured to acquire a target character string located at a junction between adjacent target words in the input text, and merge the adjacent target words.

11. The word segmentation device according to claim 8, wherein the word segmentation module is further configured to: acquire the end position of a target character string in the input text whose corresponding delayed input duration is greater than or equal to the first threshold; segment the input text into at least one text segment according to the end position; and perform word segmentation processing on each text segment.

12. The word segmentation device according to claim 8, wherein the delayed input duration calculation module is further configured to: acquire a first timestamp of the character string input operation corresponding to the target character string and a second timestamp of the corresponding adjacent character string input operation; and determine the delayed input duration of the target character string according to the first timestamp and the second timestamp.

13. The word segmentation device according to claim 8, wherein the word segmentation module is further configured to: acquire the separation positions of separators contained in the input text, and cut the input text into at least one text segment according to the separation positions; and segment each text segment with reference to the delayed input durations of the target character strings in that text segment.

14. The word segmentation device according to claim 9 or 11, further comprising a threshold determination module configured to detect a text input speed in the text input component and determine the first threshold according to the text input speed.

15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed, carries out the method of any one of claims 1-7.