WO2020052069A1 - 用于分词的方法和装置 - Google Patents


Info

Publication number: WO2020052069A1
Application number: PCT/CN2018/116345
Authority: WIPO (PCT)
Prior art keywords: vocabulary, text, preset, information, sequence
Other languages: English (en), French (fr)
Inventor: 邓江东
Original assignee: 北京字节跳动网络技术有限公司
Application filed by 北京字节跳动网络技术有限公司
Priority to US16/981,273, published as US20210042470A1
Publication of WO2020052069A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Definitions

  • the embodiments of the present application relate to the field of computer technology, and in particular, to a method and device for word segmentation.
  • In this context, word segmentation refers to Chinese word segmentation.
  • Through word segmentation, a sequence of Chinese characters can be cut into one or more words.
  • Word segmentation is the basis of text mining.
  • After word segmentation, the computer can automatically recognize the meaning of a sentence.
  • This method of making the computer automatically recognize the meaning of a sentence through word segmentation is also called the mechanical word segmentation method. Its main principle is to match the Chinese character string to be analyzed against the entries in a preset machine dictionary according to a certain strategy, so as to determine the target entries corresponding to the character string to be analyzed.
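The dictionary-matching principle described above can be sketched as a greedy maximum forward matching pass. This is a simplified illustration, not the method claimed in this application; the dictionary and text below are invented for the example:

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy maximum forward matching: at each position, take the longest
    dictionary entry starting there; fall back to a single character."""
    words = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

dictionary = {"今天", "天气"}
print(forward_max_match("今天天气", dictionary))  # ['今天', '天气']
```

A real mechanical segmenter would combine this with other strategies (reverse matching, minimum word count) to resolve ambiguities.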
  • the embodiments of the present application propose a method and device for word segmentation.
  • In a first aspect, an embodiment of the present application provides a method for word segmentation.
  • The method includes: obtaining a preset vocabulary set and a text to be segmented, where the preset vocabulary set is a vocabulary set generated in advance based on a preset text set.
  • Each vocabulary in the preset vocabulary set includes first information and second information. The first information characterizes the probability that the vocabulary appears in the preset text set; the second information characterizes the conditional probability that the vocabulary appears in the preset text set, given that a vocabulary other than it appears. The method further includes: segmenting the text to be segmented based on the preset vocabulary set to obtain at least one vocabulary sequence; for each vocabulary sequence, determining the first information and the second information of the words in the sequence (where, for a word in the sequence, the second information is the second information determined based on the words adjacent to that word), and determining the probability of the vocabulary sequence based on the determined first and second information; and selecting the vocabulary sequence with the highest probability as the word segmentation result.
  • In some embodiments, determining the probability of the vocabulary sequence based on the determined first information and second information includes: connecting adjacent words in the vocabulary sequence to generate a segmentation path, where the nodes of the segmentation path are represented by the words in the vocabulary sequence and the edges of the segmentation path are the lines connecting the words; determining the weights of the edges of the segmentation path based on the first and second information of the words in the vocabulary sequence; and determining the probability of the vocabulary sequence based on the determined weights.
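One way to read the path scoring above: the first word of the sequence contributes its first information, and each edge between adjacent words contributes the conditional probability of the later word given the earlier one. A minimal sketch, with illustrative probability values that are assumptions of this example, not values from the application:

```python
# First information: P(word appears in the preset text set) -- illustrative.
first_info = {"今天": 1.0, "天气": 0.5}
# Second information: P(word | previous word appears) -- illustrative.
second_info = {("今天", "天气"): 0.5}

def path_probability(sequence):
    """Score one segmentation path: the first node contributes its first
    information; each edge contributes the second information of the
    later word conditioned on the earlier word."""
    prob = first_info[sequence[0]]
    for prev, cur in zip(sequence, sequence[1:]):
        prob *= second_info.get((prev, cur), 0.0)
    return prob

print(path_probability(["今天", "天气"]))  # 0.5
```

The sequence with the highest such score among all candidate paths would then be selected as the segmentation result.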
  • In some embodiments, the second information of a word is the second information determined based on the word that is adjacent to it and located before it.
  • In some embodiments, determining the second information of the words in the vocabulary sequence includes, for each word in the sequence, performing the following steps: determining whether the vocabulary sequence includes a word adjacent to and preceding that word; and, in response to determining that it does, determining the second information of that word based on the adjacent, preceding word.
  • In some embodiments, the preset vocabulary set is obtained by the following generating steps: obtaining the preset text set and a sample word segmentation result pre-labeled for each preset text in the set; taking the preset texts as input and the corresponding sample word segmentation results as expected output, training a word segmentation model using a machine learning method; segmenting the preset texts with the word segmentation model to obtain first segmentation results; generating an initial vocabulary set based on the obtained first segmentation results, where the words in the initial vocabulary set include first information determined from those results; segmenting the preset texts based on the initial vocabulary set to obtain second segmentation results; and generating the preset vocabulary set based on the initial vocabulary set and the obtained second segmentation results, where the words in the preset vocabulary set include the first information and second information determined from the second segmentation results.
  • In some embodiments, training to obtain a word segmentation model includes training at least two predetermined initial models to obtain at least two word segmentation models; and segmenting the preset texts with the word segmentation model includes segmenting the preset texts with the at least two word segmentation models to obtain at least two first segmentation results.
  • In some embodiments, before generating the initial vocabulary set based on the obtained first segmentation results, the generating step further includes extracting the words common to the obtained at least two first segmentation results; and generating the initial vocabulary set includes generating it based on the extracted words and the obtained first segmentation results.
  • In some embodiments, segmenting the text to be segmented to obtain at least one vocabulary sequence includes: matching the text to be segmented against a preset text format to determine whether the text includes text matching that format; and, in response to determining that it does, segmenting the text based on the preset vocabulary set and the matched text to obtain at least one vocabulary sequence, where the vocabulary sequence includes the matched text.
  • In some embodiments, segmenting the text to be segmented to obtain at least one vocabulary sequence includes: performing named entity recognition on the text to be segmented to determine whether the text includes a named entity; and, in response to determining that it does, segmenting the text based on the preset vocabulary set and the determined named entity to obtain at least one vocabulary sequence, where the vocabulary sequence includes the determined named entity.
  • In some embodiments, the method further includes: obtaining a preset candidate vocabulary set, where the words in the candidate vocabulary set represent at least one of the following: a movie name, a TV series name, a music name; matching the word segmentation result against the words in the candidate vocabulary set to determine whether the result includes a phrase (at least two adjacent words) matching a word in the candidate vocabulary set; and, in response to determining that it does, determining the matching phrase as a new word and generating a new word segmentation result including the new word.
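The phrase-merging step can be sketched as a scan that fuses runs of adjacent tokens whose concatenation matches a candidate entry. The candidate set and tokens below are invented for illustration:

```python
def merge_candidate_phrases(tokens, candidates, max_span=3):
    """Merge adjacent tokens whose concatenation matches a candidate
    vocabulary entry (e.g. a movie or song title) into one new word."""
    merged = []
    i = 0
    while i < len(tokens):
        for span in range(min(max_span, len(tokens) - i), 1, -1):
            phrase = "".join(tokens[i:i + span])
            if phrase in candidates:
                merged.append(phrase)  # treat the matched phrase as a new word
                i += span
                break
        else:
            merged.append(tokens[i])
            i += 1
    return merged

candidates = {"流浪地球"}  # hypothetical movie name
print(merge_candidate_phrases(["流浪", "地球", "很", "好看"], candidates))
# ['流浪地球', '很', '好看']
```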
  • In a second aspect, the present application provides a device for word segmentation.
  • The device includes: a first obtaining unit configured to obtain a preset vocabulary set and a text to be segmented, where the preset vocabulary set is a vocabulary set generated in advance based on a preset text set.
  • Each vocabulary in the preset vocabulary set includes first information and second information. The first information characterizes the probability that the vocabulary appears in the preset text set.
  • The second information characterizes the conditional probability that the vocabulary appears in the preset text set, given that a vocabulary other than it appears.
  • The device further includes a text segmentation unit configured to segment the text to be segmented based on the preset vocabulary set to obtain at least one vocabulary sequence;
  • a probability determination unit configured to determine, for each of the at least one vocabulary sequence, the first and second information of the words in the sequence, and to determine the probability of the vocabulary sequence based on the determined first and second information, where, for a word in the sequence, the second information of the word is the second information determined based on the words adjacent to it;
  • and a sequence selection unit configured to select the vocabulary sequence with the highest probability from the at least one vocabulary sequence as the word segmentation result.
  • In some embodiments, the probability determination unit includes: a path generation module configured to connect adjacent words in the vocabulary sequence to generate a segmentation path, where the nodes of the segmentation path are represented by the words in the sequence and the edges of the segmentation path are the lines connecting the words; a weight determination module configured to determine the weights of the edges of the segmentation path based on the first and second information of the words in the sequence; and a probability determination module configured to determine the probability of the vocabulary sequence based on the determined weights.
  • In some embodiments, the second information of a word is the second information determined based on the word that is adjacent to it and located before it.
  • In some embodiments, the probability determination unit is further configured to perform the following steps for each word in the vocabulary sequence: determine whether the vocabulary sequence includes a word adjacent to and preceding that word; and, in response to determining that it does, determine the second information of that word based on the adjacent, preceding word.
  • In some embodiments, the preset vocabulary set is obtained by the following generating steps: obtaining the preset text set and a sample word segmentation result pre-labeled for each preset text in the set; taking the preset texts as input and the corresponding sample word segmentation results as expected output, training a word segmentation model using a machine learning method; segmenting the preset texts with the word segmentation model to obtain first segmentation results; generating an initial vocabulary set based on the obtained first segmentation results, where the words in the initial vocabulary set include first information determined from those results; segmenting the preset texts based on the initial vocabulary set to obtain second segmentation results; and generating the preset vocabulary set based on the initial vocabulary set and the obtained second segmentation results, where the words in the preset vocabulary set include the first information and second information determined from the second segmentation results.
  • In some embodiments, training to obtain a word segmentation model includes training at least two predetermined initial models to obtain at least two word segmentation models; and segmenting the preset texts with the word segmentation model includes segmenting them with the at least two word segmentation models to obtain at least two first segmentation results.
  • In some embodiments, before generating the initial vocabulary set based on the obtained first segmentation results, the generating step further includes extracting the words common to the at least two first segmentation results; and generating the initial vocabulary set includes generating it based on the extracted words and the obtained first segmentation results.
  • In some embodiments, the text segmentation unit includes: a text matching module configured to match the text to be segmented against a preset text format to determine whether the text includes text matching that format; and a first segmentation module configured to, in response to determining that it does, segment the text based on the preset vocabulary set and the matched text to obtain at least one vocabulary sequence, where the vocabulary sequence includes the matched text.
  • In some embodiments, the text segmentation unit includes: a text recognition module configured to perform named entity recognition on the text to be segmented to determine whether the text includes a named entity; and a second segmentation module configured to, in response to determining that it does, segment the text based on the preset vocabulary set and the determined named entity to obtain at least one vocabulary sequence, where the vocabulary sequence includes the determined named entity.
  • In some embodiments, the apparatus further includes: a second obtaining unit configured to obtain a preset candidate vocabulary set, where the words in the candidate vocabulary set represent at least one of the following: a movie name, a TV series name, a music name; a vocabulary matching unit configured to match the word segmentation result against the words in the candidate vocabulary set to determine whether the result includes a phrase (at least two adjacent words) matching a word in the candidate vocabulary set; and a result generation unit configured to, in response to determining that it does, determine the matching phrase as a new word and generate a new word segmentation result including the new word.
  • In yet another aspect, an embodiment of the present application provides an electronic device including: one or more processors; and a storage device storing one or more programs which, when executed by the one or more processors, cause the one or more processors to implement any one of the foregoing methods for word segmentation.
  • In yet another aspect, an embodiment of the present application provides a computer-readable medium storing a computer program that, when executed by a processor, implements any one of the foregoing methods for word segmentation.
  • The method and device for word segmentation provided by the embodiments of the present application obtain a preset vocabulary set and a text to be segmented, where the preset vocabulary set is a vocabulary set generated in advance based on a preset text set.
  • Each vocabulary in the preset vocabulary set includes first information, characterizing the probability that the vocabulary appears in the preset text set, and second information, characterizing the conditional probability that the vocabulary appears in the preset text set given that a vocabulary other than it appears. The text to be segmented is then segmented based on the preset vocabulary set to obtain at least one vocabulary sequence.
  • For each vocabulary sequence, the first and second information of its words are determined (the second information of a word being determined from the words adjacent to it), and the probability of the sequence is determined from the determined first and second information. Finally, the most probable vocabulary sequence is selected from the at least one vocabulary sequence as the word segmentation result. The first and second information of the words in the text to be segmented are thus used effectively in determining the segmentation result, improving the accuracy of segmentation.
  • FIG. 1 is an exemplary system architecture diagram to which an embodiment of the present application can be applied;
  • FIG. 2 is a flowchart of an embodiment of a method for word segmentation according to the present application
  • FIG. 3 is a schematic diagram of an application scenario of a method for word segmentation according to an embodiment of the present application
  • FIG. 4 is a flowchart of still another embodiment of a method for word segmentation according to the present application.
  • FIG. 5 is a schematic structural diagram of an embodiment of a device for word segmentation according to the present application.
  • FIG. 6 is a schematic structural diagram of a computer system suitable for implementing an electronic device according to an embodiment of the present application.
  • FIG. 1 illustrates an exemplary system architecture 100 of an embodiment of a method for word segmentation or an apparatus for word segmentation to which the present application can be applied.
  • the system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105.
  • the network 104 is a medium for providing a communication link between the terminal devices 101, 102, 103 and the server 105.
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, and so on.
  • the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like.
  • Various communication client applications can be installed on the terminal devices 101, 102, 103, such as language processing software, web browser applications, search applications, instant communication tools, email clients, social platform software, and so on.
  • the terminal devices 101, 102, and 103 may be hardware or software.
  • When the terminal devices 101, 102, 103 are hardware, they can be various electronic devices with a display screen, including but not limited to smartphones, tablets, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop portable computers, desktop computers, and so on.
  • When the terminal devices 101, 102, and 103 are software, they can be installed in the electronic devices listed above, and can be implemented as multiple pieces of software or software modules (for example, to provide distributed services) or as a single piece of software or software module. No specific limitation is made here.
  • the server 105 may be a server that provides various services, for example, a text processing server that segments the text to be segmented sent by the terminal devices 101, 102, and 103.
  • the text processing server may analyze and process the received data such as the text to be segmented to obtain a processing result (for example, a segmentation result).
  • the server may be hardware or software.
  • the server can be implemented as a distributed server cluster consisting of multiple servers or as a single server.
  • the server can be implemented as multiple software or software modules (for example, multiple software or software modules used to provide distributed services), or it can be implemented as a single software or software module. It is not specifically limited here.
  • the numbers of terminal devices, networks, and servers in FIG. 1 are merely exemplary. According to implementation needs, there can be any number of terminal devices, networks, and servers.
  • the above system architecture may not include a network, but only a terminal device or a server.
  • the method for word segmentation includes the following steps:
  • Step 201: Obtain a preset vocabulary set and a text to be segmented.
  • In this embodiment, the execution subject of the method for word segmentation (for example, the server shown in FIG. 1) may obtain the preset vocabulary set and the text to be segmented, for example from a terminal (such as a terminal device shown in FIG. 1).
  • The text to be segmented may be a phrase, a sentence, or an article containing words.
  • The preset vocabulary set is a vocabulary set used for word segmentation.
  • The preset vocabulary set may be generated in advance based on the preset text set.
  • A preset text is a text predetermined by a technician for obtaining the vocabulary set used for word segmentation.
  • For example, a preset text may be a search term, i.e. a word, phrase, or sentence used for searching.
  • the vocabulary in the preset vocabulary set includes first information and second information.
  • the first information is used to represent a probability that a vocabulary appears in a preset text set, and may include, but is not limited to, at least one of the following: text, numbers, and symbols.
  • The second information of a vocabulary is used to characterize the conditional probability that the vocabulary appears in the preset text set, given that a vocabulary other than it appears.
  • Like the first information, it may include, but is not limited to, at least one of the following: text, numbers, and symbols.
  • As an example, suppose the preset text set includes two preset texts: "Today's weather" and "Today's sunshine makes my mood shine".
  • The preset vocabulary set obtained based on this preset text set may include the following words: "today", "weather", "sunshine", "mood".
  • Given that the word "weather" appears, the probability that "today" appears is 100%, so the second information of "today" relative to "weather" may be "second: 100%"; given that "sunshine" appears, the probability that "today" appears is 100%, so the second information of "today" relative to "sunshine" may be "second: 100%"; given that "mood" appears, the probability that "today" appears is 100%, so the second information of "today" relative to "mood" may be "second: 100%".
  • Given that "today" appears, the probability that "weather" appears is 50%, so the second information of "weather" relative to "today" may be "second: 50%"; given that "sunshine" appears, the probability that "weather" appears is 0%, so the second information of "weather" relative to "sunshine" may be "second: 0%"; given that "mood" appears, the probability that "weather" appears is 0%, so the second information of "weather" relative to "mood" may be "second: 0%".
  • Similarly, the second information of "sunshine" relative to "today" may be "second: 50%", relative to "weather" "second: 0%", and relative to "mood" "second: 100%".
  • The second information of "mood" relative to "today" may be "second: 50%", relative to "weather" "second: 0%", and relative to "sunshine" "second: 100%".
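The first and second information in this example can be computed mechanically from the toy corpus. A small sketch, treating each preset text as a set of words; the corpus below mirrors the example above:

```python
from itertools import permutations

# Toy corpus: each preset text, already segmented into words.
texts = [["today", "weather"], ["today", "sunshine", "mood"]]
n = len(texts)
vocab = {w for t in texts for w in t}

# First information: fraction of preset texts in which the word appears.
first_info = {w: sum(w in t for t in texts) / n for w in vocab}

# Second information: P(word appears | other word appears).
second_info = {}
for w, other in permutations(vocab, 2):
    with_other = [t for t in texts if other in t]
    second_info[(w, other)] = sum(w in t for t in with_other) / len(with_other)

print(first_info["today"])                # 1.0
print(second_info[("weather", "today")])  # 0.5  (matches "second: 50%")
print(second_info[("today", "weather")])  # 1.0  (matches "second: 100%")
```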
  • the foregoing preset vocabulary set may be obtained through the following generating steps:
  • Step 2011 Obtain a preset text set and a sample word segmentation result pre-labeled for the preset text in the preset text set.
  • the sample word segmentation result may be a result marked in advance by a technician.
  • The word segmentation result can be a vocabulary sequence composed of the words obtained by segmentation. For example, for the preset text "Today's weather", the corresponding sample segmentation result can be the sample vocabulary sequence "today"; "weather".
  • step 2012 the preset text in the preset text set is used as an input, and the sample word segmentation result corresponding to the input preset text is used as an expected output.
  • a machine learning method is used to train and obtain a segmentation model.
  • the segmentation model can be used to characterize the correspondence between text and segmentation results.
  • The word segmentation model can be trained based on various existing models for language processing, such as a CRF (Conditional Random Field) or an HMM (Hidden Markov Model). It should be noted that training a word segmentation model is a well-known technique that is widely studied and applied at present, and is not described again here.
  • At least two initial models determined in advance may be trained to obtain at least two word segmentation models.
  • the initial model and the word segmentation model correspond one-to-one.
  • CRF and HMM can be used as two initial models for training to obtain the word segmentation model, and then two word segmentation models (including the word segmentation model corresponding to CRF and the word segmentation model corresponding to HMM) can be trained.
  • step 2013, the word segmentation model is used to segment the preset text in the preset text set to obtain a first segmentation result.
  • the preset text may be input into the word segmentation model obtained in step 2012 to obtain a segmentation result, and the obtained segmentation result is determined as the first segmentation result.
  • Correspondingly, when at least two word segmentation models have been trained, this step may use the at least two word segmentation models to segment the preset texts in the preset text set, obtaining at least two first segmentation results.
  • the first segmentation result corresponds to the segmentation model one-to-one.
  • Step 2014 Based on the obtained first segmentation result, an initial vocabulary set is generated.
  • the vocabulary in the initial vocabulary set includes first information determined based on the obtained first segmentation result.
  • a vocabulary may be selected from the obtained first segmentation result as a vocabulary in an initial vocabulary set. Then, for each vocabulary in the selected vocabulary, the probability that the vocabulary appears in the obtained first segmentation result is determined, and first information of the vocabulary is generated. Furthermore, an initial vocabulary set may be generated based on the selected vocabulary and the first information of the vocabulary.
  • Specifically, all words in the obtained first segmentation result may be directly taken as the words of the initial vocabulary set; or, a subset of the words may be selected from the obtained first segmentation result as the words of the initial vocabulary set.
  • Correspondingly, the generating step may further include: extracting the words common to the obtained at least two first segmentation results; and step 2014 may include generating the initial vocabulary set based on the extracted words and the obtained first segmentation results.
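Extracting the words common to several segmentation results amounts to a set intersection. A minimal sketch; the tokens and the two model outputs are invented for illustration:

```python
def common_vocabulary(*segmentation_results):
    """Return the words appearing in every segmentation result
    (e.g. the outputs of a CRF-based and an HMM-based model)."""
    sets = [set(result) for result in segmentation_results]
    return set.intersection(*sets)

crf_result = ["今天", "天气", "很", "好"]
hmm_result = ["今天", "天", "气", "很", "好"]
print(common_vocabulary(crf_result, hmm_result))  # {'今天', '很', '好'} (order may vary)
```

Words agreed on by multiple models are more likely to be genuine vocabulary, which is presumably why the common words seed the initial vocabulary set.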
  • Step 2015 segment the preset text in the preset text set based on the initial vocabulary set to obtain a second segmentation result.
  • various methods can be used to segment the preset text in the preset text set to obtain the segmentation result, and the obtained segmentation result is determined as the second segmentation result.
  • a maximum forward matching algorithm, a maximum reverse matching algorithm, a minimum forward matching algorithm, a minimum reverse matching algorithm, etc. may be used to segment the preset text in the preset text set to obtain a segmentation result.
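As an example of the listed algorithms, the maximum reverse (backward) matching algorithm scans from the end of the text, taking at each step the longest dictionary entry that ends at the current position. A simplified sketch with an invented dictionary:

```python
def backward_max_match(text, dictionary, max_len=4):
    """Greedy maximum backward matching: at each step, take the longest
    dictionary entry ending at the current position; fall back to a
    single character."""
    words = []
    i = len(text)
    while i > 0:
        for length in range(min(max_len, i), 0, -1):
            candidate = text[i - length:i]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i -= length
                break
    return list(reversed(words))

dictionary = {"今天", "天气"}
print(backward_max_match("今天天气", dictionary))  # ['今天', '天气']
```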
  • the words in the second word segmentation result belong to the initial set of words, so the words in the second word segmentation result also include the first information.
  • step 2016, a preset vocabulary set is generated based on the initial vocabulary set and the obtained second word segmentation result.
  • the vocabulary in the preset vocabulary set includes first information and second information determined based on the obtained second word segmentation result.
  • a vocabulary may be selected from the initial vocabulary set as a vocabulary in a preset vocabulary set. Then, for each vocabulary in the selected vocabulary, determine the condition that each other vocabulary appears in the obtained second segmentation result as a condition, and the conditional probability that the vocabulary appears in the obtained second segmentation result (that is, in each other If a vocabulary appears in the obtained second segmentation result, the probability that the vocabulary appears in the obtained second segmentation result), and then the second information of the vocabulary is generated. Finally, a preset vocabulary set may be generated based on the selected vocabulary and the first and second information of the vocabulary. It can be understood that, since the vocabulary in the initial vocabulary set includes the first information, after the second information is determined, the vocabulary in the preset vocabulary set may include both the first information and the second information.
  • Specifically, all words in the initial vocabulary set may be directly taken as the words of the preset vocabulary set; or, only the words whose included first information indicates a probability greater than or equal to a preset threshold may be taken as the words of the preset vocabulary set.
  • the execution subject of the above-mentioned generating steps used to generate the preset vocabulary set may be the same as or different from the execution subject of the method for word segmentation. If they are the same, the execution subject of the above generating step for generating the preset vocabulary set may store the preset vocabulary set locally after obtaining the preset vocabulary set. If they are different, the execution subject of the above generating step for generating the preset vocabulary set may send the preset vocabulary set to the execution subject of the method for segmentation after obtaining the preset vocabulary set.
  • Step 202 Segment the text to be segmented based on a preset vocabulary set to obtain at least one vocabulary sequence.
  • the execution subject may perform word segmentation on the segmented text to obtain at least one vocabulary sequence.
  • the above-mentioned execution subject may use at least two preset methods, based on the preset vocabulary set, to segment the text to be segmented to obtain at least one vocabulary sequence. It should be noted that two different segmentation methods may yield the same vocabulary sequence, so the above-mentioned execution subject may obtain at least one vocabulary sequence from at least two preset methods.
  • the above-mentioned execution subject may also segment the text to be segmented through the following steps to obtain at least one vocabulary sequence: first, the above-mentioned execution subject may match the text to be segmented against a preset text format to determine whether the text to be segmented includes text matching the preset text format; then, in response to determining that it does, the above-mentioned execution subject may segment the text to be segmented based on the preset vocabulary set and the determined matching text to obtain at least one vocabulary sequence.
  • the vocabulary sequence includes the determined and matching text.
  • the preset text format is a format predetermined by a technician. The preset text format can be used to indicate text that meets preset rules.
  • the preset text format can be "x year y month z day", where x, y, z can be used to represent any number. Further, the preset text format may be used to indicate text representing a date (including a date of “year, month, and day”).
  • the preset text format is "x year y month z day".
  • the text to be segmented is "Today is September 6, 2018".
  • the above-mentioned execution subject can segment the text to be segmented through the following steps: first, the above-mentioned execution subject matches the text to be segmented, "Today is September 6, 2018", against the preset text format "x year y month z day" to obtain the matching text "September 6, 2018". Then, for the non-matching text "Today is", the above-mentioned execution subject may segment it based on the preset vocabulary set, obtaining, for example, the result "Today"; "is". Finally, the above-mentioned execution subject can use the matching text "September 6, 2018" as a vocabulary in the vocabulary sequence and combine it with the result "Today"; "is" to form the final vocabulary sequence "Today"; "is"; "September 6, 2018".
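The format-matching step above can be sketched with a regular expression; the pattern below (Arabic digits filling x, y, z in "x年y月z日") and the helper names are assumptions for illustration only:

```python
import re

# Hypothetical regex for the preset format "x year y month z day".
DATE_PATTERN = re.compile(r"\d{4}年\d{1,2}月\d{1,2}日")

def segment_with_format(text, segment_rest):
    """Keep spans matching the preset format as single tokens and
    segment the remaining text with the supplied function."""
    tokens, pos = [], 0
    for m in DATE_PATTERN.finditer(text):
        tokens.extend(segment_rest(text[pos:m.start()]))
        tokens.append(m.group())
        pos = m.end()
    tokens.extend(segment_rest(text[pos:]))
    return tokens

# A trivial character-level stand-in for dictionary-based
# segmentation of the non-matching text.
print(segment_with_format("今天是2018年9月6日", lambda s: list(s)))
# ['今', '天', '是', '2018年9月6日']
```

In practice `segment_rest` would be the vocabulary-set-based segmentation described elsewhere in this document.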
  • the above-mentioned execution subject may also segment the text to be segmented through the following steps to obtain at least one vocabulary sequence: first, the above-mentioned execution subject may perform named entity recognition on the text to be segmented to determine whether the text to be segmented includes a named entity. Then, in response to determining that it does, the above-mentioned execution subject may segment the text to be segmented based on the preset vocabulary set and the determined named entity to obtain at least one vocabulary sequence, wherein the vocabulary sequence includes the determined named entity.
  • named entities refer to the names of persons, institutions, places, and all other entities identified by names.
  • here, an entity refers to a vocabulary.
  • the above-mentioned execution subject may use various methods to perform named entity recognition on the text to be segmented. For example, a technician may establish a named entity set in advance, and the execution subject may then match the text to be segmented against the named entities in the named entity set to determine whether the text to be segmented includes a named entity; or the execution subject may use a pre-trained named entity recognition model to recognize the text to be segmented and determine whether it includes a named entity.
  • the named entity recognition model can be obtained by training based on various existing models (such as CRF, HMM, etc.) for performing language processing. It should be noted that the method of training to obtain a named entity recognition model is a well-known technique that is widely studied and applied at present, and is not repeated here.
  • the text to be segmented is "Today is Li Si's birthday”
  • the above-mentioned execution subject can segment the text to be segmented through the following steps: first, the above-mentioned execution subject can perform named entity recognition on the text to be segmented, "Today is Li Si's birthday", to obtain the named entity "Li Si". Then, for the remaining non-entity text, the above-mentioned execution subject can segment it based on the preset vocabulary set, obtaining, for example, the results "Today"; "is"; "'s"; "birthday".
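A hedged sketch of the entity-set-based variant described above: known named entities are kept as single tokens and the surrounding text is handed to another segmenter. Greedy left-to-right matching and the helper names are simplifying assumptions:

```python
def segment_with_entities(text, entities, segment_rest):
    """Keep known named entities as single tokens; segment the
    surrounding text with the supplied function. Longer entities
    are preferred when several match at the same position."""
    tokens, start, pos = [], 0, 0
    ordered = sorted(entities, key=len, reverse=True)
    while pos < len(text):
        hit = next((e for e in ordered if text.startswith(e, pos)), None)
        if hit:
            tokens.extend(segment_rest(text[start:pos]))
            tokens.append(hit)
            pos += len(hit)
            start = pos
        else:
            pos += 1
    tokens.extend(segment_rest(text[start:]))
    return tokens

# Character-level stand-in for the dictionary-based segmenter.
print(segment_with_entities("今天是李四的生日", {"李四"}, list))
# ['今', '天', '是', '李四', '的', '生', '日']
```

A model-based recognizer (e.g. CRF or HMM, as the document mentions) would replace the pre-built entity set with predicted entity spans.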
  • Step 203 For a vocabulary sequence in at least one vocabulary sequence, determine first and second information of the vocabulary in the vocabulary sequence, and determine a probability of the vocabulary sequence based on the determined first and second information.
  • the execution entity may determine the first information and the second information of the vocabulary in the vocabulary sequence, and based on the determined first information and The second information determines the probability of the vocabulary sequence.
  • the second information of the vocabulary is the second information determined based on the vocabulary adjacent to the vocabulary.
  • the vocabulary in a vocabulary sequence obtained based on the preset vocabulary set belongs to the preset vocabulary set
  • the vocabulary in the vocabulary sequence may include first information and second information.
  • the vocabulary in the preset vocabulary set may include multiple pieces of second information (each conditioned on the appearance of a different vocabulary); here, for a vocabulary in the vocabulary sequence, the second information of the vocabulary is the second information conditioned on the appearance of the vocabulary adjacent to it.
  • the second information of the vocabulary may be second information determined based on a vocabulary adjacent to the vocabulary and located before the vocabulary.
  • the execution subject may determine the second information of the vocabulary through the following steps: first, the execution subject may determine whether the vocabulary sequence includes a vocabulary adjacent to the vocabulary and located before the vocabulary; then, in response to determining that it does, the execution subject may determine the second information of the vocabulary based on the vocabulary adjacent to the vocabulary and located before the vocabulary.
  • the above-mentioned execution subject may further determine the preset second information as the second information of the vocabulary in response to determining that the vocabulary sequence does not include a vocabulary adjacent to the vocabulary and located before the vocabulary.
  • the preset second information includes a probability preset by a technician.
  • the execution subject may determine the probability of the vocabulary sequence by various methods based on the determined first information and second information. For example, for each vocabulary in the vocabulary sequence, the probability indicated by its first information and the probability indicated by its second information may first be summed to obtain the probability corresponding to that vocabulary; the probabilities corresponding to the vocabularies in the vocabulary sequence may then be summed, and the summation result obtained as the probability of the vocabulary sequence.
  • Step 204 Select a vocabulary sequence with the highest probability from at least one vocabulary sequence as a segmentation result.
  • the execution subject may select a vocabulary sequence with the highest probability from the at least one vocabulary sequence as a word segmentation result.
  • the execution subject may directly determine the vocabulary sequence as a word segmentation result.
  • the foregoing execution body may further perform the following steps:
  • the execution body can obtain a preset candidate vocabulary set.
  • the vocabulary in the candidate vocabulary set is used to represent, but is not limited to, at least one of the following: a movie name, a TV series name, and a music name.
  • the execution body may match the segmentation result in step 204 with the vocabulary in the candidate vocabulary set to determine whether the segmentation result includes a phrase that matches the vocabulary in the candidate vocabulary set.
  • the phrase includes at least two words adjacent to each other.
  • the above-mentioned execution subject may determine the matching phrase as a new vocabulary, and generate a new segmentation result including the new vocabulary.
  • the segmentation results are "I”; “Like”; “Fate”; “Symphony”.
  • the candidate vocabulary set includes the music name "Symphony of Destiny". After the above-mentioned execution subject matches the word segmentation result "I"; "Like"; "Fate"; "Symphony" against the candidate vocabulary set, it can determine that the word segmentation result includes the matching phrase "Fate"; "Symphony". Therefore, the above-mentioned execution subject can determine the matching phrase "Fate"; "Symphony" as the new vocabulary "Symphony of Destiny" and generate the new word segmentation result "I"; "Like"; "Symphony of Destiny".
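The phrase-merging step above can be sketched as follows; the function name, the concatenation-based matching, and the span limit are assumptions of this illustration:

```python
def merge_candidates(tokens, candidates, max_span=3):
    """Merge runs of adjacent tokens whose concatenation appears in
    the candidate vocabulary set, preferring the longest run."""
    merged, i = [], 0
    while i < len(tokens):
        for size in range(min(max_span, len(tokens) - i), 1, -1):
            phrase = "".join(tokens[i:i + size])
            if phrase in candidates:
                merged.append(phrase)  # matched phrase becomes one new word
                i += size
                break
        else:
            merged.append(tokens[i])  # no candidate matched here
            i += 1
    return merged

print(merge_candidates(["我", "喜欢", "命运", "交响曲"], {"命运交响曲"}))
# ['我', '喜欢', '命运交响曲']
```

This mirrors the "Fate" + "Symphony" → "Symphony of Destiny" example: two adjacent words are merged into one candidate-set word in the new segmentation result.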
  • FIG. 3 is a schematic diagram of an application scenario of the method for word segmentation according to this embodiment.
  • the server 301 first obtains the text to be segmented, "Nanjing Yangtze River Bridge" 303, from the terminal 302 with which it is communicatively connected, and obtains a preset vocabulary set 304 locally.
  • the preset vocabulary set is a vocabulary set generated in advance based on the preset text set.
  • the vocabulary in the preset vocabulary set includes first information and second information.
  • the first information is used to represent a probability that a vocabulary appears in a preset text set.
  • the second information is used to characterize the conditional probability that the vocabulary appears in the preset text set with the appearance of a vocabulary other than the vocabulary.
  • the server 301 can segment the text to be segmented 303 based on the preset vocabulary set 304 to obtain a vocabulary sequence 3051 (for example, "Nanjing"; "Yangtze River"; "Bridge") and a vocabulary sequence 3052 (for example, "Nanjing"; "Yangtze River Bridge").
  • for the vocabulary sequence 3051, the server 301 may determine the first information and the second information of the vocabulary in the vocabulary sequence, and based on the determined first information and second information, determine a probability 3061 (for example, 50%) of the vocabulary sequence.
  • likewise, for the vocabulary sequence 3052, the server 301 may determine the first information and the second information of the vocabulary in the vocabulary sequence, and determine a probability 3062 (for example, 60%) of the vocabulary sequence based on the determined first information and second information.
  • the second information of the vocabulary is the second information determined based on the vocabulary adjacent to the vocabulary.
  • the server 301 may select the vocabulary sequence 3052 as the segmentation result 307.
  • the method provided by the foregoing embodiment of the present application effectively uses the first information and the second information of the vocabulary to determine the segmentation result, and improves the accuracy of the segmentation.
  • FIG. 4 a flowchart 400 of yet another embodiment of a method for word segmentation is shown.
  • the process 400 of the method for word segmentation includes the following steps:
  • Step 401 Obtain a preset vocabulary set and text to be segmented.
  • an execution subject (for example, the server shown in FIG. 1) may obtain the preset vocabulary set and the text to be segmented from a terminal (for example, the terminal device shown in FIG. 1) with which it is communicatively connected, or locally.
  • the text to be segmented is the text on which word segmentation is to be performed, and may be a phrase, a sentence, or an article including vocabularies.
  • the preset vocabulary set is a vocabulary set for word segmentation.
  • the preset vocabulary set may be generated in advance based on the preset text set.
  • the preset text is a text predetermined by a technician for obtaining a vocabulary set for word segmentation.
  • Step 402 Segment the text to be segmented based on a preset vocabulary set to obtain at least one vocabulary sequence.
  • the above-mentioned execution subject may perform segmentation on the segmented text to obtain at least one vocabulary sequence.
  • Step 403 For the vocabulary sequence in at least one vocabulary sequence, perform the following steps: determine the first information and the second information of the vocabulary in the vocabulary sequence; connect two adjacent vocabularies in the vocabulary sequence to generate a word segmentation Path; based on the first information and the second information of the vocabulary in the vocabulary sequence, determine the weight of the edge of the word segmentation path; based on the determined weight, determine the probability of the vocabulary sequence.
  • the above-mentioned execution subject may perform the following steps:
  • Step 4031 Determine the first information and the second information of the vocabulary in the vocabulary sequence.
  • this step is the same as the method for determining the first information and the second information of the vocabulary in the vocabulary sequence in step 203 in the embodiment corresponding to FIG. 2, and details are not described herein again.
  • Step 4032 Connect two adjacent words in the vocabulary sequence to generate a segmentation path.
  • the nodes of the word segmentation path are represented by words in the vocabulary sequence, and the edges of the word segmentation path are lines for connecting the words.
  • the vocabulary sequence is "Nanjing"; "Yangtze River"; "Bridge", and the corresponding word segmentation path can be "Nanjing - Yangtze River - Bridge". It can be understood that the word segmentation path here is a virtual path used to characterize the word segmentation process.
  • Step 4033 Determine the weight of the edge of the segmentation path based on the first information and the second information of the vocabulary in the vocabulary sequence.
  • the edge weight of the segmentation path is used to represent the importance of the segmentation manner represented by the edge.
  • the segmentation manner represented by an edge refers to the manner of segmenting the two words connected by that edge.
  • determining the weight of an edge of the word segmentation path specifically refers to determining the weight of the edge based on the probability indicated by the first information and the probability indicated by the second information of the vocabulary in the vocabulary sequence.
  • the execution subject may adopt various methods based on the probability indicated by the first information and the probability indicated by the second information of the two words connected by the edge. Determine the weight of the edge.
  • the second information of the later-ranked vocabulary of the two vocabularies is the second information conditioned on the earlier-ranked vocabulary.
  • the probability indicated by the first information of the later-ranked vocabulary of the two vocabularies may be summed with the probability indicated by the second information of the later-ranked vocabulary to obtain a summation result, and the summation result may be determined as the weight of the edge.
  • the weight of the edge may also be determined using the following formula: weight = log p(w_i) + log p(w_i | w_(i-1)), where weight is used to represent the weight of the edge; w_(i-1) is used to represent the earlier-ranked of the two words connected by the edge; w_i is used to represent the later-ranked of the two words connected by the edge; log is the logarithm operator; p(w_i) is used to represent the probability indicated by the first information of the later-ranked vocabulary; and p(w_i | w_(i-1)) is used to represent the conditional probability indicated by the second information of the later-ranked vocabulary, conditioned on the earlier-ranked vocabulary.
  • Step 4034 Determine the probability of the vocabulary sequence based on the determined weight.
  • the above-mentioned execution subject may use various methods to determine the probability of the vocabulary sequence based on the determined weights. For example, the weights of the edges in the word segmentation path generated from the vocabulary sequence may be summed, and the summation result determined as the probability of the vocabulary sequence; or the weights of the determined edges and the probabilities indicated by the first information of each vocabulary in the word segmentation path may be summed together, and the summation result determined as the probability of the vocabulary sequence.
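A minimal sketch of scoring a vocabulary sequence by summing log-form edge weights, following weight = log p(w_i) + log p(w_i | w_(i-1)); the smoothing fallback for unseen events is an assumption of this sketch, not part of the document:

```python
import math

def edge_weight(prev_word, word, first_info, second_info, fallback=1e-8):
    """Weight of the edge prev_word -> word:
    log p(word) + log p(word | prev_word). Unseen words or word
    pairs fall back to a small probability (an added assumption)."""
    p_uni = first_info.get(word, fallback)
    p_cond = second_info.get((prev_word, word), fallback)
    return math.log(p_uni) + math.log(p_cond)

def sequence_score(sequence, first_info, second_info):
    """Score a vocabulary sequence by summing its edge weights."""
    return sum(edge_weight(prev, word, first_info, second_info)
               for prev, word in zip(sequence, sequence[1:]))

# Hypothetical statistics for the "Nanjing Yangtze River Bridge" example.
first_info = {"南京": 0.3, "长江大桥": 0.2, "长江": 0.1, "大桥": 0.1}
second_info = {("南京", "长江大桥"): 0.5,
               ("南京", "长江"): 0.1,
               ("长江", "大桥"): 0.2}
print(sequence_score(["南京", "长江大桥"], first_info, second_info) >
      sequence_score(["南京", "长江", "大桥"], first_info, second_info))
# True
```

Selecting the word segmentation result then amounts to taking the argmax of `sequence_score` over the candidate vocabulary sequences.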
  • Step 404 Select a vocabulary sequence with the highest probability from at least one vocabulary sequence as a segmentation result.
  • the execution subject may select the vocabulary sequence with the highest probability from the at least one vocabulary sequence as the segmentation result.
  • steps 401, 402, and 404 are consistent with steps 201, 202, and 204 in the foregoing embodiment; the descriptions of steps 201, 202, and 204 also apply to steps 401, 402, and 404 and will not be repeated here.
  • the process 400 of the method for word segmentation in this embodiment highlights generating a word segmentation path based on the obtained vocabulary sequence and determining the weights of the edges of the word segmentation path, thereby determining the probability of the vocabulary sequence.
  • this application provides an embodiment of a device for word segmentation.
  • the device embodiment corresponds to the method embodiment shown in FIG. 2.
  • specifically, the device can be applied to various electronic devices.
  • the apparatus 500 for word segmentation in this embodiment includes a first obtaining unit 501, a text word segmentation unit 502, a probability determination unit 503, and a sequence selection unit 504.
  • the first obtaining unit 501 is configured to obtain a preset vocabulary set and a text to be segmented, where the preset vocabulary set is a vocabulary set generated in advance based on a preset text set, the vocabulary in the preset vocabulary set includes first information and second information, the first information is used to characterize the probability of the vocabulary appearing in the preset text set, and, for a vocabulary in the preset vocabulary set, the second information is used to characterize the conditional probability that the vocabulary appears in the preset text set, conditioned on the appearance of a vocabulary other than that vocabulary; the text segmentation unit 502 is configured to segment the text to be segmented based on the preset vocabulary set to obtain at least one vocabulary sequence; the probability determination unit 503 is configured to, for a vocabulary sequence in the at least one vocabulary sequence, determine the first information and the second information of the vocabulary in the vocabulary sequence and determine the probability of the vocabulary sequence based on the determined first information and second information; and the sequence selection unit 504 is configured to select a vocabulary sequence with the highest probability from the at least one vocabulary sequence as the segmentation result.
  • the first obtaining unit 501 of the apparatus 500 for word segmentation may obtain the preset vocabulary set and the text to be segmented from a terminal (such as the terminal device shown in FIG. 1) communicatively connected thereto through a wired or wireless connection, or locally.
  • the text to be segmented is the text on which word segmentation is to be performed, and may be a phrase, a sentence, or an article including vocabularies.
  • the preset vocabulary set is a vocabulary set for word segmentation.
  • the preset vocabulary set may be generated in advance based on the preset text set.
  • the preset text is a text predetermined by a technician for obtaining a vocabulary set for word segmentation.
  • the text segmentation unit 502 may segment the text to be segmented to obtain at least one vocabulary sequence.
  • the probability determination unit 503 may determine first information and second information of the vocabulary in the vocabulary sequence, and based on the determined first information Information and second information to determine the probability of the vocabulary sequence.
  • the second information of the vocabulary is the second information determined based on the vocabulary adjacent to the vocabulary.
  • the sequence selection unit 504 may select the vocabulary sequence with the highest probability from the at least one vocabulary sequence as the segmentation result.
  • the probability determination unit 503 may include: a path generation module (not shown in the figure) configured to connect two adjacent words in the vocabulary sequence to generate a word segmentation Path, where the nodes of the word segmentation path are characterized by words in the vocabulary sequence, and the edges of the word segmentation path are lines for connecting words; the weight determination module (not shown in the figure) is configured to be based on the words in the word sequence The first information and the second information determine the weights of the edges of the segmentation path; the probability determination module (not shown in the figure) is configured to determine the probability of the vocabulary sequence based on the determined weights.
  • the second information of the vocabulary is second information determined based on the vocabulary adjacent to the vocabulary and located before the vocabulary.
  • the probability determination unit 503 may be further configured to perform the following steps for a vocabulary in the vocabulary sequence: determine whether the vocabulary sequence includes a vocabulary adjacent to the vocabulary and located before the vocabulary; and, in response to determining that it does, determine the second information of the vocabulary based on the vocabulary adjacent to the vocabulary and located before the vocabulary.
  • the preset vocabulary set is obtained through the following generating steps: obtaining a preset text set and sample word segmentation results pre-labeled for the preset texts in the preset text set; using the preset texts in the preset text set as input and the sample word segmentation results corresponding to the input preset texts as desired output, training a word segmentation model by a machine learning method; segmenting the preset texts in the preset text set using the word segmentation model to obtain a first word segmentation result; generating an initial vocabulary set based on the obtained first word segmentation result, wherein the words in the initial vocabulary set include first information determined based on the obtained first word segmentation result; segmenting the preset texts in the preset text set based on the initial vocabulary set to obtain a second word segmentation result; and generating the preset vocabulary set based on the initial vocabulary set and the obtained second word segmentation result, wherein the words in the preset vocabulary set include the first information and second information determined based on the obtained second word segmentation result.
  • training to obtain a word segmentation model includes training at least two predetermined initial models to obtain at least two word segmentation models; and segmenting the preset texts in the preset text set using the word segmentation models to obtain the first word segmentation result includes segmenting the preset texts in the preset text set using the at least two word segmentation models to obtain at least two first word segmentation results.
  • the text segmentation unit 502 may include: a text matching module (not shown in the figure) configured to match the text to be segmented against a preset text format to determine whether the text to be segmented includes text matching the preset text format; and a first word segmentation module (not shown in the figure) configured to, in response to determining that it does, segment the text to be segmented based on the preset vocabulary set and the determined matching text to obtain at least one vocabulary sequence, where the vocabulary sequence includes the determined matching text.
  • the text segmentation unit 502 may include: a text recognition module (not shown in the figure) configured to perform named entity recognition on the text to be segmented to determine whether the text to be segmented includes a named entity; and a second word segmentation module (not shown in the figure) configured to, in response to determining that it does, segment the text to be segmented based on the preset vocabulary set and the determined named entity to obtain at least one vocabulary sequence, wherein the vocabulary sequence includes the determined named entity.
  • the apparatus 500 may further include: a second obtaining unit (not shown in the figure) configured to obtain a preset candidate vocabulary set, wherein the words in the candidate vocabulary set are used to characterize at least one of the following: a movie name, a TV series name, and a music name; a vocabulary matching unit (not shown in the figure) configured to match the word segmentation result against the words in the candidate vocabulary set to determine whether the word segmentation result includes a phrase matching a word in the candidate vocabulary set, wherein the phrase includes at least two adjacent words; and a result generating unit (not shown in the figure) configured to determine the matching phrase as a new vocabulary and generate a new word segmentation result including the new vocabulary.
  • the apparatus 500 provided by the foregoing embodiment of the present application effectively uses the first information and the second information of a vocabulary to determine a segmentation result, and improves the accuracy of the segmentation.
  • FIG. 6 illustrates a schematic structural diagram of a computer system 600 suitable for implementing an electronic device (such as the terminal device / server shown in FIG. 1) in the embodiment of the present application.
  • the terminal device / server shown in FIG. 6 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present application.
  • the computer system 600 includes a central processing unit (CPU) 601, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage portion 608 into a random access memory (RAM) 603.
  • in the RAM 603, various programs and data required for the operation of the system 600 are also stored.
  • the CPU 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604.
  • An input / output (I / O) interface 605 is also connected to the bus 604.
  • the following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage portion 608 including a hard disk and the like; and a communication portion 609 including a network interface card such as a LAN card, a modem, and the like.
  • the communication section 609 performs communication processing via a network such as the Internet.
  • the driver 610 is also connected to the I / O interface 605 as necessary.
  • a removable medium 611 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is installed on the drive 610 as needed, so that a computer program read therefrom is installed into the storage section 608 as needed.
  • the process described above with reference to the flowchart may be implemented as a computer software program.
  • embodiments of the present disclosure include a computer program product including a computer program carried on a computer-readable medium, the computer program containing program code for performing a method shown in a flowchart.
  • the computer program may be downloaded and installed from a network through the communication portion 609, and / or installed from a removable medium 611.
  • the computer-readable medium described in this application may be a computer-readable signal medium or a computer-readable storage medium or any combination of the foregoing.
  • the computer-readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal carried in baseband or propagated as part of a carrier wave, which carries computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the foregoing.
  • the computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium; the computer-readable medium may send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
  • each block in the flowchart or block diagram may represent a module, a program segment, or a part of code, which contains one or more executable instructions for implementing the specified logical function.
  • the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved.
  • each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified function or operation, or can be implemented with a combination of dedicated hardware and computer instructions.
  • a processor includes a first acquisition unit, a text word segmentation unit, a probability determination unit, and a sequence selection unit.
  • the names of these units do not constitute a limitation on the unit itself in some cases.
  • a text word segmentation unit can also be described as "a unit for segmenting the text to be segmented".
  • the present application also provides a computer-readable medium, which may be included in the electronic device described in the foregoing embodiments, or may exist alone without being assembled into the electronic device.
  • the computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device: obtains a preset vocabulary set and a text to be segmented, where the preset vocabulary set is a vocabulary set pre-generated based on a preset text set.
  • a vocabulary in the preset vocabulary set includes first information and second information.
  • the first information is used to characterize a probability that the vocabulary appears in the preset text set.
  • for a vocabulary in the preset vocabulary set, the second information is used to characterize a conditional probability that the vocabulary appears in the preset text set, given the appearance of a vocabulary other than that vocabulary.
  • the electronic device further: segments the text to be segmented based on the preset vocabulary set to obtain at least one vocabulary sequence; for a vocabulary sequence in the at least one vocabulary sequence, determines first information and second information of the vocabularies in the vocabulary sequence, and determines a probability of the vocabulary sequence based on the determined first information and second information, where, for a vocabulary in a vocabulary sequence, the second information of the vocabulary is second information determined based on a vocabulary adjacent to that vocabulary; and selects, from the at least one vocabulary sequence, the vocabulary sequence with the largest probability as a segmentation result.

Abstract

一种用于分词的方法和装置(500),该方法包括:获取预设词汇集合和待分词文本(201),其中,预设词汇集合为基于预设文本集合预先生成的词汇集合,预设词汇集合中的词汇包括第一信息和第二信息;基于预设词汇集合,对待分词文本进行分词,获得至少一个词汇序列(202);对于至少一个词汇序列中的词汇序列,确定该词汇序列中的词汇的第一信息和第二信息,以及基于所确定的第一信息和第二信息,确定该词汇序列的概率(203),其中,对于词汇序列中的词汇,该词汇的第二信息为基于与该词汇相邻的词汇确定出的第二信息;从至少一个词汇序列中选取概率最大的词汇序列作为分词结果(204)。该方法和装置提高了分词的准确性。

Description

用于分词的方法和装置
本专利申请要求于2018年9月14日提交的、申请号为201811076566.7、申请人为北京字节跳动网络技术有限公司、发明名称为“用于分词的方法和装置”的中国专利申请的优先权,该申请的全文以引用的方式并入本申请中。
技术领域
本申请实施例涉及计算机技术领域,尤其涉及用于分词的方法和装置。
背景技术
通常,分词指的是中文分词。通过分词,可以将一个汉字序列切分成一个或多个词语。
分词是文本挖掘的基础。通过分词,可以使计算机自动识别语句含义。在这里,这种通过分词,使得计算机自动识别语句含义的方法又叫做机械分词方法,它的主要原理是按照一定的策略将待分析汉字串与预先设置的机器词典中的词条进行匹配,以确定出待分析汉字串所对应的目标词条。
发明内容
本申请实施例提出了用于分词的方法和装置。
第一方面,本申请实施例提供了一种用于分词的方法,该方法包括:获取预设词汇集合和待分词文本,其中,预设词汇集合为基于预设文本集合预先生成的词汇集合,预设词汇集合中的词汇包括第一信息和第二信息,第一信息用于表征词汇在预设文本集合中出现的概率,对于预设词汇集合中的词汇,第二信息用于表征在预设文本集合中,以除该词汇以外的词汇出现作为条件,该词汇出现的条件概率;基于 预设词汇集合,对待分词文本进行分词,获得至少一个词汇序列;对于至少一个词汇序列中的词汇序列,确定该词汇序列中的词汇的第一信息和第二信息,以及基于所确定的第一信息和第二信息,确定该词汇序列的概率,其中,对于词汇序列中的词汇,该词汇的第二信息为基于与该词汇相邻的词汇确定出的第二信息;从至少一个词汇序列中选取概率最大的词汇序列作为分词结果。
在一些实施例中,基于所确定的第一信息和第二信息,确定该词汇序列的概率,包括:对该词汇序列中相邻的两个词汇进行连线,生成分词路径,其中,分词路径的节点由该词汇序列中的词汇表征,分词路径的边为用于连接词汇的线;基于该词汇序列中的词汇的第一信息和第二信息,确定分词路径的边的权重;基于所确定的权重,确定该词汇序列的概率。
在一些实施例中,对于词汇序列中的词汇,该词汇的第二信息为基于与该词汇相邻,且位于该词汇之前的词汇确定出的第二信息。
在一些实施例中,确定该词汇序列中的词汇的第二信息,包括:对于该词汇序列中的词汇,执行以下步骤:确定该词汇序列是否包括与该词汇相邻,且位于该词汇之前的词汇;响应于确定包括,基于与该词汇相邻,且位于该词汇之前的词汇,确定该词汇的第二信息。
在一些实施例中,预设词汇集合通过以下生成步骤获得:获取预设文本集合和针对预设文本集合中的预设文本预先标注的样本分词结果;将预设文本集合中的预设文本作为输入,将所输入的预设文本所对应的样本分词结果作为期望输出,利用机器学习方法,训练得到分词模型;利用分词模型对预设文本集合中的预设文本进行分词,获得第一分词结果;基于所获得的第一分词结果,生成初始词汇集合,其中,初始词汇集合中的词汇包括基于所获得的第一分词结果确定出的第一信息;基于初始词汇集合,对预设文本集合中的预设文本进行分词,获得第二分词结果;基于初始词汇集合和所获得的第二分词结果,生成预设词汇集合,其中,预设词汇集合中的词汇包括第一信息和基于所获得的第二分词结果确定出的第二信息。
在一些实施例中,训练得到分词模型,包括:对预先确定的至少 两个初始模型进行训练,得到至少两个分词模型;以及利用分词模型对预设文本集合中的预设文本进行分词,获得第一分词结果,包括:利用至少两个分词模型对预设文本集合中的预设文本进行分词,获得至少两个第一分词结果。
在一些实施例中,在基于所获得的第一分词结果,生成初始词汇集合之前,生成步骤还包括:从所获得的至少两个第一分词结果中提取相同的词汇;以及基于所获得的第一分词结果,生成初始词汇集合,包括:基于所提取的词汇和所获得的第一分词结果,生成初始词汇集合。
在一些实施例中,对待分词文本进行分词,获得至少一个词汇序列,包括:对待分词文本和预设文本格式进行匹配,以确定待分词文本是否包括与预设文本格式相匹配的文本;响应于确定包括,基于预设词汇集合和所确定的、相匹配的文本,对待分词文本进行分词,获得至少一个词汇序列,其中,词汇序列包括所确定的、相匹配的文本。
在一些实施例中,对待分词文本进行分词,获得至少一个词汇序列,包括:对待分词文本进行命名实体识别,以确定待分词文本是否包括命名实体;响应于确定包括,基于预设词汇集合和所确定的命名实体,对待分词文本进行分词,获得至少一个词汇序列,其中,词汇序列包括所确定的命名实体。
在一些实施例中,在从至少一个词汇序列中选取概率最大的词汇序列作为分词结果之后,该方法还包括:获取预设的候选词汇集合,其中,候选词汇集合中的词汇用于表征以下至少一项:电影名称、电视剧名称、音乐名称;对分词结果和候选词汇集合中的词汇进行匹配,以确定分词结果是否包括与候选词汇集合中的词汇相匹配的词组,其中,词组包括相邻的至少两个词汇;响应于确定包括,将相匹配的词组确定为新的词汇,以及生成包括新的词汇的新的分词结果。
第二方面，本申请提供了一种用于分词的装置，该装置包括：第一获取单元，被配置成获取预设词汇集合和待分词文本，其中，预设词汇集合为基于预设文本集合预先生成的词汇集合，预设词汇集合中的词汇包括第一信息和第二信息，第一信息用于表征词汇在预设文本集合中出现的概率，对于预设词汇集合中的词汇，第二信息用于表征在预设文本集合中，以除该词汇以外的词汇出现作为条件，该词汇出现的条件概率；文本分词单元，被配置成基于预设词汇集合，对待分词文本进行分词，获得至少一个词汇序列；概率确定单元，被配置成对于至少一个词汇序列中的词汇序列，确定该词汇序列中的词汇的第一信息和第二信息，以及基于所确定的第一信息和第二信息，确定该词汇序列的概率，其中，对于词汇序列中的词汇，该词汇的第二信息为基于与该词汇相邻的词汇确定出的第二信息；序列选取单元，被配置成从至少一个词汇序列中选取概率最大的词汇序列作为分词结果。
在一些实施例中,概率确定单元包括:路径生成模块,被配置成对该词汇序列中相邻的两个词汇进行连线,生成分词路径,其中,分词路径的节点由该词汇序列中的词汇表征,分词路径的边为用于连接词汇的线;权重确定模块,被配置成基于该词汇序列中的词汇的第一信息和第二信息,确定分词路径的边的权重;概率确定模块,被配置成基于所确定的权重,确定该词汇序列的概率。
在一些实施例中,对于词汇序列中的词汇,该词汇的第二信息为基于与该词汇相邻,且位于该词汇之前的词汇确定出的第二信息。
在一些实施例中,概率确定单元进一步被配置成:对于该词汇序列中的词汇,执行以下步骤:确定该词汇序列是否包括与该词汇相邻,且位于该词汇之前的词汇;响应于确定包括,基于与该词汇相邻,且位于该词汇之前的词汇,确定该词汇的第二信息。
在一些实施例中,预设词汇集合通过以下生成步骤获得:获取预设文本集合和针对预设文本集合中的预设文本预先标注的样本分词结果;将预设文本集合中的预设文本作为输入,将所输入的预设文本所对应的样本分词结果作为期望输出,利用机器学习方法,训练得到分词模型;利用分词模型对预设文本集合中的预设文本进行分词,获得第一分词结果;基于所获得的第一分词结果,生成初始词汇集合,其中,初始词汇集合中的词汇包括基于所获得的第一分词结果确定出的第一信息;基于初始词汇集合,对预设文本集合中的预设文本进行分词,获得第二分词结果;基于初始词汇集合和所获得的第二分词结果, 生成预设词汇集合,其中,预设词汇集合中的词汇包括第一信息和基于所获得的第二分词结果确定出的第二信息。
在一些实施例中,训练得到分词模型,包括:对预先确定的至少两个初始模型进行训练,得到至少两个分词模型;以及利用分词模型对预设文本集合中的预设文本进行分词,获得第一分词结果,包括:利用至少两个分词模型对预设文本集合中的预设文本进行分词,获得至少两个第一分词结果。
在一些实施例中,在基于所获得的第一分词结果,生成初始词汇集合之前,生成步骤还包括:从所获得的至少两个第一分词结果中提取相同的词汇;以及基于所获得的第一分词结果,生成初始词汇集合,包括:基于所提取的词汇和所获得的第一分词结果,生成初始词汇集合。
在一些实施例中,文本分词单元包括:文本匹配模块,被配置成对待分词文本和预设文本格式进行匹配,以确定待分词文本是否包括与预设文本格式相匹配的文本;第一分词模块,被配置成响应于确定包括,基于预设词汇集合和所确定的、相匹配的文本,对待分词文本进行分词,获得至少一个词汇序列,其中,词汇序列包括所确定的、相匹配的文本。
在一些实施例中,文本分词单元包括:文本识别模块,被配置成对待分词文本进行命名实体识别,以确定待分词文本是否包括命名实体;第二分词模块,被配置成响应于确定包括,基于预设词汇集合和所确定的命名实体,对待分词文本进行分词,获得至少一个词汇序列,其中,词汇序列包括所确定的命名实体。
在一些实施例中,该装置还包括:第二获取单元,被配置成获取预设的候选词汇集合,其中,候选词汇集合中的词汇用于表征以下至少一项:电影名称、电视剧名称、音乐名称;词汇匹配单元,被配置成对分词结果和候选词汇集合中的词汇进行匹配,以确定分词结果是否包括与候选词汇集合中的词汇相匹配的词组,其中,词组包括相邻的至少两个词汇;结果生成单元,被配置成响应于确定包括,将相匹配的词组确定为新的词汇,以及生成包括新的词汇的新的分词结果。
第三方面,本申请实施例提供了一种电子设备,包括:一个或多个处理器;存储装置,其上存储有一个或多个程序,当一个或多个程序被一个或多个处理器执行,使得一个或多个处理器实现上述用于分词的方法中任一实施例的方法。
第四方面,本申请实施例提供了一种计算机可读介质,其上存储有计算机程序,该程序被处理器执行时实现上述用于分词的方法中任一实施例的方法。
本申请实施例提供的用于分词的方法和装置,通过获取预设词汇集合和待分词文本,其中,预设词汇集合为基于预设文本集合预先生成的词汇集合,预设词汇集合中的词汇包括第一信息和第二信息,第一信息用于表征词汇在预设文本集合中出现的概率,对于预设词汇集合中的词汇,第二信息用于表征在预设文本集合中,以除该词汇以外的词汇出现作为条件,该词汇出现的条件概率,而后基于预设词汇集合,对待分词文本进行分词,获得至少一个词汇序列,接着对于至少一个词汇序列中的词汇序列,确定该词汇序列中的词汇的第一信息和第二信息,以及基于所确定的第一信息和第二信息,确定该词汇序列的概率,其中,对于词汇序列中的词汇,该词汇的第二信息为基于与该词汇相邻的词汇确定出的第二信息,最后从至少一个词汇序列中选取概率最大的词汇序列作为分词结果,从而对于待分词文本中的词汇,有效利用了词汇的第一信息和第二信息来确定分词结果,提高了分词的准确性。
附图说明
通过阅读参照以下附图所作的对非限制性实施例所作的详细描述,本申请的其它特征、目的和优点将会变得更明显:
图1是本申请的一个实施例可以应用于其中的示例性系统架构图;
图2是根据本申请的用于分词的方法的一个实施例的流程图;
图3是根据本申请实施例的用于分词的方法的一个应用场景的示意图;
图4是根据本申请的用于分词的方法的又一个实施例的流程图;
图5是根据本申请的用于分词的装置的一个实施例的结构示意图;
图6是适于用来实现本申请实施例的电子设备的计算机系统的结构示意图。
具体实施方式
下面结合附图和实施例对本申请作进一步的详细说明。可以理解的是,此处所描述的具体实施例仅仅用于解释相关发明,而非对该发明的限定。另外还需要说明的是,为了便于描述,附图中仅示出了与有关发明相关的部分。
需要说明的是,在不冲突的情况下,本申请中的实施例及实施例中的特征可以相互组合。下面将参考附图并结合实施例来详细说明本申请。
图1示出了可以应用本申请的用于分词的方法或用于分词的装置的实施例的示例性系统架构100。
如图1所示,系统架构100可以包括终端设备101、102、103,网络104和服务器105。网络104用以在终端设备101、102、103和服务器105之间提供通信链路的介质。网络104可以包括各种连接类型,例如有线、无线通信链路或者光纤电缆等等。
用户可以使用终端设备101、102、103通过网络104与服务器105交互,以接收或发送消息等。终端设备101、102、103上可以安装有各种通讯客户端应用,例如语言处理软件、网页浏览器应用、搜索类应用、即时通信工具、邮箱客户端、社交平台软件等。
终端设备101、102、103可以是硬件,也可以是软件。当终端设备101、102、103为硬件时,可以是具有显示屏的各种电子设备,包括但不限于智能手机、平板电脑、电子书阅读器、MP3播放器(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)、MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器、膝上型便携计算机和台式计算机 等等。当终端设备101、102、103为软件时,可以安装在上述所列举的电子设备中。其可以实现成多个软件或软件模块(例如用来提供分布式服务的多个软件或软件模块),也可以实现成单个软件或软件模块。在此不做具体限定。
服务器105可以是提供各种服务的服务器,例如对终端设备101、102、103发送的待分词文本进行分词的文本处理服务器。文本处理服务器可以对接收到的待分词文本等数据进行分析等处理,获得处理结果(例如分词结果)。
需要说明的是,服务器可以是硬件,也可以是软件。当服务器为硬件时,可以实现成多个服务器组成的分布式服务器集群,也可以实现成单个服务器。当服务器为软件时,可以实现成多个软件或软件模块(例如用来提供分布式服务的多个软件或软件模块),也可以实现成单个软件或软件模块。在此不做具体限定。
应该理解,图1中的终端设备、网络和服务器的数目仅仅是示意性的。根据实现需要,可以具有任意数目的终端设备、网络和服务器。在待分词文本或者生成待分词文本所对应的分词结果的过程中所使用的数据不需要从远程获取的情况下,上述系统架构可以不包括网络,而只包括终端设备或服务器。
继续参考图2,示出了根据本申请的用于分词的方法的一个实施例的流程200。该用于分词的方法,包括以下步骤:
步骤201,获取预设词汇集合和待分词文本。
在本实施例中,用于分词的方法的执行主体(例如图1所示的服务器)可以通过有线连接方式或者无线连接方式从与之通信连接的终端(例如图1所示的终端设备)或者本地获取预设词汇集合和待分词文本。其中,待分词文本为待对其进行分词的文本,可以为包括词汇的短语、句子或者文章等。
预设词汇集合为用于分词的词汇集合。预设词汇集合可以基于预设文本集合预先生成。预设文本为技术人员预先确定的、用于获得用于分词的词汇集合的文本。例如用户输入的搜索词（搜索词为用于搜索的词汇、短语或者句子）、网站中发表的文章、报纸中的新闻等。预设词汇集合中的词汇包括第一信息和第二信息。第一信息用于表征词汇在预设文本集合中出现的概率，可以包括但不限于以下至少一项：文字、数字、符号。对于预设词汇集合中的词汇，该词汇的第二信息用于表征在预设文本集合中，以除该词汇以外的词汇出现作为条件，该词汇出现的条件概率，可以包括但不限于以下至少一项：文字、数字、符号。
作为示例,预设文本集合包括两个预设文本,分别为:“今日天气”;“今日的阳光让我的心情都阳光起来”。基于预设文本集合得到的预设词汇集合可以包括以下词汇:“今日”;“天气”;“阳光”;“心情”。
首先分析第一信息,对于预设词汇集合中的词汇“今日”,可以看出,两个预设文本中都包括“今日”,故“今日”所对应的第一信息可以为“一:100%”;对于词汇“天气”,可以看出,只有第一个预设文本中包括“天气”,故“天气”所对应的第一信息可以为“一:50%”;对于词汇“阳光”,可以看出,只有第二个预设文本中包括“阳光”,故“阳光”所对应的第一信息可以为“一:50%”;对于词汇“心情”,可以看出,只有第二个预设文本中包括“心情”,故“心情”所对应的第一信息可以为“一:50%”。需要说明的是,对于词汇“阳光”,虽然该词汇出现了两次,但是均出现在了第二个预设文本中,而未出现在第一个预设文本中,故该词汇的第一信息为“一:50%”。
接着分析第二信息,对于词汇“今日”,包括以下分析:可以看出,当以词汇“天气”出现作为条件时,词汇“今日”出现的概率为100%,故词汇“今日”相对于词汇“天气”的第二信息可以为“二:100%”;当以词汇“阳光”出现作为条件时,词汇“今日”出现的概率为100%,故词汇“今日”相对于词汇“阳光”的第二信息可以为“二:100%”;当以词汇“心情”出现作为条件时,词汇“今日”出现的概率为100%,故词汇“今日”相对于词汇“心情”的第二信息可以为“二:100%”。
对于词汇“天气”,包括以下分析:可以看出,当以词汇“今日”出现作为条件时,词汇“天气”出现的概率为50%,故词汇“天气”相对于词汇“今日”的第二信息可以为“二:50%”;当以词汇“阳光” 出现作为条件时,词汇“天气”出现的概率为0%,故词汇“天气”相对于词汇“阳光”的第二信息可以为“二:0%”;当以词汇“心情”出现作为条件时,词汇“天气”出现的概率为0%,故词汇“天气”相对于词汇“心情”的第二信息可以为“二:0%”。
以此类推,可以确定出词汇“阳光”相对于词汇“今日”的第二信息可以为“二:50%”,相对于词汇“天气”的第二信息可以为“二:0%”,相对于词汇“心情”的第二信息可以为“二:100%”。词汇“心情”相对于词汇“今日”的第二信息可以为“二:50%”,相对于词汇“天气”的第二信息可以为“二:0%”,相对于词汇“阳光”的第二信息可以为“二:100%”。
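上述第一信息与第二信息的统计方式，可以用如下示意性的Python草图刻画（其中的函数名与示例语料均为便于理解的举例假设，并非本申请限定的实现）：

```python
# 示意：基于预设文本集合统计词汇的第一信息（出现概率）
# 与第二信息（以另一词汇出现为条件的条件概率）。
texts = [
    ["今日", "天气"],
    ["今日", "的", "阳光", "让", "我", "的", "心情", "都", "阳光", "起来"],
]

def first_info(word, texts):
    # 包含该词汇的预设文本数 / 预设文本总数
    return sum(word in t for t in texts) / len(texts)

def second_info(word, cond_word, texts):
    # 在包含 cond_word 的预设文本中，同时包含 word 的比例
    containing = [t for t in texts if cond_word in t]
    if not containing:
        return 0.0
    return sum(word in t for t in containing) / len(containing)

print(first_info("今日", texts))           # 对应文中“今日”的第一信息“一：100%”
print(second_info("天气", "今日", texts))  # 对应文中“天气”相对于“今日”的“二：50%”
```

按此统计，上文示例中各词的第一信息与第二信息均可复现。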
在本实施例的一些可选的实现方式中,上述预设词汇集合可以通过以下生成步骤获得:
步骤2011,获取预设文本集合和针对预设文本集合中的预设文本预先标注的样本分词结果。
其中,样本分词结果可以为技术人员预先标注的结果。实践中,分词结果可以为分词得到的词汇所组成的词汇序列。例如,对于预设文本“今日天气”,其所对应的样本分词结果可以为样本词汇序列“今日”;“天气”。
步骤2012,将预设文本集合中的预设文本作为输入,将所输入的预设文本所对应的样本分词结果作为期望输出,利用机器学习方法,训练得到分词模型。
在这里,分词模型可以用于表征文本与分词结果的对应关系。具体的,分词模型可以基于现有的各种用于进行语言处理的模型(例如CRF(Conditional Random Field,条件随机场)、HMM(Hidden Markov Model,隐马尔可夫模型)等)训练得到。需要说明的是,训练获得分词模型的方法是目前广泛研究和应用的公知技术,此处不再赘述。
在本实施例的一些可选的实现方式中，可以对预先确定的至少两个初始模型进行训练，得到至少两个分词模型。其中，初始模型与分词模型一一对应。例如，可以将CRF和HMM作为用于训练获得分词模型的两个初始模型，进而可以训练得到两个分词模型（包括CRF所对应的分词模型和HMM所对应的分词模型）。
步骤2013,利用分词模型对预设文本集合中的预设文本进行分词,获得第一分词结果。
具体的,对于预设文本集合中的每个预设文本,可以将该预设文本输入步骤2012中得到的分词模型,获得分词结果,并将所获得的分词结果确定为第一分词结果。
在本实施例的一些可选的实现方式中,当步骤2012对预先确定的至少两个初始模型进行训练,得到至少两个分词模型时,本步骤可以进一步利用至少两个分词模型对预设文本集合中的预设文本进行分词,获得至少两个第一分词结果。其中,第一分词结果与分词模型一一对应。
步骤2014,基于所获得的第一分词结果,生成初始词汇集合。
其中,初始词汇集合中的词汇包括基于所获得的第一分词结果确定出的第一信息。
具体的,可以首先从所获得的第一分词结果选取词汇作为初始词汇集合中的词汇。然后对于所选取的词汇中的每个词汇,确定该词汇在所获得的第一分词结果中出现的概率,生成该词汇的第一信息。进而,可以基于所选取的词汇以及词汇的第一信息生成初始词汇集合。
需要说明的是,可以采用各种方法从所获得的第一分词结果中选取词汇作为初始词汇集合中的词汇。例如,可以直接将所获得的第一分词结果中的所有词汇确定为初始词汇集合中的词汇;或者,可以从所获得的第一分词结果中选取除了单字以外的词汇作为初始词汇集合中的词汇。
在本实施例的一些可选的实现方式中，当步骤2013获得了至少两个第一分词结果时，在步骤2014之前，生成步骤还可以包括：从所获得的至少两个第一分词结果中提取相同的词汇；以及步骤2014可以包括：基于所提取的词汇和所获得的第一分词结果，生成初始词汇集合。
步骤2015,基于初始词汇集合,对预设文本集合中的预设文本进行分词,获得第二分词结果。
具体的,可以基于初始词汇集合,采用各种方法对预设文本集合 中的预设文本进行分词,获得分词结果,并将所获得的分词结果确定为第二分词结果。例如,可以采用最大正向匹配算法、最大逆向匹配算法、最小正向匹配算法、最小逆向匹配算法等,对预设文本集合中的预设文本进行分词,获得分词结果。可以理解,第二分词结果中的词汇属于初始词汇集合,故第二分词结果中的词汇也包括第一信息。
需要说明的是,基于词汇集合对文本进行分词的方法是目前广泛研究和应用的公知技术,此处不再赘述。
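上文提到的最大正向匹配算法可用如下Python草图示意（词典内容、最大词长等均为举例假设）：

```python
def forward_max_match(text, vocab, max_len=4):
    """最大正向匹配（示意）：每次从当前位置取词典中能匹配的最长词，
    匹配不到时退为单字。"""
    result = []
    i = 0
    while i < len(text):
        for l in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + l]
            if l == 1 or piece in vocab:
                result.append(piece)
                i += l
                break
    return result

vocab = {"南京", "长江", "大桥", "长江大桥"}
print(forward_max_match("南京长江大桥", vocab))
```

类似地，最大逆向匹配从文本末尾向前取最长匹配，此处不再展开。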
步骤2016,基于初始词汇集合和所获得的第二分词结果,生成预设词汇集合。
其中,预设词汇集合中的词汇包括第一信息和基于所获得的第二分词结果确定出的第二信息。
具体的，可以首先从初始词汇集合中选取词汇作为预设词汇集合中的词汇。然后对于所选取的词汇中的每个词汇，确定以其他各个词汇在所获得的第二分词结果中出现作为条件，该词汇在所获得的第二分词结果中出现的条件概率（即在其他各个词汇在所获得的第二分词结果中出现的情况下，该词汇在所获得的第二分词结果中出现的概率），进而生成该词汇的第二信息。最后，可以基于所选取的词汇以及词汇的第一信息和第二信息生成预设词汇集合。可以理解，由于初始词汇集合中的词汇包括第一信息，故确定出第二信息后，预设词汇集合中的词汇可以同时包括第一信息和第二信息。
需要说明的是，可以采用各种方法从初始词汇集合中选取词汇作为预设词汇集合中的词汇。例如，可以直接将初始词汇集合中的所有词汇确定为预设词汇集合中的词汇；或者，可以从初始词汇集合中选取所包括的第一信息所指示的概率大于等于预设阈值的词汇作为预设词汇集合中的词汇。
还需要说明的是,实践中,用于生成预设词汇集合的上述生成步骤的执行主体可以与用于分词的方法的执行主体相同或者不同。如果相同,则用于生成预设词汇集合的上述生成步骤的执行主体可以在得到预设词汇集合后将预设词汇集合存储在本地。如果不同,则用于生成预设词汇集合的上述生成步骤的执行主体可以在得到预设词汇集合 后将预设词汇集合发送给用于分词的方法的执行主体。
步骤202,基于预设词汇集合,对待分词文本进行分词,获得至少一个词汇序列。
在本实施例中,基于步骤201中获取的预设词汇集合,上述执行主体可以对待分词文本进行分词,获得至少一个词汇序列。
具体的，上述执行主体可以基于预设词汇集合，采用预设的至少两种方法，对待分词文本进行分词，获得至少一个词汇序列。需要说明的是，采用两种不同的方法对待分词文本进行分词，可能得到相同的词汇序列，故在这里，上述执行主体采用预设的至少两种方法进行分词，获得的是至少一个词汇序列。
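获得全部候选词汇序列的一种可行做法，是基于预设词汇集合对待分词文本做穷举切分，示意如下（“单字总是可作为切分单元”为实现上的假设，函数名亦为举例）：

```python
def all_segmentations(text, vocab, max_len=4):
    """基于词汇集合枚举文本的全部候选词汇序列（示意）。"""
    if not text:
        return [[]]
    results = []
    for l in range(1, min(max_len, len(text)) + 1):
        piece = text[:l]
        # 单字默认可切分；多字片段须出现在词汇集合中
        if l == 1 or piece in vocab:
            for rest in all_segmentations(text[l:], vocab, max_len):
                results.append([piece] + rest)
    return results

vocab = {"南京", "长江", "大桥", "长江大桥"}
segs = all_segmentations("南京长江大桥", vocab)
```

实践中通常会在此基础上剪枝（如仅保留若干条高概率路径），以避免候选序列数随文本长度指数增长。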
在本实施例的一些可选的实现方式中,上述执行主体还可以通过以下步骤对待分词文本进行分词,获得至少一个词汇序列:首先,上述执行主体可以对待分词文本和预设文本格式进行匹配,以确定待分词文本是否包括与预设文本格式相匹配的文本。然后,上述执行主体可以响应于确定包括,基于预设词汇集合和所确定的、相匹配的文本,对待分词文本进行分词,获得至少一个词汇序列。其中,词汇序列包括所确定的、相匹配的文本。预设文本格式为技术人员预先确定的格式。预设文本格式可以用于指示符合预设规则的文本。例如,预设文本格式可以为“x年y月z日”,其中,x,y,z可以用于表征任意数字。进而,预设文本格式可以用于指示表征日期(包括“年月日”的日期)的文本。
进一步,示例性的,预设文本格式为“x年y月z日”。待分词文本为“今天是2018年9月6日”。则上述执行主体可以通过以下步骤对待分词文本进行分词:首先,上述执行主体对待分词文本“今天是2018年9月6日”和预设文本格式“x年y月z日”进行匹配,得到相匹配的文本“2018年9月6日”。然后,对于不相匹配的文本“今天是”,上述执行主体可以基于预设词汇集合对该不相匹配的文本进行分词,例如可以得到结果“今天”;“是”。最后,上述执行主体可以将相匹配的文本“2018年9月6日”作为词汇序列中的词汇,与结果“今天”;“是”组成最终的词汇序列“今天”;“是”;“2018年9月6日”。
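上述按预设文本格式切出日期文本的做法，可以用正则表达式示意如下（正则模式与函数名为举例假设，仅覆盖“x年y月z日”一种格式）：

```python
import re

# 用正则刻画预设文本格式“x年y月z日”，相匹配的日期整体作为一个词汇
DATE_PATTERN = re.compile(r"\d+年\d+月\d+日")

def split_by_format(text):
    """返回 (片段, 是否与预设格式相匹配) 的列表；
    不匹配的片段后续再做常规分词。"""
    pieces, last = [], 0
    for m in DATE_PATTERN.finditer(text):
        if m.start() > last:
            pieces.append((text[last:m.start()], False))
        pieces.append((m.group(), True))
        last = m.end()
    if last < len(text):
        pieces.append((text[last:], False))
    return pieces

print(split_by_format("今天是2018年9月6日"))
```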
在本实施例的一些可选的实现方式中,上述执行主体还可以通过以下步骤对待分词文本进行分词,获得至少一个词汇序列:首先,上述执行主体可以对待分词文本进行命名实体识别,以确定待分词文本是否包括命名实体。然后,上述执行主体可以响应于确定包括,基于预设词汇集合和所确定的命名实体,对待分词文本进行分词,获得至少一个词汇序列,其中,词汇序列包括所确定的命名实体。其中,命名实体指的是人名、机构名、地名以及其他所有以名称为标识的实体。在这里,实体指的是词汇。
具体的,上述执行主体可以采用各种方法对待分词文本进行命名实体识别。例如,技术人员可以预先建立命名实体集合,然后上述执行主体可以对待分词文本和命名实体集合中的命名实体进行匹配,以确定待分词文本是否包括命名实体;或者,上述执行主体可以利用预先训练的命名实体识别模型对待分词文本进行识别,以确定待分词文本是否包括命名实体。其中,命名实体识别模型可以为基于现有的各种用于进行语言处理的模型(例如CRF、HMM等)训练得到。需要说明的是,训练获得命名实体识别模型的方法是目前广泛研究和应用的公知技术,此处不再赘述。
作为示例，待分词文本为“今天是李四的生日”，则上述执行主体可以通过以下步骤对该待分词文本进行分词：首先，上述执行主体可以对待分词文本“今天是李四的生日”进行命名实体识别，得到命名实体“李四”。然后，对于非命名实体部分“今天是”“的生日”，上述执行主体可以基于预设词汇集合对其进行分词，例如可以得到结果“今天”;“是”;“的”;“生日”。最后，上述执行主体可以将得到的命名实体“李四”作为词汇序列中的词汇，与结果“今天”;“是”;“的”;“生日”组成最终的词汇序列“今天”;“是”;“李四”;“的”;“生日”。
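其中“对待分词文本和命名实体集合中的命名实体进行匹配”这一种方法，可以示意如下（实体集合为举例假设；基于模型的命名实体识别不在此展开）：

```python
def find_entities(text, entity_set):
    """在待分词文本中查找命名实体集合中出现的实体（示意）。
    按长度从长到短匹配，避免长实体被其子串抢先命中。"""
    found = []
    for ent in sorted(entity_set, key=len, reverse=True):
        pos = text.find(ent)
        if pos != -1:
            found.append((pos, ent))
    return [ent for _, ent in sorted(found)]

print(find_entities("今天是李四的生日", {"李四", "张三"}))
```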
步骤203,对于至少一个词汇序列中的词汇序列,确定该词汇序列中的词汇的第一信息和第二信息,以及基于所确定的第一信息和第二信息,确定该词汇序列的概率。
在本实施例中，对于步骤202中得到的至少一个词汇序列中的词汇序列，上述执行主体可以确定该词汇序列中的词汇的第一信息和第二信息，以及基于所确定的第一信息和第二信息，确定该词汇序列的概率。其中，对于词汇序列中的词汇，该词汇的第二信息为基于与该词汇相邻的词汇确定出的第二信息。
可以理解,由于基于预设词汇集合获得的词汇序列中的词汇属于预设词汇集合,故词汇序列中的词汇可以包括第一信息和第二信息。特别之处在于,预设词汇集合中的词汇可以包括多个第二信息(对应将不同词汇出现作为条件),而这里,对于词汇序列中的词汇,该词汇的第二信息为将与该词汇相邻的词汇出现作为条件的第二信息。
在本实施例的一些可选的实现方式中,对于词汇序列中的词汇,该词汇的第二信息可以为基于与该词汇相邻,且位于该词汇之前的词汇确定出的第二信息。
在本实施例的一些可选的实现方式中,对于词汇序列中的词汇,当该词汇的第二信息为基于与该词汇相邻,且位于该词汇之前的词汇确定出的第二信息时,上述执行主体可以通过以下步骤确定该词汇的第二信息:首先,上述执行主体可以确定该词汇序列是否包括与该词汇相邻,且位于该词汇之前的词汇。然后,上述执行主体可以响应于确定该词汇序列包括与该词汇相邻,且位于该词汇之前的词汇,基于与该词汇相邻,且位于该词汇之前的词汇,确定该词汇的第二信息。
特别的,上述执行主体还可以响应于确定该词汇序列不包括与该词汇相邻,且位于该词汇之前的词汇,将预设第二信息确定为该词汇的第二信息。其中,预设第二信息包括技术人员预设的概率。
在本实施例中,对于所获得的至少一个词汇序列中的词汇序列,上述执行主体可以基于所确定的第一信息和第二信息,采用各种方法确定该词汇序列的概率。例如,可以首先对该词汇序列中的每个词汇的第一信息所指示的概率和第二信息所指示的概率进行求和,获得求和结果作为该词汇所对应的概率;然后对该词汇序列中每个词汇所对应的概率进行求和,获得求和结果作为该词汇序列的概率。
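上述“先对每个词汇的两项概率求和、再对全序列求和”的示例方法，可示意如下（字典数据为举例假设；第二信息以相邻的前一个词汇为条件）：

```python
def sequence_prob(words, first, second):
    """示意：词汇的概率 = 第一信息概率 + 第二信息概率，
    词汇序列的概率 = 各词汇概率之和。
    first[w] 为词汇 w 的第一信息所指示的概率；
    second[(w, prev)] 为以前一词汇 prev 出现为条件的条件概率。"""
    total = 0.0
    for i, w in enumerate(words):
        p1 = first.get(w, 0.0)
        p2 = second.get((w, words[i - 1]), 0.0) if i > 0 else 0.0
        total += p1 + p2
    return total

first = {"今日": 1.0, "天气": 0.5}
second = {("天气", "今日"): 0.5}
print(sequence_prob(["今日", "天气"], first, second))
```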
步骤204,从至少一个词汇序列中选取概率最大的词汇序列作为分词结果。
在本实施例中，基于步骤202中得到的至少一个词汇序列和步骤203中得到的词汇序列的概率，上述执行主体可以从至少一个词汇序列中选取概率最大的词汇序列作为分词结果。
需要说明的是,当上述至少一个词汇序列中仅包括一个词汇序列时,上述执行主体可以直接将该词汇序列确定为分词结果。
在本实施例的一些可选的实现方式中,在从至少一个词汇序列中选取概率最大的词汇序列作为分词结果之后,上述执行主体还可以执行以下步骤:
首先,上述执行主体可以获取预设的候选词汇集合。其中,候选词汇集合中的词汇用于表征但不限于以下至少一项:电影名称、电视剧名称、音乐名称。
然后，上述执行主体可以对步骤204中得到的分词结果和候选词汇集合中的词汇进行匹配，以确定分词结果是否包括与候选词汇集合中的词汇相匹配的词组。其中，词组包括相邻的至少两个词汇。
最后,响应于确定分词结果包括与候选词汇集合中的词汇相匹配的词组,上述执行主体可以将相匹配的词组确定为新的词汇,以及生成包括新的词汇的新的分词结果。
作为示例，分词结果为“我”;“喜欢”;“命运”;“交响曲”。候选词汇集合中包括音乐名称“命运交响曲”。进而，上述执行主体对分词结果“我”;“喜欢”;“命运”;“交响曲”和候选词汇集合进行匹配后，可以确定分词结果包括相匹配的词组“命运”;“交响曲”。故上述执行主体可以将相匹配的词组“命运”;“交响曲”确定为新的词汇“命运交响曲”，以及生成新的分词结果“我”;“喜欢”;“命运交响曲”。
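上述将相邻词汇合并为候选词汇的处理可示意如下（逐对合并、反复扫描的方式为实现假设，可覆盖由两个以上相邻词汇拼成的候选词汇）：

```python
def merge_candidates(words, candidates):
    """将分词结果中相邻且拼接后与候选词汇相匹配的词组合并为新词汇（示意）。"""
    result = list(words)
    changed = True
    while changed:
        changed = False
        for i in range(len(result) - 1):
            merged = result[i] + result[i + 1]
            if merged in candidates:
                result[i:i + 2] = [merged]  # 用新词汇替换相邻两个词汇
                changed = True
                break
    return result

candidates = {"命运交响曲"}
print(merge_candidates(["我", "喜欢", "命运", "交响曲"], candidates))
```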
继续参见图3,图3是根据本实施例的用于分词的方法的应用场景的一个示意图。在图3的应用场景中,服务器301首先从与之通信连接的终端302获取待分词文本“南京长江大桥”303,以及从本地获取预设词汇集合304。其中,预设词汇集合为基于预设文本集合预先生成的词汇集合。预设词汇集合中的词汇包括第一信息和第二信息。第一信息用于表征词汇在预设文本集合中出现的概率。对于预设词汇集合中的词汇,第二信息用于表征在预设文本集合中,以除该词汇以外的词汇出现作为条件,该词汇出现的条件概率。然后,服务器301 可以基于预设词汇集合304,对待分词文本303进行分词,获得词汇序列3051(例如“南京”;“长江”;“大桥”)和词汇序列3052(例如“南京”;“长江大桥”)。然后,对于词汇序列3051,服务器301可以确定该词汇序列中的词汇的第一信息和第二信息,以及基于所确定的第一信息和第二信息,确定该词汇序列的概率3061(例如50%)。同理,对于词汇序列3052,服务器301可以确定该词汇序列中的词汇的第一信息和第二信息,以及基于所确定的第一信息和第二信息,确定该词汇序列的概率3062(例如60%)。这里,对于词汇序列中的词汇,该词汇的第二信息为基于与该词汇相邻的词汇确定出的第二信息。最后,由于概率3062大于概率3061(60%大于50%),故服务器301可以选取词汇序列3052作为分词结果307。
本申请的上述实施例提供的方法有效利用了词汇的第一信息和第二信息来确定分词结果,提高了分词的准确性。
进一步参考图4,其示出了用于分词的方法的又一个实施例的流程400。该用于分词的方法的流程400,包括以下步骤:
步骤401,获取预设词汇集合和待分词文本。
在本实施例中,用于分词的方法的执行主体(例如图1所示的服务器)可以通过有线连接方式或者无线连接方式从与之通信连接的终端(例如图1所示的终端设备)或者本地获取预设词汇集合和待分词文本。其中,待分词文本为待对其进行分词的文本,可以为包括词汇的短语、句子或者文章等。
预设词汇集合为用于分词的词汇集合。预设词汇集合可以基于预设文本集合预先生成。预设文本为技术人员预先确定的、用于获得用于分词的词汇集合的文本。
步骤402,基于预设词汇集合,对待分词文本进行分词,获得至少一个词汇序列。
在本实施例中,基于步骤401中获取的预设词汇集合,上述执行主体可以对待分词文本进行分词,获得至少一个词汇序列。
步骤403,对于至少一个词汇序列中的词汇序列,执行以下步骤: 确定该词汇序列中的词汇的第一信息和第二信息;对该词汇序列中相邻的两个词汇进行连线,生成分词路径;基于该词汇序列中的词汇的第一信息和第二信息,确定分词路径的边的权重;基于所确定的权重,确定该词汇序列的概率。
在本实施例中,对于步骤402中得到的至少一个词汇序列中的词汇序列,上述执行主体可以执行以下步骤:
步骤4031,确定该词汇序列中的词汇的第一信息和第二信息。
在这里,该步骤与图2所对应的实施例中的步骤203中的确定词汇序列中的词汇的第一信息和第二信息的方法相同,此处不再赘述。
步骤4032,对该词汇序列中相邻的两个词汇进行连线,生成分词路径。
其中,分词路径的节点由该词汇序列中的词汇表征,分词路径的边为用于连接词汇的线。例如词汇序列为“南京”;“长江”;“大桥”,则其所对应的分词路径可以为“南京-长江-大桥”。可以理解,这里的分词路径为用于表征分词过程的虚拟路径。
步骤4033,基于该词汇序列中的词汇的第一信息和第二信息,确定分词路径的边的权重。
其中,分词路径的边的权重用于表征边所表征的分词方式的重要程度。边所表征的分词方式指的是分词获得边所连接的两个词汇的分词方式。
这里,基于该词汇序列中的词汇的第一信息和第二信息,确定分词路径的边的权重具体指的是基于该词汇序列中的词汇的第一信息所指示的概率和第二信息所指示的概率,确定分词路径的边的权重。
具体的,对于分词路径所包括的边中的每个边,上述执行主体可以基于该边所连接的两个词汇的第一信息所指示的概率和第二信息所指示的概率,采用各种方法确定该边的权重。例如,两个词汇中排序在后的词汇的第二信息为相对于排序在前的词汇的第二信息,此时,可以对两个词汇中,排序在前的词汇的第一信息所指示的概率与排序在后的词汇的第二信息所指示的概率求和,获得求和结果,并将求和结果确定为该边的权重。
可选的,当两个词汇中,排序在后的词汇的第二信息为相对于排序在前的词汇的第二信息时,还可以采用如下公式确定该边的权重:
weight = α·log(p(w_i)) + (1-α)·log(p(w_i|w_{i-1}))
其中，weight用于表征边的权重；w_{i-1}用于表征边所连接的两个词汇中排序在前的词汇；w_i用于表征边所连接的两个词汇中排序在后的词汇；log为对数运算的运算符；p(w_i)用于表征排序在后的词汇的第一信息所指示的概率；p(w_i|w_{i-1})用于表征排序在后的词汇的、相对于排序在前的词汇的第二信息所指示的概率；α为预先确定的、大于等于0且小于等于1的系数。
步骤4034,基于所确定的权重,确定该词汇序列的概率。
在这里,上述执行主体可以采用各种方法基于所确定的权重,确定该词汇序列的概率。例如,可以对所确定的、该词汇序列所生成的分词路径中的各个边的权重进行求和,获得求和结果,进而将所获得的求和结果确定为该词汇序列的概率;或者,可以对所确定的各个边的权重以及分词路径中的各个词汇的第一信息所指示的概率进行求和,获得求和结果,并将所获得的求和结果确定为该词汇序列的概率。
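按上述公式计算边的权重、并以各边权重之和作为词汇序列的概率度量，可示意如下（α取0.5以及各概率取值均为举例假设）：

```python
import math

def edge_weight(p_first, p_second, alpha=0.5):
    """weight = α·log(p(w_i)) + (1-α)·log(p(w_i|w_{i-1}))"""
    return alpha * math.log(p_first) + (1 - alpha) * math.log(p_second)

def path_score(words, first, second, alpha=0.5):
    """将分词路径中各边权重求和，作为该词汇序列的概率度量（示意）。"""
    score = 0.0
    for prev, cur in zip(words, words[1:]):
        score += edge_weight(first[cur], second[(cur, prev)], alpha)
    return score

# 举例假设的概率值
first = {"长江": 0.5, "大桥": 0.5, "长江大桥": 0.4}
second = {("长江", "南京"): 0.5, ("大桥", "长江"): 0.8, ("长江大桥", "南京"): 0.9}
s1 = path_score(["南京", "长江", "大桥"], first, second)
s2 = path_score(["南京", "长江大桥"], first, second)
```

在该组假设取值下，“南京-长江大桥”这条路径的得分更高，对应地被选为分词结果。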
步骤404,从至少一个词汇序列中选取概率最大的词汇序列作为分词结果。
在本实施例中,基于步骤402中得到的至少一个词汇序列和步骤403中得到的词汇序列的概率,上述执行主体可以从至少一个词汇序列中选取概率最大的词汇序列作为分词结果。
上述步骤401、步骤402、步骤404分别与前述实施例中的步骤201、步骤202、步骤204一致,上文针对步骤201、步骤202和步骤204的描述也适用于步骤401、步骤402和步骤404,此处不再赘述。
从图4中可以看出,与图2对应的实施例相比,本实施例中的用于分词的方法的流程400突出了基于所获得的词汇序列生成分词路径,确定分词路径中的边的权重,并基于所确定的权重,确定词汇序列的概率的步骤。由此,本实施例描述的方案可以引入更多用于确定词汇序列的概率的数据,从而可以实现更为准确的分词。
进一步参考图5,作为对上述各图所示方法的实现,本申请提供了一种用于分词的装置的一个实施例,该装置实施例与图2所示的方法实施例相对应,该装置具体可以应用于各种电子设备中。
如图5所示,本实施例的用于分词的装置500包括:第一获取单元501、文本分词单元502、概率确定单元503和序列选取单元504。其中,第一获取单元501被配置成获取预设词汇集合和待分词文本,其中,预设词汇集合为基于预设文本集合预先生成的词汇集合,预设词汇集合中的词汇包括第一信息和第二信息,第一信息用于表征词汇在预设文本集合中出现的概率,对于预设词汇集合中的词汇,第二信息用于表征在预设文本集合中,以除该词汇以外的词汇出现作为条件,该词汇出现的条件概率;文本分词单元502被配置成基于预设词汇集合,对待分词文本进行分词,获得至少一个词汇序列;概率确定单元503被配置成对于至少一个词汇序列中的词汇序列,确定该词汇序列中的词汇的第一信息和第二信息,以及基于所确定的第一信息和第二信息,确定该词汇序列的概率,其中,对于词汇序列中的词汇,该词汇的第二信息为基于与该词汇相邻的词汇确定出的第二信息;序列选取单元504被配置成从至少一个词汇序列中选取概率最大的词汇序列作为分词结果。
在本实施例中,用于分词的装置500的第一获取单元501可以通过有线连接方式或者无线连接方式从与之通信连接的终端(例如图1所示的终端设备)或者本地获取预设词汇集合和待分词文本。其中,待分词文本为待对其进行分词的文本,可以为包括词汇的短语、句子或者文章等。
预设词汇集合为用于分词的词汇集合。预设词汇集合可以基于预设文本集合预先生成。预设文本为技术人员预先确定的、用于获得用于分词的词汇集合的文本。
在本实施例中,基于第一获取单元501获取的预设词汇集合,文本分词单元502可以对待分词文本进行分词,获得至少一个词汇序列。
在本实施例中，对于文本分词单元502得到的至少一个词汇序列中的词汇序列，概率确定单元503可以确定该词汇序列中的词汇的第一信息和第二信息，以及基于所确定的第一信息和第二信息，确定该词汇序列的概率。其中，对于词汇序列中的词汇，该词汇的第二信息为基于与该词汇相邻的词汇确定出的第二信息。
在本实施例中,基于文本分词单元502得到的至少一个词汇序列和概率确定单元503得到的词汇序列的概率,序列选取单元504可以从至少一个词汇序列中选取概率最大的词汇序列作为分词结果。
在本实施例的一些可选的实现方式中，概率确定单元503可以包括：路径生成模块（图中未示出），被配置成对该词汇序列中相邻的两个词汇进行连线，生成分词路径，其中，分词路径的节点由该词汇序列中的词汇表征，分词路径的边为用于连接词汇的线；权重确定模块（图中未示出），被配置成基于该词汇序列中的词汇的第一信息和第二信息，确定分词路径的边的权重；概率确定模块（图中未示出），被配置成基于所确定的权重，确定该词汇序列的概率。
在本实施例的一些可选的实现方式中,对于词汇序列中的词汇,该词汇的第二信息为基于与该词汇相邻,且位于该词汇之前的词汇确定出的第二信息。
在本实施例的一些可选的实现方式中,概率确定单元503可以进一步被配置成:对于该词汇序列中的词汇,执行以下步骤:确定该词汇序列是否包括与该词汇相邻,且位于该词汇之前的词汇;响应于确定包括,基于与该词汇相邻,且位于该词汇之前的词汇,确定该词汇的第二信息。
在本实施例的一些可选的实现方式中，预设词汇集合通过以下生成步骤获得：获取预设文本集合和针对预设文本集合中的预设文本预先标注的样本分词结果；将预设文本集合中的预设文本作为输入，将所输入的预设文本所对应的样本分词结果作为期望输出，利用机器学习方法，训练得到分词模型；利用分词模型对预设文本集合中的预设文本进行分词，获得第一分词结果；基于所获得的第一分词结果，生成初始词汇集合，其中，初始词汇集合中的词汇包括基于所获得的第一分词结果确定出的第一信息；基于初始词汇集合，对预设文本集合中的预设文本进行分词，获得第二分词结果；基于初始词汇集合和所获得的第二分词结果，生成预设词汇集合，其中，预设词汇集合中的词汇包括第一信息和基于所获得的第二分词结果确定出的第二信息。
在本实施例的一些可选的实现方式中,训练得到分词模型,包括:对预先确定的至少两个初始模型进行训练,得到至少两个分词模型;以及利用分词模型对预设文本集合中的预设文本进行分词,获得第一分词结果,包括:利用至少两个分词模型对预设文本集合中的预设文本进行分词,获得至少两个第一分词结果。
在本实施例的一些可选的实现方式中,在基于所获得的第一分词结果,生成初始词汇集合之前,生成步骤还可以包括:从所获得的至少两个第一分词结果中提取相同的词汇;以及基于所获得的第一分词结果,生成初始词汇集合可以包括:基于所提取的词汇和所获得的第一分词结果,生成初始词汇集合。
在本实施例的一些可选的实现方式中,文本分词单元502可以包括:文本匹配模块(图中未示出),被配置成对待分词文本和预设文本格式进行匹配,以确定待分词文本是否包括与预设文本格式相匹配的文本;第一分词模块(图中未示出),被配置成响应于确定包括,基于预设词汇集合和所确定的、相匹配的文本,对待分词文本进行分词,获得至少一个词汇序列,其中,词汇序列包括所确定的、相匹配的文本。
在本实施例的一些可选的实现方式中,文本分词单元502可以包括:文本识别模块(图中未示出),被配置成对待分词文本进行命名实体识别,以确定待分词文本是否包括命名实体;第二分词模块(图中未示出),被配置成响应于确定包括,基于预设词汇集合和所确定的命名实体,对待分词文本进行分词,获得至少一个词汇序列,其中,词汇序列包括所确定的命名实体。
在本实施例的一些可选的实现方式中，装置500还可以包括：第二获取单元（图中未示出），被配置成获取预设的候选词汇集合，其中，候选词汇集合中的词汇用于表征以下至少一项：电影名称、电视剧名称、音乐名称；词汇匹配单元（图中未示出），被配置成对分词结果和候选词汇集合中的词汇进行匹配，以确定分词结果是否包括与候选词汇集合中的词汇相匹配的词组，其中，词组包括相邻的至少两个词汇；结果生成单元（图中未示出），被配置成响应于确定包括，将相匹配的词组确定为新的词汇，以及生成包括新的词汇的新的分词结果。
可以理解的是,该装置500中记载的诸单元与参考图2描述的方法中的各个步骤相对应。由此,上文针对方法描述的操作、特征以及产生的有益效果同样适用于装置500及其中包含的单元,在此不再赘述。
本申请的上述实施例提供的装置500有效利用了词汇的第一信息和第二信息来确定分词结果,提高了分词的准确性。
下面参考图6,其示出了适于用来实现本申请实施例的电子设备(例如图1所示的终端设备/服务器)的计算机系统600的结构示意图。图6示出的终端设备/服务器仅仅是一个示例,不应对本申请实施例的功能和使用范围带来任何限制。
如图6所示,计算机系统600包括中央处理单元(CPU)601,其可以根据存储在只读存储器(ROM)602中的程序或者从存储部分608加载到随机访问存储器(RAM)603中的程序而执行各种适当的动作和处理。在RAM 603中,还存储有系统600操作所需的各种程序和数据。CPU 601、ROM 602以及RAM 603通过总线604彼此相连。输入/输出(I/O)接口605也连接至总线604。
以下部件连接至I/O接口605:包括键盘、鼠标等的输入部分606;包括诸如阴极射线管(CRT)、液晶显示器(LCD)等以及扬声器等的输出部分607;包括硬盘等的存储部分608;以及包括诸如LAN卡、调制解调器等的网络接口卡的通信部分609。通信部分609经由诸如因特网的网络执行通信处理。驱动器610也根据需要连接至I/O接口605。可拆卸介质611,诸如磁盘、光盘、磁光盘、半导体存储器等等,根据需要安装在驱动器610上,以便于从其上读出的计算机程序根据需要被安装入存储部分608。
特别地，根据本公开的实施例，上文参考流程图描述的过程可以被实现为计算机软件程序。例如，本公开的实施例包括一种计算机程序产品，其包括承载在计算机可读介质上的计算机程序，该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中，该计算机程序可以通过通信部分609从网络上被下载和安装，和/或从可拆卸介质611被安装。在该计算机程序被中央处理单元（CPU）601执行时，执行本申请的方法中限定的上述功能。需要说明的是，本申请所述的计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件，或者任意以上的组合。计算机可读存储介质的更具体的例子可以包括但不限于：具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机访问存储器（RAM）、只读存储器（ROM）、可擦式可编程只读存储器（EPROM或闪存）、光纤、便携式紧凑磁盘只读存储器（CD-ROM）、光存储器件、磁存储器件、或者上述的任意合适的组合。在本申请中，计算机可读存储介质可以是任何包含或存储程序的有形介质，该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。而在本申请中，计算机可读的信号介质可以包括在基带中或者作为载波一部分传播的数据信号，其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式，包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读的信号介质还可以是计算机可读存储介质以外的任何计算机可读介质，该计算机可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的程序代码可以用任何适当的介质传输，包括但不限于：无线、电线、光缆、RF等等，或者上述的任意合适的组合。
附图中的流程图和框图，图示了按照本申请各种实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上，流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分，该模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意，在有些作为替换的实现中，方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如，两个接连地表示的方框实际上可以基本并行地执行，它们有时也可以按相反的顺序执行，这依所涉及的功能而定。也要注意的是，框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合，可以用执行规定的功能或操作的专用的基于硬件的系统来实现，或者可以用专用硬件与计算机指令的组合来实现。
描述于本申请实施例中所涉及到的单元可以通过软件的方式实现，也可以通过硬件的方式来实现。所描述的单元也可以设置在处理器中，例如，可以描述为：一种处理器包括第一获取单元、文本分词单元、概率确定单元和序列选取单元。其中，这些单元的名称在某种情况下并不构成对该单元本身的限定，例如，文本分词单元还可以被描述为“对待分词文本进行分词的单元”。
作为另一方面,本申请还提供了一种计算机可读介质,该计算机可读介质可以是上述实施例中描述的电子设备中所包含的;也可以是单独存在,而未装配入该电子设备中。上述计算机可读介质承载有一个或者多个程序,当上述一个或者多个程序被该电子设备执行时,使得该电子设备:获取预设词汇集合和待分词文本,其中,预设词汇集合为基于预设文本集合预先生成的词汇集合,预设词汇集合中的词汇包括第一信息和第二信息,第一信息用于表征词汇在预设文本集合中出现的概率,对于预设词汇集合中的词汇,第二信息用于表征在预设文本集合中,以除该词汇以外的词汇出现作为条件,该词汇出现的条件概率;基于预设词汇集合,对待分词文本进行分词,获得至少一个词汇序列;对于至少一个词汇序列中的词汇序列,确定该词汇序列中的词汇的第一信息和第二信息,以及基于所确定的第一信息和第二信息,确定该词汇序列的概率,其中,对于词汇序列中的词汇,该词汇的第二信息为基于与该词汇相邻的词汇确定出的第二信息;从至少一个词汇序列中选取概率最大的词汇序列作为分词结果。
以上描述仅为本申请的较佳实施例以及对所运用技术原理的说明。本领域技术人员应当理解,本申请中所涉及的发明范围,并不限于上述技术特征的特定组合而成的技术方案,同时也应涵盖在不脱离 上述发明构思的情况下,由上述技术特征或其等同特征进行任意组合而形成的其它技术方案。例如上述特征与本申请中公开的(但不限于)具有类似功能的技术特征进行互相替换而形成的技术方案。

Claims (22)

  1. 一种用于分词的方法,包括:
    获取预设词汇集合和待分词文本,其中,预设词汇集合为基于预设文本集合预先生成的词汇集合,预设词汇集合中的词汇包括第一信息和第二信息,第一信息用于表征词汇在预设文本集合中出现的概率,对于预设词汇集合中的词汇,第二信息用于表征在预设文本集合中,以除该词汇以外的词汇出现作为条件,该词汇出现的条件概率;
    基于所述预设词汇集合,对所述待分词文本进行分词,获得至少一个词汇序列;
    对于所述至少一个词汇序列中的词汇序列,确定该词汇序列中的词汇的第一信息和第二信息,以及基于所确定的第一信息和第二信息,确定该词汇序列的概率,其中,对于词汇序列中的词汇,该词汇的第二信息为基于与该词汇相邻的词汇确定出的第二信息;
    从至少一个词汇序列中选取概率最大的词汇序列作为分词结果。
  2. 根据权利要求1所述的方法,其中,所述基于所确定的第一信息和第二信息,确定该词汇序列的概率,包括:
    对该词汇序列中相邻的两个词汇进行连线,生成分词路径,其中,分词路径的节点由该词汇序列中的词汇表征,分词路径的边为用于连接词汇的线;
    基于该词汇序列中的词汇的第一信息和第二信息,确定分词路径的边的权重;
    基于所确定的权重,确定该词汇序列的概率。
  3. 根据权利要求1所述的方法,其中,对于词汇序列中的词汇,该词汇的第二信息为基于与该词汇相邻,且位于该词汇之前的词汇确定出的第二信息。
  4. 根据权利要求3所述的方法，其中，所述确定该词汇序列中的词汇的第二信息，包括：
    对于该词汇序列中的词汇,执行以下步骤:确定该词汇序列是否包括与该词汇相邻,且位于该词汇之前的词汇;响应于确定包括,基于与该词汇相邻,且位于该词汇之前的词汇,确定该词汇的第二信息。
  5. 根据权利要求1所述的方法,其中,所述预设词汇集合通过以下生成步骤获得:
    获取所述预设文本集合和针对所述预设文本集合中的预设文本预先标注的样本分词结果;
    将所述预设文本集合中的预设文本作为输入,将所输入的预设文本所对应的样本分词结果作为期望输出,利用机器学习方法,训练得到分词模型;
    利用所述分词模型对所述预设文本集合中的预设文本进行分词,获得第一分词结果;
    基于所获得的第一分词结果,生成初始词汇集合,其中,初始词汇集合中的词汇包括基于所获得的第一分词结果确定出的第一信息;
    基于所述初始词汇集合,对所述预设文本集合中的预设文本进行分词,获得第二分词结果;
    基于所述初始词汇集合和所获得的第二分词结果,生成所述预设词汇集合,其中,预设词汇集合中的词汇包括第一信息和基于所获得的第二分词结果确定出的第二信息。
  6. 根据权利要求5所述的方法,其中,所述训练得到分词模型,包括:
    对预先确定的至少两个初始模型进行训练,得到至少两个分词模型;以及
    所述利用所述分词模型对所述预设文本集合中的预设文本进行分词,获得第一分词结果,包括:
    利用所述至少两个分词模型对所述预设文本集合中的预设文本进行分词,获得至少两个第一分词结果。
  7. 根据权利要求6所述的方法,其中,在所述基于所获得的第一分词结果,生成初始词汇集合之前,所述生成步骤还包括:
    从所获得的至少两个第一分词结果中提取相同的词汇;以及
    所述基于所获得的第一分词结果,生成初始词汇集合,包括:
    基于所提取的词汇和所获得的第一分词结果,生成初始词汇集合。
  8. 根据权利要求1所述的方法,其中,所述对所述待分词文本进行分词,获得至少一个词汇序列,包括:
    对所述待分词文本和预设文本格式进行匹配,以确定所述待分词文本是否包括与所述预设文本格式相匹配的文本;
    响应于确定包括,基于所述预设词汇集合和所确定的、相匹配的文本,对所述待分词文本进行分词,获得至少一个词汇序列,其中,词汇序列包括所确定的、相匹配的文本。
  9. 根据权利要求1所述的方法,其中,所述对所述待分词文本进行分词,获得至少一个词汇序列,包括:
    对所述待分词文本进行命名实体识别,以确定所述待分词文本是否包括命名实体;
    响应于确定包括,基于所述预设词汇集合和所确定的命名实体,对所述待分词文本进行分词,获得至少一个词汇序列,其中,词汇序列包括所确定的命名实体。
  10. 根据权利要求1-9之一所述的方法,其中,在所述从至少一个词汇序列中选取概率最大的词汇序列作为分词结果之后,所述方法还包括:
    获取预设的候选词汇集合,其中,所述候选词汇集合中的词汇用于表征以下至少一项:电影名称、电视剧名称、音乐名称;
    对所述分词结果和所述候选词汇集合中的词汇进行匹配，以确定所述分词结果是否包括与所述候选词汇集合中的词汇相匹配的词组，其中，词组包括相邻的至少两个词汇；
    响应于确定包括,将相匹配的词组确定为新的词汇,以及生成包括新的词汇的新的分词结果。
  11. 一种用于分词的装置,包括:
    第一获取单元,被配置成获取预设词汇集合和待分词文本,其中,预设词汇集合为基于预设文本集合预先生成的词汇集合,预设词汇集合中的词汇包括第一信息和第二信息,第一信息用于表征词汇在预设文本集合中出现的概率,对于预设词汇集合中的词汇,第二信息用于表征在预设文本集合中,以除该词汇以外的词汇出现作为条件,该词汇出现的条件概率;
    文本分词单元,被配置成基于所述预设词汇集合,对所述待分词文本进行分词,获得至少一个词汇序列;
    概率确定单元,被配置成对于所述至少一个词汇序列中的词汇序列,确定该词汇序列中的词汇的第一信息和第二信息,以及基于所确定的第一信息和第二信息,确定该词汇序列的概率,其中,对于词汇序列中的词汇,该词汇的第二信息为基于与该词汇相邻的词汇确定出的第二信息;
    序列选取单元,被配置成从至少一个词汇序列中选取概率最大的词汇序列作为分词结果。
  12. 根据权利要求11所述的装置,其中,所述概率确定单元包括:
    路径生成模块,被配置成对该词汇序列中相邻的两个词汇进行连线,生成分词路径,其中,分词路径的节点由该词汇序列中的词汇表征,分词路径的边为用于连接词汇的线;
    权重确定模块,被配置成基于该词汇序列中的词汇的第一信息和第二信息,确定分词路径的边的权重;
    概率确定模块,被配置成基于所确定的权重,确定该词汇序列的概率。
  13. 根据权利要求11所述的装置,其中,对于词汇序列中的词汇,该词汇的第二信息为基于与该词汇相邻,且位于该词汇之前的词汇确定出的第二信息。
  14. 根据权利要求13所述的装置，其中，所述概率确定单元进一步被配置成：
    对于该词汇序列中的词汇,执行以下步骤:确定该词汇序列是否包括与该词汇相邻,且位于该词汇之前的词汇;响应于确定包括,基于与该词汇相邻,且位于该词汇之前的词汇,确定该词汇的第二信息。
  15. 根据权利要求11所述的装置,其中,所述预设词汇集合通过以下生成步骤获得:
    获取所述预设文本集合和针对所述预设文本集合中的预设文本预先标注的样本分词结果;
    将所述预设文本集合中的预设文本作为输入,将所输入的预设文本所对应的样本分词结果作为期望输出,利用机器学习方法,训练得到分词模型;
    利用所述分词模型对所述预设文本集合中的预设文本进行分词,获得第一分词结果;
    基于所获得的第一分词结果,生成初始词汇集合,其中,初始词汇集合中的词汇包括基于所获得的第一分词结果确定出的第一信息;
    基于所述初始词汇集合,对所述预设文本集合中的预设文本进行分词,获得第二分词结果;
    基于所述初始词汇集合和所获得的第二分词结果,生成所述预设词汇集合,其中,预设词汇集合中的词汇包括第一信息和基于所获得的第二分词结果确定出的第二信息。
  16. 根据权利要求15所述的装置,其中,所述训练得到分词模型,包括:
    对预先确定的至少两个初始模型进行训练，得到至少两个分词模型；以及
    所述利用所述分词模型对所述预设文本集合中的预设文本进行分词,获得第一分词结果,包括:
    利用所述至少两个分词模型对所述预设文本集合中的预设文本进行分词,获得至少两个第一分词结果。
  17. 根据权利要求16所述的装置,其中,在所述基于所获得的第一分词结果,生成初始词汇集合之前,所述生成步骤还包括:
    从所获得的至少两个第一分词结果中提取相同的词汇;以及
    所述基于所获得的第一分词结果,生成初始词汇集合,包括:
    基于所提取的词汇和所获得的第一分词结果,生成初始词汇集合。
  18. 根据权利要求11所述的装置,其中,所述文本分词单元包括:
    文本匹配模块,被配置成对所述待分词文本和预设文本格式进行匹配,以确定所述待分词文本是否包括与所述预设文本格式相匹配的文本;
    第一分词模块,被配置成响应于确定包括,基于所述预设词汇集合和所确定的、相匹配的文本,对所述待分词文本进行分词,获得至少一个词汇序列,其中,词汇序列包括所确定的、相匹配的文本。
  19. 根据权利要求11所述的装置,其中,所述文本分词单元包括:
    文本识别模块,被配置成对所述待分词文本进行命名实体识别,以确定所述待分词文本是否包括命名实体;
    第二分词模块,被配置成响应于确定包括,基于所述预设词汇集合和所确定的命名实体,对所述待分词文本进行分词,获得至少一个词汇序列,其中,词汇序列包括所确定的命名实体。
  20. 根据权利要求11-19之一所述的装置,其中,所述装置还包括:
    第二获取单元，被配置成获取预设的候选词汇集合，其中，所述候选词汇集合中的词汇用于表征以下至少一项：电影名称、电视剧名称、音乐名称；
    词汇匹配单元,被配置成对所述分词结果和所述候选词汇集合中的词汇进行匹配,以确定所述分词结果是否包括与所述候选词汇集合中的词汇相匹配的词组,其中,词组包括相邻的至少两个词汇;
    结果生成单元,被配置成响应于确定包括,将相匹配的词组确定为新的词汇,以及生成包括新的词汇的新的分词结果。
  21. 一种电子设备,包括:
    一个或多个处理器;
    存储装置,其上存储有一个或多个程序,
    当所述一个或多个程序被所述一个或多个处理器执行,使得所述一个或多个处理器实现如权利要求1-10中任一所述的方法。
  22. 一种计算机可读介质,其上存储有计算机程序,其中,该程序被处理器执行时实现如权利要求1-10中任一所述的方法。
PCT/CN2018/116345 2018-09-14 2018-11-20 用于分词的方法和装置 WO2020052069A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/981,273 US20210042470A1 (en) 2018-09-14 2018-11-20 Method and device for separating words

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811076566.7 2018-09-14
CN201811076566.7A CN109190124B (zh) 2018-09-14 2018-09-14 用于分词的方法和装置

Publications (1)

Publication Number Publication Date
WO2020052069A1 true WO2020052069A1 (zh) 2020-03-19

Family

ID=64911546

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/116345 WO2020052069A1 (zh) 2018-09-14 2018-11-20 用于分词的方法和装置

Country Status (3)

Country Link
US (1) US20210042470A1 (zh)
CN (1) CN109190124B (zh)
WO (1) WO2020052069A1 (zh)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325227A (zh) * 2018-09-14 2019-02-12 北京字节跳动网络技术有限公司 用于生成修正语句的方法和装置
CN109859813B (zh) * 2019-01-30 2020-11-10 新华三大数据技术有限公司 一种实体修饰词识别方法及装置
CN110188355A (zh) * 2019-05-29 2019-08-30 北京声智科技有限公司 一种基于wfst技术的分词方法、系统、设备及介质
CN110751234B (zh) * 2019-10-09 2024-04-16 科大讯飞股份有限公司 Ocr识别纠错方法、装置及设备
CN111090996B (zh) * 2019-12-02 2023-07-14 东软集团股份有限公司 一种分词的方法、装置及存储介质
CN113111656B (zh) * 2020-01-13 2023-10-31 腾讯科技(深圳)有限公司 实体识别方法、装置、计算机可读存储介质和计算机设备
CN113435194B (zh) * 2021-06-22 2023-07-21 中国平安人寿保险股份有限公司 词汇切分方法、装置、终端设备及存储介质

Citations (5)

Publication number Priority date Publication date Assignee Title
US20140309986A1 (en) * 2013-04-11 2014-10-16 Microsoft Corporation Word breaker from cross-lingual phrase table
CN104375989A (zh) * 2014-12-01 2015-02-25 国家电网公司 自然语言文本关键词关联网络构建系统
CN104899190A (zh) * 2015-06-04 2015-09-09 百度在线网络技术(北京)有限公司 分词词典的生成方法和装置及分词处理方法和装置
CN106610937A (zh) * 2016-09-19 2017-05-03 四川用联信息技术有限公司 一种基于信息论的中文自动分词算法
CN108038103A (zh) * 2017-12-18 2018-05-15 北京百分点信息科技有限公司 一种对文本序列进行分词的方法、装置和电子设备

Family Cites Families (19)

Publication number Priority date Publication date Assignee Title
US5377281A (en) * 1992-03-18 1994-12-27 At&T Corp. Knowledge-based character recognition
JP2001249922A (ja) * 1999-12-28 2001-09-14 Matsushita Electric Ind Co Ltd 単語分割方式及び装置
AUPR824601A0 (en) * 2001-10-15 2001-11-08 Silverbrook Research Pty. Ltd. Methods and system (npw004)
JP4652737B2 (ja) * 2004-07-14 2011-03-16 インターナショナル・ビジネス・マシーンズ・コーポレーション 単語境界確率推定装置及び方法、確率的言語モデル構築装置及び方法、仮名漢字変換装置及び方法、並びに、未知語モデルの構築方法、
EP1675019B1 (en) * 2004-12-10 2007-08-01 International Business Machines Corporation System and method for disambiguating non diacritized arabic words in a text
CN101155182A (zh) * 2006-09-30 2008-04-02 阿里巴巴公司 Network-based spam filtering method and apparatus
KR101465770B1 (ko) * 2007-06-25 2014-11-27 구글 인코포레이티드 Word probability determination
CN101158969B (zh) * 2007-11-23 2010-06-02 腾讯科技(深圳)有限公司 Whole sentence generation method and apparatus
KR101496885B1 (ko) * 2008-04-07 2015-02-27 삼성전자주식회사 Sentence spacing system and method
US20090326916A1 (en) * 2008-06-27 2009-12-31 Microsoft Corporation Unsupervised chinese word segmentation for statistical machine translation
US20110161072A1 (en) * 2008-08-20 2011-06-30 Nec Corporation Language model creation apparatus, language model creation method, speech recognition apparatus, speech recognition method, and recording medium
US9141867B1 (en) * 2012-12-06 2015-09-22 Amazon Technologies, Inc. Determining word segment boundaries
CN103678282B (zh) * 2014-01-07 2016-05-25 苏州思必驰信息科技有限公司 Word segmentation method and apparatus
CN104156349B (zh) * 2014-03-19 2017-08-15 邓柯 Out-of-vocabulary word discovery and word segmentation system and method based on a statistical dictionary model
US20160162467A1 (en) * 2014-12-09 2016-06-09 Idibon, Inc. Methods and systems for language-agnostic machine learning in natural language processing using feature extraction
WO2016144963A1 (en) * 2015-03-10 2016-09-15 Asymmetrica Labs Inc. Systems and methods for asymmetrical formatting of word spaces according to the uncertainty between words
CN105426539B (zh) * 2015-12-23 2018-12-18 成都云数未来信息科学有限公司 Dictionary-based Lucene Chinese word segmentation method
US10679008B2 (en) * 2016-12-16 2020-06-09 Microsoft Technology Licensing, Llc Knowledge base for analysis of text
US10713519B2 (en) * 2017-06-22 2020-07-14 Adobe Inc. Automated workflows for identification of reading order from text segments using probabilistic language models

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
JIANG JIANHONG ET AL: "Analysis and application of Chinese word segmentation model which consist of dictionary and statistics method", COMPUTER ENGINEERING AND DESIGN, vol. 33, no. 1, 31 January 2012 (2012-01-31), pages 387 - 391, XP055691070, ISSN: 1000-7024, DOI: 10.16208/j.issn1000-7024.2012.01.034 *

Also Published As

Publication number Publication date
CN109190124B (zh) 2019-11-26
US20210042470A1 (en) 2021-02-11
CN109190124A (zh) 2019-01-11

Similar Documents

Publication Publication Date Title
WO2020052069A1 (zh) Method and apparatus for word segmentation
JP7122341B2 (ja) Method and apparatus for evaluating translation quality
CN113962315B (zh) Model pre-training method, apparatus, device, storage medium, and program product
CN107491534B (zh) Information processing method and apparatus
US10176804B2 (en) Analyzing textual data
US11132518B2 (en) Method and apparatus for translating speech
CN107273503B (zh) Method and apparatus for generating same-language parallel text
US10755048B2 (en) Artificial intelligence based method and apparatus for segmenting sentence
US10630798B2 (en) Artificial intelligence based method and apparatus for pushing news
CN110019782B (zh) Method and apparatus for outputting text categories
CN109543058B (zh) Method, electronic device, and computer-readable medium for detecting images
CN111428010B (zh) Method and apparatus for intelligent human-machine question answering
US20200075024A1 (en) Response method and apparatus thereof
JP7301922B2 (ja) Semantic search method, apparatus, electronic device, storage medium, and computer program
US11699074B2 (en) Training sequence generation neural networks using quality scores
WO2018045646A1 (zh) Artificial-intelligence-based human-machine interaction method and apparatus
CN109241286B (zh) Method and apparatus for generating text
WO2020103899A1 (zh) Method for generating image-text information and method for generating an image database
JP7266683B2 (ja) Information verification method based on voice interaction, apparatus, device, computer storage medium, and computer program
CN109766418B (zh) Method and apparatus for outputting information
CN109582825B (zh) Method and apparatus for generating information
WO2020052061A1 (zh) Method and apparatus for processing information
CN110019948B (zh) Method and apparatus for outputting information
CN107766498B (zh) Method and apparatus for generating information
CN110647613A (zh) Courseware construction method, apparatus, server, and storage medium

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 25.06.2021)

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18933619

Country of ref document: EP

Kind code of ref document: A1

122 Ep: pct application non-entry in european phase

Ref document number: 18933619

Country of ref document: EP

Kind code of ref document: A1