WO2024007827A1 - Text segmentation method, apparatus, computer device and storage medium - Google Patents

Text segmentation method, apparatus, computer device and storage medium

Info

Publication number
WO2024007827A1
Authority
WO
WIPO (PCT)
Prior art keywords
breakpoint
text
adjacent characters
value
processed
Application number
PCT/CN2023/100021
Inventor
李长林
肖冰
曹磊
罗奇帅
Original Assignee
马上消费金融股份有限公司
Application filed by 马上消费金融股份有限公司
Priority to EP23834601.9A (EP4379599A1)
Publication of WO2024007827A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • The present application relates to the field of natural language processing, and in particular to a text segmentation method, apparatus, computer device, and storage medium.
  • Word segmentation of text is an important step in natural language processing.
  • Deeper natural language processing tasks, such as personalized recommendation, sentiment analysis, topic classification, and public opinion analysis, require high-accuracy word segmentation results as a prerequisite.
  • The purpose of the embodiments of the present application is to provide a text segmentation method, apparatus, computer device, and storage medium that solve the problem of low segmentation efficiency in existing word segmentation methods.
  • This application provides a text segmentation method, which includes: obtaining text to be segmented and preprocessing it to obtain processed text; identifying breakpoint positions in the processed text, where the breakpoint position identification includes calculating a breakpoint possibility value between two adjacent characters in the processed text and identifying breakpoint positions in the processed text according to that value; and performing word segmentation on the text to be segmented according to the identified breakpoint positions.
  • This application provides a text segmentation apparatus, including: a preprocessing unit, used to obtain text to be segmented and preprocess it to obtain processed text; a breakpoint identification unit, used to identify breakpoint positions in the processed text, where the breakpoint position identification includes calculating a breakpoint possibility value between two adjacent characters in the processed text and identifying breakpoint positions in the processed text according to that value; and a word segmentation unit, used to perform word segmentation on the text to be segmented according to the identified breakpoint positions.
  • An embodiment of the present application provides a computer device that includes a processor and a memory arranged to store computer-executable instructions. The computer-executable instructions are configured to be executed by the processor and are used to perform the steps of the method described in the first aspect.
  • An embodiment of the present application provides a storage medium used to store computer-executable instructions, and the computer-executable instructions cause a computer to perform the steps of the method described in the first aspect.
  • In the embodiments of the present application, the text to be segmented is first obtained and preprocessed to obtain the processed text. Breakpoint positions are then identified in the processed text.
  • The breakpoint position identification includes calculating a breakpoint possibility value between two adjacent characters in the processed text and identifying breakpoint positions in the processed text according to that value. Finally, the text to be segmented is segmented according to the identified breakpoint positions.
  • This approach avoids computing complex parameters such as left and right adjacency entropy and adjacency change number during word segmentation, so segmentation efficiency can be improved.
  • Figure 1 is a schematic flow chart of a text segmentation method provided by an embodiment of the present application.
  • Figure 2 is a schematic flowchart of detecting whether there is a breakpoint between two adjacent characters provided by an embodiment of the present application.
  • Figure 3 is a schematic flowchart of detecting whether there is a breakpoint between two adjacent characters provided by another embodiment of the present application.
  • Figure 4 is a schematic structural diagram of a text segmentation device provided by an embodiment of the present application.
  • Figure 5 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • Word segmentation can be performed using statistical methods.
  • The main principle of such methods is to use a large experimental corpus to calculate statistical features such as word frequency, word formation probability, left and right adjacency entropy, and adjacency change number, and to identify words in the text from these features.
  • The calculation process is complicated, which results in low segmentation efficiency.
  • There is also a limit on the length of the segmented words: usually a segmented word contains at most four characters, and words containing five or more characters cannot be obtained through segmentation.
  • To address this, an embodiment of the present application provides a text segmentation method.
  • The main idea of this method is as follows: first, obtain the text to be segmented and preprocess it to obtain the processed text; then perform breakpoint position identification in the processed text.
  • The breakpoint position identification includes calculating a first possibility value that a breakpoint exists between two adjacent characters in the processed text, and/or calculating a second possibility value that no breakpoint exists between two adjacent characters in the processed text, and identifying breakpoint positions in the processed text according to the calculation result. Finally, the text to be segmented is segmented based on the identified breakpoint positions.
  • Figure 1 is a schematic flow chart of a text segmentation method provided by an embodiment of the present application. As shown in Figure 1, the process includes the following steps:
  • Step S102 Obtain the text to be segmented, preprocess the text to be segmented, and obtain the processed text.
  • Step S104 Perform breakpoint position identification in the processed text. Breakpoint position identification includes calculating the breakpoint possibility value between two adjacent characters in the processed text and identifying breakpoint positions in the processed text based on that value.
  • Step S106 Perform word segmentation processing on the text to be segmented based on the identified breakpoint positions.
  • In this embodiment, the text to be segmented is first obtained and preprocessed to obtain the processed text. Breakpoint positions are then identified in the processed text.
  • The breakpoint position identification includes calculating a breakpoint possibility value between two adjacent characters in the processed text and identifying breakpoint positions in the processed text based on that value.
  • This avoids computing complex parameters such as left and right adjacency entropy and adjacency change number during word segmentation, so segmentation efficiency can be improved.
  • Because segmentation is decided by breakpoints, there is no limit on the length of a segmented word, and words of any length can be obtained. This embodiment can therefore solve the problems that existing word segmentation methods are inefficient and limit the length of the segmented words.
  • The text segmentation method in this embodiment can be executed by a server dedicated to natural language processing.
  • The text to be segmented is obtained first.
  • The text to be segmented can be obtained from a set of texts to be segmented.
  • The set of texts to be segmented contains a large number of sentences, and the text to be segmented can be any sentence in that set.
  • The text to be segmented is also preprocessed to obtain the processed text.
  • Preprocessing can, for example, set the text format of the text to be segmented to a default format, or remove punctuation marks from the text to be segmented, to facilitate segmentation.
  • In one embodiment, preprocessing the text to be segmented to obtain the processed text specifically includes: matching a pre-established stop lexicon against the text to be segmented to determine the characters that appear both in the text to be segmented and in the stop lexicon; identifying breakpoint positions in the text to be segmented based on the determined characters; and using the text to be segmented, with those breakpoint positions identified, as the processed text.
  • The stop lexicon is established in advance and includes preset stop words, stop characters, punctuation marks, numbers, special symbols, and the like.
  • The pre-established stop lexicon is matched against the text to be segmented to determine the characters that appear both in the text to be segmented and in the stop lexicon. Because the determined characters are in the stop lexicon, each one is equivalent to an already segmented word, single character, single number, or single symbol. Breakpoint positions can therefore be identified in the text to be segmented based on the determined characters: the position before a determined character and the position after it are breakpoint positions. The text to be segmented, with those positions identified, is used as the processed text.
  • Position symbols can also be inserted into the text to be segmented.
  • The inserted position symbols represent the position of each character within the text to be segmented. For example, for the text to be segmented "There is a little duck here.", position symbols are inserted before the first character and after each character, which yields the position-annotated text.
  • Based on the determined characters, breakpoint positions are identified in the text to be segmented as follows: when the determined character is a stop word, the position represented by the position symbol before the stop word and the position represented by the position symbol after it are used as breakpoint positions; when the determined character is a stop character, punctuation mark, number, or special symbol, the positions represented by the position symbols before and after the determined character are used as breakpoint positions.
  • The determined characters appear both in the text to be segmented and in the stop lexicon, and the stop lexicon records not only stop words but also stop characters, punctuation marks, numbers, special symbols, and the like, so the determined characters fall into two cases.
  • In one case, the determined character is a stop word; for example, for the text to be segmented "There is a little duck here.", the determined character is "duck". In the other case, the determined character is a stop character, punctuation mark, number, or special symbol; for example, for the text to be segmented "Have you eaten?", the determined character is "?".
  • In the first case, the position represented by the position symbol before the stop word and the position represented by the position symbol after it are used as breakpoint positions. For example, the breakpoint position can be recorded as (5,7).
  • In the second case, the position represented by the position symbol before the determined character and the position represented by the position symbol after it are used as breakpoint positions. For example, the breakpoint position can be recorded as (7,8).
  • In this way, breakpoint positions can be preliminarily identified and recorded in the text to be segmented.
  • Preliminarily identifying the breakpoint positions caused by common stop words, characters, or symbols improves the efficiency of breakpoint position identification.
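  • The stop-lexicon matching described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function names, the longest-entry-first matching order, and the sample sentence and lexicon are all assumptions made for the example. Positions follow the convention described above, where position p lies before character p (0-based) and a matched entry is recorded as (starting position, ending position).

```python
def match_stop_lexicon(text, stop_lexicon):
    """Return (start, end) position pairs of stop-lexicon entries found in text.

    Position p is the gap before character p, so text[start:end] is the match.
    Longest-entry-first matching is an assumption of this sketch.
    """
    entries = sorted((e for e in stop_lexicon if e), key=len, reverse=True)
    spans = []
    i = 0
    while i < len(text):
        for entry in entries:
            if text.startswith(entry, i):
                spans.append((i, i + len(entry)))
                i += len(entry)  # jump past the matched entry
                break
        else:
            i += 1  # no entry starts here; move to the next character
    return spans


def preliminary_breakpoints(spans):
    """The positions before and after every matched entry are breakpoints."""
    return sorted({p for span in spans for p in span})


# Hypothetical sentence and lexicon, chosen only for illustration.
spans = match_stop_lexicon("今天吃饭了吗?", {"吃饭", "?"})
print(spans)                           # [(2, 4), (6, 7)]
print(preliminary_breakpoints(spans))  # [2, 4, 6, 7]
```

  • The recorded spans serve two purposes in the description: their endpoints are preliminary breakpoint positions, and the spans themselves mark characters that later breakpoint identification should skip.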
  • In step S106, the text to be segmented is segmented based on the identified breakpoint positions.
  • These breakpoint positions comprise the breakpoint positions obtained during preprocessing and the breakpoint positions identified in step S104.
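  • As a rough sketch of step S106, the text can be cut at the union of the breakpoint positions from preprocessing and from step S104. The function name and the sample values below are assumptions for illustration; position p is taken to lie before character p (0-based).

```python
def split_at_breakpoints(text, breakpoints):
    """Cut text at the given breakpoint positions and return the pieces.

    Positions 0 and len(text) are implicit boundaries, so only interior
    positions produce cuts; duplicates are ignored.
    """
    cuts = sorted({p for p in breakpoints if 0 < p < len(text)})
    bounds = [0] + cuts + [len(text)]
    return [text[start:end] for start, end in zip(bounds, bounds[1:])]


# Hypothetical inputs: positions 2 and 4 from preprocessing, none from S104.
preprocessing_breakpoints = [2, 4]
s104_breakpoints = []
print(split_at_breakpoints("今天吃饭了吗", preprocessing_breakpoints + s104_breakpoints))
# ['今天', '吃饭', '了吗']
```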
  • After the text to be segmented, with breakpoint positions identified, is used as the processed text, the method also records the positions of the determined characters in the text to be segmented, so that these characters are skipped during breakpoint position identification in step S104.
  • When the determined character is a stop word, the breakpoint position is recorded as the position of the determined character in the text to be segmented, so that the determined character is skipped during breakpoint position identification in step S104.
  • The recorded position is the position of the determined character in the text to be segmented.
  • For example, the position of the determined character can be recorded as (5,7).
  • When the determined character is a stop character, punctuation mark, number, or special symbol, the position represented by the position symbol before the determined character and the position represented by the position symbol after it are used as the breakpoint position, and the breakpoint position is recorded as the position of the determined character in the text to be segmented, so that the determined character is skipped during breakpoint position identification in step S104.
  • For example, the breakpoint position is recorded as (7,8).
  • In other words, the breakpoint position identified in the text to be segmented during preprocessing is equivalent to the position of the determined character in the text to be segmented.
  • In step S104, breakpoint position identification is performed in the processed text.
  • Breakpoint position identification includes calculating the breakpoint possibility value between two adjacent characters in the processed text and identifying breakpoint positions in the processed text based on that value.
  • In one embodiment, calculating the breakpoint possibility value between two adjacent characters in the processed text includes calculating a first possibility value that a breakpoint exists between the two adjacent characters.
  • In another embodiment, it includes calculating a second possibility value that no breakpoint exists between the two adjacent characters.
  • In yet another embodiment, it includes calculating both the first possibility value that a breakpoint exists between the two adjacent characters and the second possibility value that no breakpoint exists between them.
  • In step S104, the two adjacent characters involved in breakpoint position identification are any two adjacent characters in the processed text, excluding the characters determined through the stop lexicon.
  • In step S104, the breakpoint position identification process identifies whether a breakpoint exists between two such adjacent characters in the processed text.
  • For example, for the text to be segmented "Have you eaten today?", the character determined through the stop lexicon is "eat", and "eat" no longer takes part in the breakpoint position identification of step S104. The adjacent characters considered in step S104 are therefore pairs of adjacent characters in the processed text other than "eat": the two characters translated as "today" form one pair, and the two characters translated as "have you" form another. The last character of "today" and the first character of "have you" are not treated as adjacent, because "eat" lies between them. Step S104 then identifies whether a breakpoint exists within "today" and whether one exists within "have you".
  • The following describes how breakpoint position identification in step S104 skips the characters determined through stop lexicon matching.
  • In one embodiment of step S104, calculating the breakpoint possibility value between two adjacent characters in the processed text and identifying breakpoint positions in the processed text according to that value proceeds as follows:
  • If a breakpoint position is identified, that position is used as the traversal starting position and the first traversal process is repeated. If a recorded position is reached, the traversal continues from that position until the next recorded position is reached; that next recorded position is then used as the traversal starting position and the first traversal process is repeated.
  • The starting position of the processed text, that is, the position before the first character in the processed text, is used as the traversal starting position, and the first traversal process is executed.
  • In the first traversal process, every two adjacent characters are traversed starting from the traversal starting position. For each pair, the first possibility value that a breakpoint exists between the two characters and/or the second possibility value that no breakpoint exists between them is calculated, and whether a breakpoint exists between the pair is identified from the calculation result. In one case, the calculation result shows that a breakpoint exists between two traversed adjacent characters, and the breakpoint position is thereby determined.
  • In another case, no breakpoint position is reached, but a previously recorded position of a determined character in the text to be segmented is reached.
  • In a further case, the traversal reaches the end position of the processed text, that is, the position after the last character in the processed text, and the traversal ends.
  • When a breakpoint position is identified, it is used as the traversal starting position and the first traversal process is repeated.
  • For example, if the identified breakpoint position is 1, position 1 is used as the traversal starting position and the traversal continues from there.
  • One special case needs explanation: if, starting from position 1, the next breakpoint position found is position 2, only one character lies between position 1 and position 2, and that character is an independent segmentation result.
  • The first traversal process includes: traversing every two adjacent characters starting from the traversal starting position; calculating the first possibility value that a breakpoint exists between each two traversed adjacent characters, and/or the second possibility value that no breakpoint exists between them; and identifying from the calculation result whether a breakpoint exists between each two traversed adjacent characters, until a breakpoint position is identified, a recorded position is reached, or the end position of the processed text is reached.
  • When the position of a character matched through the stop lexicon is recorded, it can be recorded in a specific format, such as (starting position, ending position). When the traversal reaches a recorded position, that position is the starting position of the matched character; as the traversal continues, the next recorded position reached is the ending position of the matched character. That next recorded position is therefore used as the traversal starting position, and the first traversal process is repeated.
  • In this way, the pre-recorded positions are skipped through loop traversal, which avoids identifying the pre-matched characters a second time.
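  • The loop traversal with skipping can be sketched as below. This is a simplified illustration under several assumptions: the function and parameter names are hypothetical, the breakpoint test is passed in as a callable `has_breakpoint(a, b)` rather than computed from possibility values, and the "restart from the identified breakpoint" step is folded into a single forward loop, which visits the same adjacent pairs.

```python
def first_traversal(text, recorded_spans, has_breakpoint):
    """Collect breakpoint positions while skipping pre-recorded spans.

    recorded_spans holds (start, end) positions of characters already matched
    through the stop lexicon; has_breakpoint(a, b) is a caller-supplied test
    for a breakpoint between adjacent characters a and b (an assumption here).
    """
    span_end = {start: end for start, end in recorded_spans}
    breakpoints = set()
    pos = 0
    while pos < len(text):
        if pos in span_end:
            # Reached a recorded start: both ends are breakpoints, and the
            # traversal resumes from the recorded end position.
            breakpoints.update((pos, span_end[pos]))
            pos = span_end[pos]
            continue
        nxt = pos + 1
        # Only test pairs whose second character is not a pre-matched one.
        if nxt < len(text) and nxt not in span_end:
            if has_breakpoint(text[pos], text[nxt]):
                breakpoints.add(nxt)
        pos = nxt
    return sorted(breakpoints)


# Hypothetical predicate that reports a breakpoint between "了" and "吗".
found = first_traversal("今天吃饭了吗", [(2, 4)],
                        lambda a, b: (a, b) == ("了", "吗"))
print(found)  # [2, 4, 5]
```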
  • In another embodiment of step S104, calculating the breakpoint possibility value between two adjacent characters in the processed text and identifying breakpoint positions in the processed text according to that value proceeds as follows:
  • The second traversal process includes: traversing every two adjacent characters starting from the traversal starting position; calculating the first possibility value that a breakpoint exists between each two traversed adjacent characters, and/or the second possibility value that no breakpoint exists between them; and identifying from the calculation result whether a breakpoint exists between each two traversed adjacent characters, until a breakpoint position is identified or the end position of the sub-text is reached.
  • The processed text is divided according to its starting position, its ending position, and the recorded positions, to obtain multiple sub-texts. For example, in the position-annotated sentence "0 today 1 day 2 eat 3 meals 4 5 6", "eat" is a predetermined character whose recorded position is (2, 4). The starting and ending positions of each sub-text are determined from the starting position, ending position, and recorded positions of the processed text, and the sub-texts are obtained from those positions; none of the sub-texts includes a predetermined character. In this example, the sub-texts obtained are "0 today 1 day 2" and "4 5 6".
  • For each sub-text, the starting position of the sub-text is used as the traversal starting position and the second traversal process is performed.
  • The traversal of a sub-text ends when its end position, that is, the position after the last character in the sub-text, is reached.
  • In this way, the processed text is divided into multiple sub-texts according to its starting position, ending position, and recorded positions, and traversing within each sub-text skips the pre-recorded positions, which avoids identifying the pre-matched characters a second time.
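  • Dividing the processed text into sub-texts around the recorded positions can be sketched as follows. The function name is hypothetical, and sub-texts are returned as (start, end) position pairs, which is an assumption of this sketch rather than a format given in the description.

```python
def split_into_subtexts(text_length, recorded_spans):
    """Return (start, end) ranges of the text not covered by recorded spans.

    text_length is the number of characters in the processed text, so the
    start position is 0 and the end position is text_length.
    """
    subtexts = []
    cursor = 0
    for start, end in sorted(recorded_spans):
        if start > cursor:
            subtexts.append((cursor, start))  # gap before this recorded span
        cursor = max(cursor, end)
    if cursor < text_length:
        subtexts.append((cursor, text_length))  # tail after the last span
    return subtexts


# Six characters with one recorded span (2, 4), as in the example above.
print(split_into_subtexts(6, [(2, 4)]))  # [(0, 2), (4, 6)]
```

  • Each returned range can then be traversed independently, since no range contains a pre-matched character.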
  • Alternatively, within each sub-text no loop traversal is needed: the first possibility value that a breakpoint exists between every two adjacent characters in the sub-text, and/or the second possibility value that no breakpoint exists between them, can be calculated directly.
  • Breakpoint positions are then identified within the sub-text based on the calculation result.
  • The second traversal process includes: traversing every two adjacent characters starting from the traversal starting position; calculating the first possibility value that a breakpoint exists between each two traversed adjacent characters, and/or the second possibility value that no breakpoint exists between them; and identifying from the calculation result whether a breakpoint exists between each two traversed adjacent characters. In one case, the calculation result shows that a breakpoint exists between two traversed adjacent characters, and the breakpoint position is thereby determined.
  • Action (a1) and action (b2) above both involve calculating the first possibility value and/or the second possibility value and identifying, from the calculation result, whether a breakpoint exists between two adjacent characters.
  • This process is similar to that of step S104, in which the breakpoint possibility value between two adjacent characters in the processed text is calculated and breakpoint positions are identified in the processed text based on that value.
  • The process of step S104 is introduced below.
  • For specific implementation details of action (a1) and action (b2), refer to the following description.
  • In step S104, calculating the breakpoint possibility value between two adjacent characters in the processed text and identifying breakpoint positions in the processed text according to that value specifically includes:
  • (c1) In the processed text, calculate the first possibility value that a breakpoint exists between two adjacent characters. If the first possibility value is less than or equal to the first preset threshold, determine that there is no breakpoint between the two adjacent characters. If the first possibility value is greater than the first preset threshold, calculate the second possibility value that no breakpoint exists between the two adjacent characters; if the second possibility value is less than the second preset threshold, determine that a breakpoint exists between the two adjacent characters, and if the second possibility value is greater than or equal to the second preset threshold, determine that there is no breakpoint between them.
  • (c2) In the processed text, calculate the second possibility value that no breakpoint exists between two adjacent characters. If the second possibility value is greater than or equal to the second preset threshold, determine that there is no breakpoint between the two adjacent characters. If the second possibility value is less than the second preset threshold, calculate the first possibility value that a breakpoint exists between the two adjacent characters; if the first possibility value is greater than the first preset threshold, determine that a breakpoint exists between the two adjacent characters, and if the first possibility value is less than or equal to the first preset threshold, determine that there is no breakpoint between them.
  • Figure 2 is a schematic flowchart of detecting whether there is a breakpoint between two adjacent characters provided by an embodiment of the present application. As shown in Figure 2, the process includes:
  • Step S202 Calculate the first possibility value that a breakpoint exists between two adjacent characters.
  • Step S204 Determine whether the first possibility value is greater than the first preset threshold.
  • Step S206 Calculate the second possibility value that there is no breakpoint between two adjacent characters.
  • Step S208 Determine whether the second possibility value is less than the second preset threshold.
  • Step S210 Determine that there is a breakpoint between two adjacent characters.
  • Step S212 Determine that there is no breakpoint between two adjacent characters.
  • Figure 3 is a schematic flowchart of detecting whether there is a breakpoint between two adjacent characters provided by another embodiment of the present application. As shown in Figure 3, the process includes:
  • Step S302 Calculate the second possibility value that there is no breakpoint between two adjacent characters.
  • Step S304 Determine whether the second possibility value is less than the second preset threshold.
  • Step S306 Calculate the first possibility value that a breakpoint exists between two adjacent characters.
  • Step S308 Determine whether the first possibility value is greater than the first preset threshold.
  • Step S310 Determine that there is a breakpoint between two adjacent characters.
  • Step S312 Determine that there is no breakpoint between two adjacent characters.
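  • The two flows of Figure 2 and Figure 3 can be sketched as a pair of small functions. The function names are hypothetical, and the two possibility values are passed as zero-argument callables (an assumption of this sketch) so that the second calculation is only carried out when the first check does not already rule out a breakpoint.

```python
def has_breakpoint_fig2(first_value, second_value, first_threshold, second_threshold):
    """Figure 2 order: check the first possibility value, then the second."""
    if first_value() <= first_threshold:        # S204 fails -> S212
        return False
    return second_value() < second_threshold    # S208 -> S210 or S212


def has_breakpoint_fig3(first_value, second_value, first_threshold, second_threshold):
    """Figure 3 order: check the second possibility value, then the first."""
    if second_value() >= second_threshold:      # S304 fails -> S312
        return False
    return first_value() > first_threshold      # S308 -> S310 or S312


# Hypothetical values and thresholds, chosen only to exercise both branches.
print(has_breakpoint_fig2(lambda: 5.0, lambda: 0.2, 3.0, 0.5))  # True
print(has_breakpoint_fig3(lambda: 5.0, lambda: 0.8, 3.0, 0.5))  # False
```

  • Both orders reach the same decision: a breakpoint is reported only when the first possibility value exceeds the first preset threshold and the second possibility value is below the second preset threshold. They differ only in which value is calculated first and therefore which calculation can be skipped.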
  • A first preset threshold is set for the first possibility value.
  • A second preset threshold is set for the second possibility value.
  • Both the first preset threshold and the second preset threshold can be set as needed.
  • In essence, this embodiment detects whether there is a breakpoint between two adjacent characters based on the first possibility value that a breakpoint exists between them and/or the second possibility value that no breakpoint exists between them.
  • In one embodiment, the first possibility value that a breakpoint exists between the two adjacent characters is calculated as follows:
  • The processed text comes from the text to be segmented, and the text to be segmented can come from the set of texts to be segmented. The preset text library in this step can therefore be the set of texts to be segmented; it can also be another pre-built text library that contains a large amount of text.
  • In step (d1), the first number of occurrences of each of the two adjacent characters in the preset text library is obtained, and the second number of occurrences, that is, the number of times the two characters appear adjacently in the preset text library, is also obtained. The first number of occurrences of a character includes the cases where it appears adjacent to the other character, so the first number of occurrences includes the second number of occurrences. The second number of occurrences can be understood as the number of times the two adjacent characters appear as one word in the preset text library.
  • In step (d2), the first possibility value that a breakpoint exists between the two adjacent characters is calculated from the first number of occurrences of each of the two characters and the second number of occurrences.
  • Step (d2) may be:
  • In step (d21), the first numbers of occurrences of the two adjacent characters are multiplied to obtain a product, and the product is divided by the second number of occurrences of the two adjacent characters as one word in the preset text library to obtain a ratio. In step (d22), the first possibility value that a breakpoint exists between the two adjacent characters is determined from the ratio.
  • For example, the processed text obtained is "这里有个小鸭子" ("There is a little duck here").
  • In step S104, the first possible degree value of a breakpoint between "有" and "个" is calculated.
  • The specific calculation process is expressed by formula (1), where:
  • f(a) represents the number of occurrences of character a in the set of texts to be segmented, that is, the first number of occurrences of a;
  • f(b) represents the number of occurrences of character b in the set of texts to be segmented, that is, the first number of occurrences of b;
  • f(ab) represents the number of times characters a and b appear adjacently in the set of texts to be segmented, that is, the second number of occurrences.
  • a can be "有";
  • b can be "个";
  • ab can be "有个".
  • The calculation result of formula (1) is the first possible degree value.
  • In this embodiment, the first possible degree value of a breakpoint between two adjacent characters can be calculated from the first number of occurrences of each of the two adjacent characters in the preset text library and the second number of occurrences of the two characters appearing adjacently in the preset text library. The calculation process is simple and easy to implement.
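Formula (1) is not reproduced in this text, so the following Python sketch assumes the simplest reading of steps (d21) and (d22): the ratio f(a)·f(b)/f(ab) is used directly as the first possible degree value. The function name, the toy corpus and the handling of f(ab) = 0 are illustrative assumptions and are not taken from the application.

```python
def first_possible_degree_value(a, b, corpus):
    """Return f(a) * f(b) / f(ab) over a list of corpus sentences.

    Assumption: the ratio itself serves as the first possible degree value.
    """
    # First numbers of occurrences: every occurrence of each character.
    f_a = sum(sentence.count(a) for sentence in corpus)
    f_b = sum(sentence.count(b) for sentence in corpus)
    # Second number of occurrences: a immediately followed by b.
    f_ab = sum(
        1
        for sentence in corpus
        for left, right in zip(sentence, sentence[1:])
        if left == a and right == b
    )
    if f_ab == 0:
        # Assumption: a pair never seen together is treated as maximally
        # likely to be separated by a breakpoint.
        return float("inf")
    return f_a * f_b / f_ab


# Hypothetical three-sentence text library.
corpus = ["这里有个小鸭子", "那里有个池塘", "他有书"]
value = first_possible_degree_value("有", "个", corpus)  # 3 * 2 / 2
```

Under this reading, a pair that nearly always occurs together yields a small ratio, and a pair of frequent characters that rarely co-occur yields a large one, which matches the rule that a value above the first preset threshold points towards a breakpoint.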
  • In one embodiment, for two adjacent characters in the processed text, the second possible degree value of no breakpoint between the two adjacent characters is calculated, specifically:
  • In step (e1), either one of the two adjacent characters is removed from the processed text to obtain the first text, and the other of the two adjacent characters is removed from the processed text to obtain the second text.
  • In step (e2), based on the processed text, the first text and the second text, the second possible degree value of no breakpoint between the two adjacent characters is calculated.
  • In one embodiment, the second possible degree value of no breakpoint between the two adjacent characters is calculated from the processed text, the first text and the second text as follows: the three texts are vectorized, the first distance between the processed text and the first text and the second distance between the processed text and the second text are calculated, and the average of the first distance and the second distance is used as the second possible degree value.
  • For example, the processed text obtained is "这里有个小鸭子".
  • In step S104, the second possible degree value of no breakpoint between "有" and "个" is calculated. The specific calculation process is:
  • The first distance and the second distance may be a Euclidean distance, a cosine distance, etc.
  • The vectorization methods include but are not limited to TF-IDF, word2vec, GloVe, ELMo, BERT, etc.
  • The process of calculating, from the processed text, the first text and the second text, the second possible degree value of no breakpoint between two adjacent characters can be expressed by formula (2), where:
  • text_vec represents the vector of the processed text;
  • text_a_vec represents the vector of the first text;
  • text_b_vec represents the vector of the second text;
  • d(text_vec, text_a_vec) represents the vector distance between the processed text and the first text, that is, the first distance;
  • d(text_vec, text_b_vec) represents the vector distance between the processed text and the second text, that is, the second distance.
  • In this embodiment, the first text and the second text can be obtained by removing characters from the processed text, and the second possible degree value of no breakpoint between two adjacent characters is determined from the first distance between the processed text and the first text and the second distance between the processed text and the second text.
  • The calculation process is simple and easy to implement.
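Steps (e1) and (e2) can be sketched in Python as below. Since the device description states that the average of the first and second distances is used as the second possible degree value, the sketch computes (d(text_vec, text_a_vec) + d(text_vec, text_b_vec)) / 2. The character-count vectors and the cosine distance are stand-ins chosen only to keep the example self-contained; the text lists TF-IDF, word2vec, GloVe, ELMo and BERT as possible vectorizers. All function names are hypothetical.

```python
import math
from collections import Counter


def char_count_vector(text):
    # Assumption: a bag-of-characters vector replaces a real text encoder.
    return Counter(text)


def cosine_distance(u, v):
    dot = sum(u[key] * v[key] for key in u)  # missing keys count as 0
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 1.0
    return 1.0 - dot / (norm_u * norm_v)


def second_possible_degree_value(text, i):
    """Adjacent pair is (text[i], text[i + 1])."""
    first_text = text[:i] + text[i + 1:]       # step (e1): remove a
    second_text = text[:i + 1] + text[i + 2:]  # step (e1): remove b
    text_vec = char_count_vector(text)
    first_distance = cosine_distance(text_vec, char_count_vector(first_text))
    second_distance = cosine_distance(text_vec, char_count_vector(second_text))
    return (first_distance + second_distance) / 2  # step (e2): average


value = second_possible_degree_value("这里有个小鸭子", 2)  # pair "有", "个"
```

A bag-of-characters vector ignores character order, so this toy encoder gives the same value for every pair in a sentence of distinct characters. A contextual encoder would be needed for the value to reflect how much each character contributes to the meaning of the text.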
  • For example, the processed text obtained is "这里有个小鸭子".
  • In step S104, the first possible degree value of a breakpoint between "有" and "个" is first calculated through steps (d1) and (d2) above. If the first possible degree value is less than or equal to the first preset threshold, it is determined that there is no breakpoint between "有" and "个".
  • If the first possible degree value is greater than the first preset threshold, the second possible degree value of no breakpoint between "有" and "个" is calculated through steps (e1) and (e2) above. If the second possible degree value is less than the second preset threshold, it is determined that there is a breakpoint between "有" and "个". If the second possible degree value is greater than or equal to the second preset threshold, it is determined that there is no breakpoint between "有" and "个".
  • step S104 whether there is a breakpoint between two adjacent characters is identified, so that the breakpoint position is identified and recorded in the processed text.
  • step S106 word segmentation processing is performed on the text to be segmented based on the recognized breakpoint position.
  • the identified breakpoint position used in step S106 includes the breakpoint position identified in step S104, and also includes the breakpoint position recorded during the preprocessing process in step S102.
  • the position before the first character and the position after the last character in the text to be segmented can also be recorded as breakpoint positions.
  • For example, the text to be segmented is "这里有个小鸭子" ("There is a little duck here"). After the position markers are inserted, it becomes "0这1里2有3个4小5鸭6子7". Through step S102, the character "鸭子" is determined, and the breakpoint positions 5 and 7 are recorded as the position of the determined character.
  • In step S104, the text is segmented according to positions 5 and 7 to obtain the sub-text "0这1里2有3个4小5".
  • For the sub-text, traversal starts from the starting position 0 and reaches "这里". The first possible degree value of a breakpoint between "这" and "里" and the second possible degree value of no breakpoint are calculated, and it is determined that there is no breakpoint between "这" and "里". Traversal then reaches "里有". The first possible degree value of a breakpoint between "里" and "有" and the second possible degree value of no breakpoint are calculated, it is determined that there is a breakpoint between "里" and "有", and breakpoint position 2 is recorded.
  • Traversal continues from position 2 and reaches "有个". Based on the first possible degree value of a breakpoint and the second possible degree value of no breakpoint, it is determined that there is a breakpoint between "有" and "个", and breakpoint position 3 is recorded. Traversal continues from position 3 and reaches "个小". The first possible degree value of a breakpoint between "个" and "小" and the second possible degree value of no breakpoint are calculated, it is determined that there is a breakpoint between "个" and "小", and breakpoint position 4 is recorded. Traversal then continues from position 4, reaches position 5, and ends. The breakpoint positions finally obtained are 2, 3 and 4.
  • The breakpoint positions thus have three different sources: the positions recorded in step S102, the positions identified in step S104, and the first and last positions of the text to be segmented. These sources can overlap, as with position 7 above.
  • The recorded breakpoint positions can therefore be deduplicated to obtain the final breakpoint positions 0, 2, 3, 4, 5, 7.
  • In step S104, when breakpoints are identified and recorded for the sub-text "0这1里2有3个4小5", on traversing to position 2 and determining that a breakpoint exists, the breakpoint positions can be recorded as 0 and 2.
  • On traversing to position 3 and determining that a breakpoint exists, breakpoint positions 2 and 3 can be recorded. That is, each time a breakpoint is found, both the traversal starting position and the breakpoint position are recorded, which gives the positions 0, 2, 2, 3, 3, 4, 4 and 5. Removing duplicates from these positions yields breakpoint positions 0, 2, 3, 4, 5.
  • The breakpoint positions obtained here are merged with the breakpoint positions obtained in step S102 and the first and last positions of the text to be segmented to obtain the final breakpoint positions.
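The merge and deduplication of the three sources of breakpoint positions can be sketched as follows. The function name and argument layout are illustrative assumptions; the example values are those of the "这里有个小鸭子" example.

```python
def merge_breakpoints(stop_word_spans, identified, text_length):
    """Combine breakpoint positions from all three sources, without duplicates."""
    positions = {0, text_length}              # first and last positions
    for start, end in stop_word_spans:        # recorded in step S102
        positions.update((start, end))
    positions.update(identified)              # identified in step S104
    return sorted(positions)


final_positions = merge_breakpoints(
    stop_word_spans=[(5, 7)],
    identified=[0, 2, 2, 3, 3, 4, 4, 5],
    text_length=7,
)
```

Using a set makes the deduplication implicit, so it does not matter whether step S104 records only the breakpoints (2, 3, 4) or also the traversal starting positions.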
  • In step S106, the text to be segmented is segmented at the recorded breakpoint positions. Specifically, a delimiter is inserted at each recorded breakpoint position to segment the text to be segmented. For example, segmenting "这里有个小鸭子" at the final breakpoint positions 0, 2, 3, 4, 5, 7 yields the words "这里", "有", "个", "小" and "鸭子".
  • Any breakpoint symbol can be used as the delimiter. No complex parameters need to be calculated during segmentation, so the implementation is simple and the segmentation efficiency is high.
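A minimal Python sketch of step S106, assuming "/" as the delimiter (the text allows any breakpoint symbol) and positions that count characters from 0 as in the position-marker examples. The function name is hypothetical.

```python
def segment(text, breakpoints, delimiter="/"):
    """Split text at the given breakpoint positions and join with a delimiter."""
    # Always include the first and last positions so no characters are lost.
    cuts = sorted(set(breakpoints) | {0, len(text)})
    words = [text[start:end] for start, end in zip(cuts, cuts[1:])]
    return delimiter.join(words)


result = segment("这里有个小鸭子", [0, 2, 3, 4, 5, 7])
```

Because segmentation is only slicing at known positions, the resulting words can be of any length, which is the property the application contrasts with statistics-based methods.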
  • This text segmentation method is highly transferable and versatile, and can be applied to various scenarios. For example, it is applied to the field of new word recognition, and new words are discovered by counting word frequencies based on word segmentation.
  • Figure 4 is a schematic structural diagram of a text segmentation device provided by an embodiment of the present application. As shown in Figure 4, the device includes:
  • the preprocessing unit 41 is used to obtain text to be segmented, preprocess the text to be segmented, and obtain processed text.
  • the breakpoint identification unit 42 is used to calculate the breakpoint possible degree value between two adjacent characters in the processed text, and to identify breakpoint positions in the processed text based on the breakpoint possible degree value.
  • the word segmentation processing unit 43 is configured to perform word segmentation processing on the text to be segmented according to the identified breakpoint positions.
  • the breakpoint identification unit 42 is specifically configured to calculate the first possible degree value of the existence of a breakpoint between two adjacent characters in the processed text.
  • the breakpoint identification unit 42 is specifically configured to calculate a second possible degree value that there is no breakpoint between two adjacent characters in the processed text.
  • the breakpoint identification unit 42 is specifically configured to calculate the first possible degree value of the existence of a breakpoint between two adjacent characters in the processed text, and to calculate the second possible degree value that there is no breakpoint between two adjacent characters in the processed text.
  • the preprocessing unit 41 is specifically configured to: match a pre-established stop word library with the text to be segmented to determine the characters located both in the text to be segmented and in the stop word library; identify breakpoint positions in the text to be segmented according to the determined characters; and use the identified text to be segmented as the processed text.
  • a position recording unit 44, configured to: after the processed text is obtained, record the position of the determined characters in the text to be segmented, so that the determined characters are skipped during breakpoint position identification.
  • the breakpoint identification unit 42 is specifically configured to use the starting position of the processed text as the traversal starting position and perform the first traversal process.
  • If a breakpoint position is identified, the breakpoint position is used as the traversal starting position and the first traversal process is repeated. If a recorded position is reached, traversal continues from that position until the next recorded position is reached, and the next recorded position is then used as the traversal starting position and the first traversal process is repeated.
  • the first traversal process includes: traversing every two adjacent characters starting from the traversal starting position, and calculating the first possible degree value of a breakpoint between each two traversed adjacent characters. , and/or, calculate the second possible degree value that there is no breakpoint between each two adjacent characters traversed, and according to the calculation result, identify whether there is a breakpoint between each two adjacent characters traversed, until Identify the breakpoint position or traverse to the recorded position or traverse to the end position of the processed text.
  • the breakpoint identification unit 42 is specifically used to:
  • For each sub-text use the starting position of the sub-text as the traversal starting position and perform the second traversal process;
  • If a breakpoint position is identified, the breakpoint position is used as the traversal starting position and the second traversal process is repeated.
  • the second traversal process includes: traversing every two adjacent characters starting from the traversal starting position, and calculating the first possible degree value of a breakpoint between each two traversed adjacent characters, And/or, calculate the second possible degree value that there is no breakpoint between each two adjacent characters traversed, and according to the calculation result, identify whether there is a breakpoint between each two adjacent characters traversed until the identification Go to the breakpoint or traverse to the end of the subtext.
  • the breakpoint identification unit 42 is specifically used to:
  • In the processed text, calculate the first possible degree value that there is a breakpoint between two adjacent characters; if the first possible degree value is less than or equal to the first preset threshold, determine that there is no breakpoint between the two adjacent characters; if the first possible degree value is greater than the first preset threshold, calculate the second possible degree value that there is no breakpoint between the two adjacent characters; if the second possible degree value is less than the second preset threshold, determine that there is a breakpoint between the two adjacent characters; if the second possible degree value is greater than or equal to the second preset threshold, determine that there is no breakpoint between the two adjacent characters;
  • or, in the processed text, calculate the second possible degree value that there is no breakpoint between two adjacent characters; if the second possible degree value is greater than or equal to the second preset threshold, determine that there is no breakpoint between the two adjacent characters; if the second possible degree value is less than the second preset threshold, calculate the first possible degree value that there is a breakpoint between the two adjacent characters; if the first possible degree value is greater than the first preset threshold, determine that there is a breakpoint between the two adjacent characters; if the first possible degree value is less than or equal to the first preset threshold, determine that there is no breakpoint between the two adjacent characters.
  • the breakpoint identification unit 42 is also specifically used to:
  • a first possible degree value of the existence of a breakpoint between the two adjacent characters is calculated.
  • the breakpoint identification unit 42 is also specifically used to:
  • a second possible degree value without a breakpoint between two adjacent characters is calculated.
  • the breakpoint identification unit 42 is also specifically used to:
  • a first possible degree value of a breakpoint between two adjacent characters is determined.
  • the breakpoint identification unit 42 is also specifically used to:
  • the average value of the first distance and the second distance is used as the second possible degree value that there is no breakpoint between two adjacent characters.
  • the text segmentation device in this embodiment can implement each process of the aforementioned text segmentation method embodiment and achieve the same effects and functions, which will not be repeated here.
  • An embodiment of the present application also provides a computer device.
  • The device may specifically be the above-mentioned text segmentation device, used to perform the above-mentioned text segmentation method.
  • Figure 5 is a schematic structural diagram of the computer device provided by an embodiment of the present application, as shown in Figure 5.
  • Computer devices may vary greatly depending on configuration or performance, and may include one or more processors 1001 and a memory 1002; the memory 1002 may store one or more application programs or data. The memory 1002 can be transient storage or persistent storage.
  • Application programs stored in memory 1002 may include one or more modules (not shown), and each module may include a series of computer-executable instructions on a computer device.
  • the processor 1001 may be configured to communicate with the memory 1002 and execute a series of computer-executable instructions in the memory 1002 on the computer device.
  • the computer device may also include one or more power supplies 1003, one or more wired or wireless network interfaces 1004, one or more input/output interfaces 1005, one or more keyboards 1006, etc.
  • a computer device includes: a processor; and a memory arranged to store computer-executable instructions, the computer-executable instructions being configured to be executed by the processor to implement the following process:
  • Breakpoint location identification is performed in the processed text; the breakpoint location identification includes:
  • the text to be segmented is subjected to word segmentation processing.
  • An embodiment of the present application also provides a storage medium for storing computer-executable instructions.
  • the storage medium can be a USB disk, an optical disk, a hard disk, etc.
  • Breakpoint location identification is performed in the processed text; the breakpoint location identification includes:
  • the text to be segmented is subjected to word segmentation processing.
  • The storage medium in this embodiment can implement each process of the aforementioned text segmentation method embodiment and achieve the same effects and functions, which will not be repeated here.
  • embodiments of the present application may be provided as methods, systems or computer program products. Therefore, embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment that combines software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-readable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operating steps to be performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • a computing device includes one or more processors (Central Processing Unit, CPU), input/output interfaces, network interfaces, and memory.
  • Memory may include volatile memory in computer-readable media, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
  • Computer-readable media include both persistent and non-persistent, removable and non-removable media, and can store information by any method or technology.
  • Information may be computer-readable instructions, data structures, modules of programs, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change RAM (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable ROM (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, magnetic tape cassettes, disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device.
  • computer-readable media does not include transitory media, such as modulated data signals and carrier waves.
  • Embodiments of the present application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer.
  • program modules include routines, programs, objects, components, data structures, etc. that perform specific tasks or implement specific abstract data types.
  • One or more embodiments of the present application may also be practiced in distributed computing environments where tasks are performed by remote processing devices connected through a communications network.
  • program modules may be located in both local and remote computer storage media including storage devices.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

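An embodiment of the present application provides a text segmentation method, device, computer equipment and storage medium. The method includes: obtaining text to be segmented, and preprocessing the text to be segmented to obtain processed text; performing breakpoint position identification in the processed text, where the breakpoint position identification includes calculating the breakpoint possible degree value between two adjacent characters in the processed text and identifying breakpoint positions in the processed text according to the breakpoint possible degree value; and performing word segmentation processing on the text to be segmented according to the identified breakpoint positions. This embodiment can solve the problem that existing word segmentation methods have low segmentation efficiency.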

Description

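Text segmentation method, device, computer equipment and storage medium
Cross-reference
This application claims priority to the Chinese patent application No. 202210795690.9, entitled "Text segmentation method, device, computer equipment and storage medium", filed with the China Patent Office on July 7, 2022, the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the field of natural language processing, and in particular, to a text segmentation method, device, computer equipment and storage medium.
Background
Word segmentation of text is an important step in natural language processing. Deeper natural language processing tasks, such as personalized recommendation, sentiment analysis, topic classification and public opinion analysis, all require highly accurate word segmentation results as a prerequisite.
Summary of the invention
The purpose of the embodiments of the present application is to provide a text segmentation method, device, computer equipment and storage medium that solve the problem of low segmentation efficiency in word segmentation methods.
To achieve the above technical solution, the present application is implemented as follows:
In one aspect, the present application provides a text segmentation method, including: obtaining text to be segmented, and preprocessing the text to be segmented to obtain processed text; performing breakpoint position identification in the processed text, where the breakpoint position identification includes: calculating a possible degree value of the existence of a breakpoint between two adjacent characters in the processed text, and identifying breakpoint positions in the processed text according to the breakpoint possible degree value; and performing word segmentation processing on the text to be segmented according to the identified breakpoint positions.
In another aspect, the present application provides a text segmentation device, including: a preprocessing unit, configured to obtain text to be segmented and preprocess the text to be segmented to obtain processed text; a breakpoint identification unit, configured to perform breakpoint position identification in the processed text, where the breakpoint position identification includes: calculating a possible degree value of the existence of a breakpoint between two adjacent characters in the processed text, and identifying breakpoint positions in the processed text according to the breakpoint possible degree value; and a word segmentation processing unit, configured to perform word segmentation processing on the text to be segmented according to the identified breakpoint positions.
In another aspect, an embodiment of the present application provides a computer device, including: a processor; and a memory arranged to store computer-executable instructions, the computer-executable instructions being configured to be executed by the processor and being used to perform the steps of the method described in the first aspect.
In another aspect, an embodiment of the present application provides a storage medium for storing computer-executable instructions, the computer-executable instructions causing a computer to perform the steps of the method described in the first aspect.
It can be seen that, in this embodiment, the text to be segmented is first obtained and preprocessed to obtain processed text. Breakpoint position identification is then performed in the processed text, which includes calculating a possible degree value of the existence of a breakpoint between two adjacent characters in the processed text and identifying breakpoint positions in the processed text according to the breakpoint possible degree value. Finally, word segmentation processing is performed on the text to be segmented according to the identified breakpoint positions. On the one hand, since this embodiment does not need to calculate complex parameters such as left and right adjacency entropy and the adjacency variation number during word segmentation, segmentation efficiency can be improved. On the other hand, since this embodiment judges whether there is a breakpoint between two adjacent characters without limiting the length of the resulting words, words of any length can be obtained. Therefore, this embodiment can solve the problem that existing word segmentation methods have low segmentation efficiency and limit the length of the resulting words.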
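Description of the drawings
To describe the technical solutions in the embodiments of the present application or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some of the embodiments recorded in one or more embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Figure 1 is a schematic flowchart of a text segmentation method provided by an embodiment of the present application;
Figure 2 is a schematic flowchart of detecting whether there is a breakpoint between two adjacent characters provided by an embodiment of the present application;
Figure 3 is a schematic flowchart of detecting whether there is a breakpoint between two adjacent characters provided by another embodiment of the present application;
Figure 4 is a schematic structural diagram of a text segmentation device provided by an embodiment of the present application;
Figure 5 is a schematic structural diagram of a computer device provided by an embodiment of the present application.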
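Detailed description
To enable those skilled in the art to better understand the technical solutions in one or more embodiments of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the scope of protection of the present application.
It should be noted that, where no conflict arises, one or more embodiments of the present application and the features in the embodiments can be combined with each other. The embodiments of the present application are described in detail below with reference to the drawings.
At present, word segmentation can be performed with statistics-based methods. The main principle is to identify the words in a text by computing, over a large experimental corpus, statistical features such as word frequency, word formation probability, left and right adjacency entropy, and adjacency variation number. However, on the one hand, parameters such as left and right adjacency entropy and adjacency variation number must be calculated, the calculation is complex, and segmentation efficiency is therefore low. On the other hand, the length of the resulting words is limited: a segmented word usually contains at most four characters, and words of five or more characters cannot be obtained.
To solve the problem that existing word segmentation methods have low segmentation efficiency and limit the length of the resulting words, an embodiment of the present application provides a text segmentation method. The main idea is as follows. First, the text to be segmented is obtained and preprocessed to obtain processed text. Then, breakpoint position identification is performed in the processed text, which includes calculating the first possible degree value that there is a breakpoint between two adjacent characters in the processed text, and/or calculating the second possible degree value that there is no breakpoint between two adjacent characters in the processed text, and identifying breakpoint positions in the processed text according to the calculation result. Finally, word segmentation processing is performed on the text to be segmented according to the identified breakpoint positions. On the one hand, since this embodiment does not need to calculate complex parameters such as left and right adjacency entropy and adjacency variation number during word segmentation, segmentation efficiency can be improved. On the other hand, since this embodiment judges whether there is a breakpoint between two adjacent characters and does not limit the length of the resulting words, words of any length can be obtained. Therefore, this embodiment can solve the problem that existing word segmentation methods have low segmentation efficiency and limit the length of the resulting words.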
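Figure 1 is a schematic flowchart of a text segmentation method provided by an embodiment of the present application. As shown in Figure 1, the process includes the following steps:
Step S102: obtain text to be segmented, and preprocess the text to be segmented to obtain processed text.
Step S104: perform breakpoint position identification in the processed text. The breakpoint position identification includes: calculating the breakpoint possible degree value between two adjacent characters in the processed text, and identifying breakpoint positions in the processed text according to the breakpoint possible degree value.
Step S106: perform word segmentation processing on the text to be segmented according to the identified breakpoint positions.
It can be seen that, in this embodiment, the text to be segmented is first obtained and preprocessed to obtain processed text, and breakpoint position identification is then performed in the processed text, which includes calculating the breakpoint possible degree value between two adjacent characters in the processed text and identifying breakpoint positions in the processed text according to the breakpoint possible degree value. On the one hand, since this embodiment does not need to calculate complex parameters such as left and right adjacency entropy and adjacency variation number during word segmentation, segmentation efficiency can be improved. On the other hand, since this embodiment judges whether there is a breakpoint between two adjacent characters and does not limit the length of the resulting words, words of any length can be obtained. Therefore, this embodiment can solve the problem that existing word segmentation methods have low segmentation efficiency and limit the length of the resulting words.
The text segmentation method in this embodiment can be executed by a server dedicated to natural language processing.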
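In step S102 above, the text to be segmented is obtained. In one embodiment, the text to be segmented can be obtained from a set of texts to be segmented. The set of texts to be segmented is a collection of a large number of sentences, and the text to be segmented can be any sentence in the set.
In step S102 above, the text to be segmented is also preprocessed to obtain processed text. Preprocessing can take many forms. For example, the text format of the text to be segmented can be set to a default format to facilitate segmentation, or the punctuation marks in the text to be segmented can be removed to facilitate segmentation.
In one embodiment, preprocessing the text to be segmented to obtain processed text is specifically: matching a pre-established stop word library with the text to be segmented to determine the characters located both in the text to be segmented and in the stop word library, identifying breakpoint positions in the text to be segmented according to the determined characters, and using the identified text to be segmented as the processed text.
Specifically, a stop word library is established in advance. It includes preset stop words, stop characters, punctuation marks, digits, special symbols and the like. There can be multiple stop word libraries belonging to different fields, for example three stop word libraries for the financial, military and political fields. A stop word library from the same field as the text to be segmented can be selected for preprocessing, to improve the accuracy of preprocessing.
During preprocessing, the pre-established stop word library is matched with the text to be segmented to determine the characters located both in the text to be segmented and in the stop word library. Because the determined characters are in the stop word library, they are equivalent to already-segmented words, single characters, single digits or single symbols. Breakpoint positions can therefore be identified in the text to be segmented according to the determined characters: the position before a determined character and the position after it are breakpoint positions. The identified text to be segmented is then used as the processed text.
Before breakpoint positions are identified in the text to be segmented according to the determined characters, position markers can also be inserted into the text to be segmented. The inserted position markers indicate the position of each character in the text to be segmented. For example, for the text to be segmented "这里有个小鸭子。" ("There is a little duck here."), a position marker is inserted before the first character and after every character, giving:
0这1里2有3个4小5鸭6子7。8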
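The position-marker insertion above can be sketched in Python as follows. The function name is a hypothetical choice; the markers are plain decimal integers, as in the example.

```python
def insert_position_markers(text):
    """Insert 0 before the first character and i after the i-th character."""
    parts = ["0"]
    for index, char in enumerate(text, start=1):
        parts.append(char)
        parts.append(str(index))
    return "".join(parts)


marked = insert_position_markers("这里有个小鸭子。")
```

In code, the markers do not need to be materialized at all: marker i is simply the string index i, so the characters between markers s and e are `text[s:e]`.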
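After the position markers are inserted, identifying breakpoint positions in the text to be segmented according to the determined characters is specifically: when the determined character is a stop word, the positions indicated by the position marker before the stop word and the position marker after the stop word are used as breakpoint positions; when the determined character is a stop character, punctuation mark, digit or special symbol, the positions indicated by the position marker before the determined character and the position marker after it are used as breakpoint positions.
Specifically, the determined characters are located both in the text to be segmented and in the stop word library, and the pre-established stop word library records not only stop words but also stop characters, punctuation marks, digits and special symbols, so the determined characters fall into two cases. In one case, the determined character is a stop word; for example, in the text to be segmented "这里有个小鸭子。", the determined character is "鸭子". In the other case, the determined character is a stop character, punctuation mark, digit or special symbol; for example, in the text to be segmented "你吃饭了吗?", the determined character is "?".
When the determined character is a stop word, the positions indicated by the position marker before the stop word and the position marker after the stop word are used as breakpoint positions. For example, in the text to be segmented "这里有个小鸭子。", the determined character is "鸭子", so the positions indicated by the marker "5" before "鸭子" and the marker "7" after "鸭子" are used as breakpoint positions. Accordingly, the breakpoint position can be recorded as (5, 7).
When the determined character is a stop character, punctuation mark, digit or special symbol, the positions indicated by the position marker before the determined character and the position marker after it are used as breakpoint positions. For example, in the text to be segmented "这里有个小鸭子。", the determined character is "。", so the positions indicated by the marker "7" before "。" and the marker "8" after "。" are used as breakpoint positions. Accordingly, the breakpoint position can be recorded as (7, 8).
It can be seen that by matching the pre-established stop word library with the text to be segmented, breakpoint positions can be preliminarily identified in the text to be segmented and recorded. Matching against the stop word library identifies the breakpoint positions caused by common stop words, characters or symbols at an early stage, which improves the efficiency of breakpoint position identification.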
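A minimal Python sketch of the stop word library matching, assuming the library is a plain set of strings and that matches are found by exact substring search. The function name is hypothetical, and overlapping matches between different entries are not resolved here.

```python
def match_stop_words(text, stop_library):
    """Return sorted (start, end) breakpoint pairs for every library match."""
    spans = []
    for entry in stop_library:
        if not entry:
            continue  # ignore empty entries
        start = text.find(entry)
        while start != -1:
            spans.append((start, start + len(entry)))
            start = text.find(entry, start + 1)
    return sorted(spans)


# Library entries taken from the examples in the text.
spans = match_stop_words("这里有个小鸭子。", {"鸭子", "。"})
```

Each pair uses the same numbering as the position markers, so (5, 7) brackets "鸭子" and (7, 8) brackets "。".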
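It should be noted that, because some breakpoint positions are determined by stop word library matching during preprocessing, the breakpoint positions used in step S106 when segmenting the text to be segmented include both the breakpoint positions obtained during preprocessing and the breakpoint positions identified in step S104.
Further, in this embodiment, after the identified text to be segmented is used as the processed text, the method also includes recording the position of the determined characters in the text to be segmented, so that the determined characters are skipped during breakpoint position identification in step S104.
Specifically, when the determined character is a stop word, the positions indicated by the position marker before the stop word and the position marker after the stop word are used as breakpoint positions, and these breakpoint positions are recorded as the position of the determined character in the text to be segmented, so that the determined character is skipped during breakpoint position identification in step S104. For example, in the text to be segmented "这里有个小鸭子。", the determined character is "鸭子", so the positions indicated by the marker "5" before "鸭子" and the marker "7" after "鸭子" are used as breakpoint positions, that is, as the position of the determined character in the text to be segmented. Accordingly, the position of the determined character can be recorded as (5, 7).
When the determined character is a stop character, punctuation mark, digit or special symbol, the positions indicated by the position marker before the determined character and the position marker after it are used as breakpoint positions, and these breakpoint positions are recorded as the position of the determined character in the text to be segmented, so that the determined character is skipped during breakpoint position identification in step S104. For example, in the text to be segmented "这里有个小鸭子。", the determined character is "。", so the positions indicated by the marker "7" before "。" and the marker "8" after "。" are used as breakpoint positions, that is, as the position of the determined character in the text to be segmented. Accordingly, the position of the determined character can be recorded as (7, 8).
It can be understood that, in the approach above, the breakpoint positions identified in the text to be segmented according to the determined characters are equivalent to the positions of the determined characters in the text to be segmented. Through the above process, after the pre-established stop word library is matched with the text to be segmented to determine the characters located both in the text to be segmented and in the stop word library, the positions of the determined characters are also recorded, so that they are skipped during breakpoint position identification in step S104 and are not identified a second time.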
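In step S104 above, breakpoint position identification is performed in the processed text. The breakpoint position identification includes: calculating the breakpoint possible degree value between two adjacent characters in the processed text, and identifying breakpoint positions in the processed text according to the breakpoint possible degree value.
Calculating the breakpoint possible degree value between two adjacent characters in the processed text includes: calculating the first possible degree value that there is a breakpoint between two adjacent characters in the processed text.
Alternatively, calculating the breakpoint possible degree value between two adjacent characters in the processed text includes: calculating the second possible degree value that there is no breakpoint between two adjacent characters in the processed text.
Alternatively, calculating the breakpoint possible degree value between two adjacent characters in the processed text includes: calculating the first possible degree value that there is a breakpoint between two adjacent characters in the processed text, and calculating the second possible degree value that there is no breakpoint between two adjacent characters in the processed text.
The meaning of "two adjacent characters in the processed text" is explained here. As described above, the text to be segmented after identification based on the stop word library is the processed text, and the characters determined through the stop word library are not deleted from the text to be segmented, so the processed text contains the same characters as the text to be segmented. However, the characters determined through the stop word library no longer take part in breakpoint position identification in step S104. The two adjacent characters involved in breakpoint position identification in step S104 are therefore any two adjacent characters in the processed text, excluding the characters determined through the stop word library. In step S104, breakpoint position identification determines whether there is a breakpoint between such two adjacent characters.
For example, for the text to be segmented "今天吃饭了吗" ("Have you eaten today?"), the character determined through the stop word library is "吃饭", and "吃饭" no longer takes part in breakpoint position identification in step S104. The two adjacent characters involved in breakpoint position identification in step S104 are therefore any two adjacent characters in the processed text "今天吃饭了吗" other than "吃饭". The pairs of adjacent characters can include "今天" and "了吗". "天" and "了" do not count as adjacent characters, because "吃饭" lies between them. In step S104, breakpoint position identification determines whether there is a breakpoint within "今天" and whether there is a breakpoint within "了吗".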
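The following describes how breakpoint position identification is performed in step S104 so as to skip the characters determined by stop word library matching.
In one embodiment, in step S104, calculating the breakpoint possible degree value between two adjacent characters in the processed text and identifying breakpoint positions in the processed text according to the breakpoint possible degree value is specifically:
(a1) using the starting position of the processed text as the traversal starting position and performing the first traversal process;
(a2) if a breakpoint position is identified, using the breakpoint position as the traversal starting position and repeating the first traversal process; if a recorded position is reached, continuing traversal from that position until the next recorded position is reached, and using the next recorded position as the traversal starting position and repeating the first traversal process.
In action (a1), the starting position of the processed text, that is, the position before its first character, is used as the traversal starting position, and the first traversal process is performed. In the first traversal process, every two adjacent characters are traversed from the traversal starting position; the first possible degree value that there is a breakpoint between each traversed pair is calculated, and/or the second possible degree value that there is no breakpoint between each traversed pair is calculated; and, according to the calculation result, it is identified whether there is a breakpoint between each traversed pair. In one case, the calculation result shows that there is a breakpoint between the traversed pair, and a breakpoint position has been reached. In another case, no breakpoint position is reached, but a previously recorded position of a determined character in the text to be segmented is reached. In a further case, the end position of the processed text, that is, the position after its last character, is reached and traversal ends.
In one example, consider the processed text with position markers "0今1天2吃3饭4了5吗6", where "吃饭" is the determined character and the corresponding recorded position is (2, 4). Traversal of this processed text starts from position 0. If it is determined that there is a breakpoint between "今" and "天", a breakpoint position has been reached, namely position 1. If it is determined that there is no breakpoint between "今" and "天", traversal continues and reaches position 2, which is the pre-recorded position of the determined character in the text to be segmented. If the sentence "0今1天2吃3饭4了5吗6" contained no characters determined in advance through the stop word library, and traversal from position 0 determined that there is no breakpoint between any two adjacent characters, traversal would proceed directly to the end position 6 of the processed text.
In action (a2), if a breakpoint position is identified, the breakpoint position is used as the traversal starting position and the first traversal process is repeated. In the earlier example, if there is a breakpoint between "今" and "天", the breakpoint position is 1, and traversal continues onward with position 1 as the traversal starting position. A special case needs explaining here: traversing from position 1, the next position is found to be position 2, and there is only one character between position 1 and position 2, so that single character is an independent segmentation result. A similar example: for the sentence "0你1好2美3丽4", suppose the position recorded by matching the stop word library is (1, 4). Traversing from position 0, position 1 is found to be a recorded position, and there is only one character between position 0 and position 1, so that character is an independent segmentation result. Note that after the stop word library is matched, breakpoint positions have already been identified in the text to be segmented according to the determined characters and recorded, so the recorded positions can take part in segmentation as breakpoint positions.
In action (a2), if a position previously recorded by matching the stop word library is reached during traversal, traversal continues from that position until the next recorded position is reached, and the next recorded position is used as the traversal starting position and the first traversal process is repeated. Taking the sentence "0今1天2吃3饭4了5吗6" as an example, the pre-recorded position is (2, 4). Suppose that during traversal it is determined that there is no breakpoint between "今" and "天". Traversal then continues and meets position 2, continues from position 2 until it reaches position 4, the recorded position following position 2, and then uses position 4 as the traversal starting position and repeats the first traversal process.
The first traversal process includes: traversing every two adjacent characters from the traversal starting position; calculating the first possible degree value that there is a breakpoint between each traversed pair, and/or calculating the second possible degree value that there is no breakpoint between each traversed pair; and, according to the calculation result, identifying whether there is a breakpoint between each traversed pair, until a breakpoint position is identified, a recorded position is reached, or the end position of the processed text is reached.
In this embodiment, when the position of a character matched through the stop word library is recorded, it can be recorded in a specific format, such as (start position, end position). When a recorded position is reached during traversal, it is therefore the start position of the matched character; as traversal continues, the next recorded position reached is the end position of the matched character. That next recorded position is then used as the traversal starting position and the first traversal process is repeated.
It can be seen that, in this embodiment, actions (a1) and (a2) skip the pre-recorded positions through loop traversal, so that the characters matched in advance are not identified a second time.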
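Actions (a1) and (a2) can be sketched in Python as a single scan that jumps over recorded spans. The sketch assumes recorded positions are non-empty (start, end) pairs, and it takes the breakpoint test as a caller-supplied function `has_breakpoint(left, right)`; both names are hypothetical. Restarting the traversal at each breakpoint is represented here simply by continuing the loop from that position.

```python
def identify_breakpoints(text, recorded_spans, has_breakpoint):
    """Return breakpoint positions found outside the recorded spans."""
    span_end = dict(recorded_spans)   # start position -> end position
    breakpoints = []
    pos = 0                           # traversal starting position
    while pos < len(text):
        if pos in span_end:
            pos = span_end[pos]       # recorded position reached: skip the span
            continue
        nxt = pos + 1                 # boundary between text[pos] and text[nxt]
        # Test the pair only if the right-hand character is not a determined
        # character and is inside the text.
        if nxt < len(text) and nxt not in span_end:
            if has_breakpoint(text[pos], text[nxt]):
                breakpoints.append(nxt)
        pos = nxt
    return breakpoints


tested_pairs = []

def record_and_reject(left, right):
    tested_pairs.append((left, right))
    return False

found = identify_breakpoints("今天吃饭了吗", [(2, 4)], record_and_reject)
```

With "吃饭" recorded at (2, 4), only the pairs inside "今天" and "了吗" are tested, and "天" and "了" are never treated as adjacent.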
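In another embodiment, in step S104, calculating the breakpoint possible degree value between two adjacent characters in the processed text and identifying breakpoint positions in the processed text according to the breakpoint possible degree value is specifically:
(b1) segmenting the processed text according to its starting position, its end position and the recorded positions to obtain multiple sub-texts;
(b2) for each sub-text, using the starting position of the sub-text as the traversal starting position and performing the second traversal process, where the second traversal process includes: traversing every two adjacent characters from the traversal starting position; calculating the first possible degree value that there is a breakpoint between each traversed pair, and/or calculating the second possible degree value that there is no breakpoint between each traversed pair; and, according to the calculation result, identifying whether there is a breakpoint between each traversed pair, until a breakpoint position is identified or the end position of the sub-text is reached;
(b3) if a breakpoint position is identified, using the breakpoint position as the traversal starting position and repeating the second traversal process.
In action (b1), the processed text is segmented according to its starting position, its end position and the recorded positions to obtain multiple sub-texts. For example, for the sentence "0今1天2吃3饭4了5吗6", "吃饭" is the predetermined character and the corresponding recorded position is (2, 4). The starting position and end position of each sub-text are determined from the starting position and end position of the processed text and the recorded positions, and the sub-texts are obtained accordingly. None of the resulting sub-texts includes the predetermined characters. In this example, segmentation yields the sub-texts "0今1天2" and "4了5吗6".
In action (b2), within each sub-text, the starting position of the sub-text is used as the traversal starting position and the second traversal process is performed. If no breakpoint position is reached but the end position of the sub-text, that is, the position after its last character, is reached, traversal ends.
In action (b3), if a breakpoint position is reached, the breakpoint position is used as the traversal starting position and the second traversal process is repeated.
Taking the sub-text "0今1天2" as an example, traversal starts from position 0, it is determined that there is no breakpoint between "今" and "天", and traversal reaches position 2 and ends. Taking the sub-text "4了5吗6" as an example, traversal starts from position 4, it is determined that there is a breakpoint between "了" and "吗", breakpoint position 5 is recorded, and traversal continues from position 5 to position 6 and ends.
It can be seen that, in this embodiment, the processed text can be segmented according to its starting position, its end position and the recorded positions to obtain multiple sub-texts, and loop traversal within each sub-text skips the pre-recorded positions, so that the characters matched in advance are not identified a second time.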
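Action (b1) can be sketched in Python as below. Each sub-text is returned with its starting position, so that breakpoints found inside it can be reported in the coordinates of the full text. The function name and return shape are illustrative assumptions.

```python
def split_into_subtexts(text, recorded_spans):
    """Return (start_position, sub_text) pairs lying outside the recorded spans."""
    subtexts = []
    cursor = 0
    for start, end in sorted(recorded_spans):
        if start > cursor:
            subtexts.append((cursor, text[cursor:start]))
        cursor = max(cursor, end)     # jump past the determined characters
    if cursor < len(text):
        subtexts.append((cursor, text[cursor:]))
    return subtexts


subtexts = split_into_subtexts("今天吃饭了吗", [(2, 4)])
```

Because no sub-text contains a determined character, the traversal inside each sub-text needs no skip logic, which is the simplification this embodiment offers over actions (a1) and (a2).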
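In another embodiment, loop traversal is not needed within each sub-text. Instead, the first possible degree value that there is a breakpoint between every two adjacent characters in the sub-text can be calculated, and/or the second possible degree value that there is no breakpoint between every two adjacent characters in the sub-text can be calculated, and breakpoint positions are identified in the sub-text according to the calculation result.
The second traversal process includes: traversing every two adjacent characters from the traversal starting position; calculating the first possible degree value that there is a breakpoint between each traversed pair, and/or calculating the second possible degree value that there is no breakpoint between each traversed pair; and, according to the calculation result, identifying whether there is a breakpoint between each traversed pair. If the calculation result shows that there is a breakpoint between the traversed pair, a breakpoint position has been reached.
Action (a1) and action (b2) above both involve calculating the first possible degree value and/or the second possible degree value and identifying, according to the calculation result, whether there is a breakpoint between two adjacent characters. This is similar to the process in step S104 of calculating the breakpoint possible degree value between two adjacent characters in the processed text and identifying breakpoint positions in the processed text according to that value. That process is described below, and the implementation details of actions (a1) and (b2) can refer to the description below.
In step S104, calculating the breakpoint possible degree value between two adjacent characters in the processed text and identifying breakpoint positions in the processed text according to the breakpoint possible degree value specifically includes:
(c1) in the processed text, calculating the first possible degree value that there is a breakpoint between two adjacent characters; if the first possible degree value is less than or equal to the first preset threshold, determining that there is no breakpoint between the two adjacent characters; if the first possible degree value is greater than the first preset threshold, calculating the second possible degree value that there is no breakpoint between the two adjacent characters; if the second possible degree value is less than the second preset threshold, determining that there is a breakpoint between the two adjacent characters; if the second possible degree value is greater than or equal to the second preset threshold, determining that there is no breakpoint between the two adjacent characters.
Or, (c2) in the processed text, calculating the second possible degree value that there is no breakpoint between two adjacent characters; if the second possible degree value is greater than or equal to the second preset threshold, determining that there is no breakpoint between the two adjacent characters; if the second possible degree value is less than the second preset threshold, calculating the first possible degree value that there is a breakpoint between the two adjacent characters; if the first possible degree value is greater than the first preset threshold, determining that there is a breakpoint between the two adjacent characters; if the first possible degree value is less than or equal to the first preset threshold, determining that there is no breakpoint between the two adjacent characters.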
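FIG. 2 is a schematic flowchart of detecting whether a breakpoint exists between two adjacent characters according to an embodiment of the present application. As shown in FIG. 2, the flow includes:
Step S202: calculate the first likelihood value that a breakpoint exists between two adjacent characters.
Step S204: determine whether the first likelihood value is greater than the first preset threshold.
If so, perform step S206; otherwise, perform step S212.
Step S206: calculate the second likelihood value that no breakpoint exists between the two adjacent characters.
Step S208: determine whether the second likelihood value is less than the second preset threshold.
If so, perform step S210; otherwise, perform step S212.
Step S210: determine that a breakpoint exists between the two adjacent characters.
Step S212: determine that no breakpoint exists between the two adjacent characters.
FIG. 3 is a schematic flowchart of detecting whether a breakpoint exists between two adjacent characters according to another embodiment of the present application. As shown in FIG. 3, the flow includes:
Step S302: calculate the second likelihood value that no breakpoint exists between two adjacent characters.
Step S304: determine whether the second likelihood value is less than the second preset threshold.
If so, perform step S306; otherwise, perform step S312.
Step S306: calculate the first likelihood value that a breakpoint exists between the two adjacent characters.
Step S308: determine whether the first likelihood value is greater than the first preset threshold.
If so, perform step S310; otherwise, perform step S312.
Step S310: determine that a breakpoint exists between the two adjacent characters.
Step S312: determine that no breakpoint exists between the two adjacent characters.
As the flows of FIG. 2 and FIG. 3 show, a first preset threshold is set for the first likelihood value and a second preset threshold is set for the second likelihood value; both thresholds can be set as required. The flows also show that either the first likelihood value that a breakpoint exists or the second likelihood value that no breakpoint exists may be calculated first. When no breakpoint exists between the two adjacent characters, calculating only one of the two values can be sufficient, namely when the value calculated first already rules out a breakpoint; when a breakpoint exists, both values need to be calculated. In this embodiment, therefore, whether a breakpoint exists between two adjacent characters is in essence detected according to the first likelihood value that a breakpoint exists between them and/or the second likelihood value that no breakpoint exists between them.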
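The two decision flows of FIG. 2 and FIG. 3 can be sketched as short functions. The function names are illustrative assumptions; the likelihood values are passed in as zero-argument callables so that the second calculation only runs when the first check does not already rule out a breakpoint, which mirrors the early exits in the flowcharts.

```python
from typing import Callable


def detect_first_then_second(
    first_value: Callable[[], float],
    second_value: Callable[[], float],
    first_threshold: float,
    second_threshold: float,
) -> bool:
    """FIG. 2 order: steps S202 to S212. Returns True if a breakpoint exists."""
    if first_value() <= first_threshold:
        return False  # S204 "no" branch: go straight to S212
    return second_value() < second_threshold  # S208: S210 if true, else S212


def detect_second_then_first(
    first_value: Callable[[], float],
    second_value: Callable[[], float],
    first_threshold: float,
    second_threshold: float,
) -> bool:
    """FIG. 3 order: steps S302 to S312. Returns True if a breakpoint exists."""
    if second_value() >= second_threshold:
        return False  # S304 "no" branch: go straight to S312
    return first_value() > first_threshold  # S308: S310 if true, else S312
```

Both orders give the same answer for the same pair of values and thresholds; they differ only in which value is computed first and therefore in which value can be skipped.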
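In one embodiment, for two adjacent characters in the processed text, calculating the first likelihood value that a breakpoint exists between the two adjacent characters is specifically:
(d1) obtaining a first occurrence count of each of the two adjacent characters in a preset text corpus, and obtaining a second occurrence count of the two adjacent characters occurring adjacently in the preset text corpus.
(d2) calculating the first likelihood value that a breakpoint exists between the two adjacent characters according to the first occurrence count of each of the two adjacent characters and the second occurrence count.
As described above, the processed text comes from the text to be segmented, and the text to be segmented may come from a collection of texts to be segmented. In this step, the preset text corpus may therefore be that collection; it may of course also be another pre-built corpus containing a large amount of text.
In step (d1), the first occurrence count of each of the two adjacent characters in the preset text corpus is obtained, and the second occurrence count of the two adjacent characters occurring adjacently in the preset text corpus is obtained. The first occurrence count includes occurrences in which one of the two adjacent characters appears next to the other; in other words, the first occurrence count includes the second occurrence count. It can be understood that the second occurrence count is in effect the number of times the two adjacent characters occur as one word in the preset text corpus.
In step (d2), the first likelihood value that a breakpoint exists between the two adjacent characters is calculated according to the first occurrence count of each of the two adjacent characters and the second occurrence count. In one embodiment, step (d2) may specifically be:
(d21) multiplying the first occurrence counts of the two adjacent characters to obtain a product, and calculating the ratio of the product to the second occurrence count.
(d22) determining, according to the ratio, the first likelihood value that a breakpoint exists between the two adjacent characters.
In step (d21), the first occurrence counts of the two adjacent characters are multiplied to obtain a product, and the product is divided by the second occurrence count, that is, the number of times the two adjacent characters occur as one word in the preset text corpus, to obtain a ratio. In step (d22), the first likelihood value that a breakpoint exists between the two adjacent characters is determined according to the ratio.
In one example, the processed text is “有个小鸭子”. In this step, starting from “有”, the first likelihood value that a breakpoint exists between “有” and “个” is calculated as follows:
The number of occurrences of “有” in the collection of texts to be segmented is obtained as a first occurrence count, and the number of occurrences of “个” in the collection is obtained as a first occurrence count; both counts include the cases in which “有” and “个” occur adjacently. In addition, the number of occurrences of “有个” as a whole word in the collection is obtained as the second occurrence count.
Formula (1) is used to calculate, from each first occurrence count and the second occurrence count, the first likelihood value that a breakpoint exists between “有” and “个”.
In formula (1), f(a) denotes the number of occurrences of character a in the collection of texts to be segmented, i.e. a first occurrence count; f(b) denotes the number of occurrences of character b in the collection, i.e. a first occurrence count; and f(ab) denotes the number of times characters a and b occur adjacently in the collection, i.e. the second occurrence count. Here a may be “有”, b may be “个”, and ab may be “有个”. The result of formula (1) is the first likelihood value.
As can be seen, in this embodiment, the first likelihood value that a breakpoint exists between two adjacent characters can be calculated from the first occurrence count of each of the two characters in the preset text corpus and the second occurrence count of the two characters occurring adjacently in the corpus. The calculation is simple and easy to implement.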
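Steps (d1), (d21) and (d22) can be sketched with plain counting. The sketch assumes that the first likelihood value is the ratio f(a)·f(b)/f(ab) itself; the text only says the value is determined "according to the ratio", so any further scaling is not covered here. Returning infinity when the pair never occurs adjacently is also an assumption made for this illustration, as are the function names and the tiny corpus.

```python
from typing import List


def count_char(corpus: List[str], ch: str) -> int:
    """First occurrence count: how often a single character appears in the corpus."""
    return sum(doc.count(ch) for doc in corpus)


def count_adjacent(corpus: List[str], a: str, b: str) -> int:
    """Second occurrence count: how often a is immediately followed by b."""
    return sum(
        1
        for doc in corpus
        for i in range(len(doc) - 1)
        if doc[i] == a and doc[i + 1] == b
    )


def first_likelihood(a: str, b: str, corpus: List[str]) -> float:
    """Ratio of the product of the single-character counts to the adjacency count."""
    f_ab = count_adjacent(corpus, a, b)
    if f_ab == 0:
        # Assumption: a pair never seen together is treated as maximally separable.
        return float("inf")
    return count_char(corpus, a) * count_char(corpus, b) / f_ab


# Hypothetical three-sentence corpus, for illustration only.
corpus = ["有个小鸭子", "有个人", "有小鸭子"]
value_you_ge = first_likelihood("有", "个", corpus)  # 3 * 2 / 2
```

A larger ratio means the two characters occur often on their own relative to how often they occur together, which is why the comparison against the first preset threshold uses "greater than" for a possible breakpoint.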
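In one embodiment, for two adjacent characters in the processed text, calculating the second likelihood value that no breakpoint exists between the two adjacent characters is specifically:
(e1) removing one of the two adjacent characters from the processed text to obtain a first text, and removing the other of the two adjacent characters from the processed text to obtain a second text;
(e2) calculating the second likelihood value that no breakpoint exists between the two adjacent characters according to the processed text, the first text, and the second text.
In step (e1), either one of the two adjacent characters is removed from the processed text to obtain the first text, and the other of the two adjacent characters is removed from the processed text to obtain the second text. In step (e2), the second likelihood value that no breakpoint exists between the two adjacent characters is calculated according to the processed text, the first text, and the second text.
In one embodiment, calculating the second likelihood value that no breakpoint exists between the two adjacent characters according to the processed text, the first text, and the second text is specifically:
(e21) determining a first distance between the processed text and the first text, and determining a second distance between the processed text and the second text.
(e22) taking the average of the first distance and the second distance as the second likelihood value that no breakpoint exists between the two adjacent characters.
It can be understood that the smaller the distance between two texts, the closer their semantics. Therefore, when the average of the first distance and the second distance is less than a certain threshold, removing either of the two adjacent characters has little effect on the semantics of the processed text; it can be determined that the two adjacent characters do not form a complete word, that is, a breakpoint may exist between them. Conversely, when the average of the first distance and the second distance is greater than or equal to the threshold, removing either of the two adjacent characters has a significant effect on the semantics of the processed text; it can be determined that the two adjacent characters form a complete word, that is, no breakpoint exists between them.
Continuing the example above, the processed text is “有个小鸭子”. In this step, starting from “有”, the second likelihood value that no breakpoint exists between “有” and “个” is calculated as follows:
“有” is removed from the processed text to obtain the first text “个小鸭子”, and “个” is removed from the processed text to obtain the second text “有小鸭子”. The first distance between the texts “有个小鸭子” and “个小鸭子” is calculated, the second distance between the texts “有个小鸭子” and “有小鸭子” is calculated, and the average of the first distance and the second distance is taken as the second likelihood value that no breakpoint exists between “有” and “个”.
The first distance and the second distance may be a Euclidean distance, a cosine distance, or the like. Before the first and second distances are calculated, the first text, the second text, and the processed text each need to be vectorized; vectorization methods include but are not limited to TF-IDF, word2vec, GloVe, ELMo, and BERT.
The process of calculating the second likelihood value that no breakpoint exists between the two adjacent characters according to the processed text, the first text, and the second text can be expressed by the following formula (2).
Here, text_vec denotes the vector of the processed text, text_a_vec denotes the vector of the first text, and text_b_vec denotes the vector of the second text. d(text_vec, text_a_vec) denotes the first vector distance between the processed text and the first text, i.e. the first distance; d(text_vec, text_b_vec) denotes the second vector distance between the processed text and the second text, i.e. the second distance.
As can be seen, in this embodiment, the first text and the second text are obtained by removing characters from the processed text, and the second likelihood value that no breakpoint exists between two adjacent characters is determined from the first distance between the processed text and the first text and the second distance between the processed text and the second text. The calculation is simple and easy to implement.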
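Steps (e1), (e21) and (e22) amount to averaging the two distances, (d(text_vec, text_a_vec) + d(text_vec, text_b_vec)) / 2. The sketch below shows the mechanics only. It assumes a bag-of-characters vector (`collections.Counter`) and cosine distance purely to stay dependency-free; the text lists TF-IDF, word2vec, GloVe, ELMo and BERT as vectorizers, and a semantic embedding would be needed for the value to be informative. The function names are hypothetical.

```python
import math
from collections import Counter
from typing import Callable, Tuple


def drop_each(text: str, i: int) -> Tuple[str, str]:
    """(e1) Remove text[i] and text[i + 1] in turn, giving the first and second text."""
    return text[:i] + text[i + 1:], text[: i + 1] + text[i + 2:]


def cosine_distance(u: Counter, v: Counter) -> float:
    """1 minus the cosine similarity of two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def second_likelihood(
    text: str,
    i: int,
    vectorize: Callable[[str], Counter] = Counter,
    distance: Callable[[Counter, Counter], float] = cosine_distance,
) -> float:
    """(e21)/(e22) Average distance between the text and its two reduced variants."""
    first_text, second_text = drop_each(text, i)
    base = vectorize(text)
    first_distance = distance(base, vectorize(first_text))
    second_distance = distance(base, vectorize(second_text))
    return (first_distance + second_distance) / 2


reduced = drop_each("有个小鸭子", 0)
value = second_likelihood("有个小鸭子", 0)
```

With a bag-of-characters vector every pair in this sentence gets the same value, so the default vectorizer is only a placeholder; swapping in a sentence embedding through the `vectorize` parameter leaves the rest of the computation unchanged.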
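In a specific embodiment, continuing the example above, the processed text is “有个小鸭子”. In step S104, the first likelihood value that a breakpoint exists between “有” and “个” is first calculated through steps (d1) and (d2) above. If the first likelihood value is less than or equal to the first preset threshold, it is determined that no breakpoint exists between “有” and “个”. If the first likelihood value is greater than the first preset threshold, the second likelihood value that no breakpoint exists between “有” and “个” is calculated through steps (e1) and (e2) above; if the second likelihood value is less than the second preset threshold, it is determined that a breakpoint exists between “有” and “个”, and if the second likelihood value is greater than or equal to the second preset threshold, it is determined that no breakpoint exists between “有” and “个”.
Through the above process, step S104 identifies whether a breakpoint exists between two adjacent characters, and thereby identifies and records breakpoint positions in the processed text.
In step S106, the text to be segmented is segmented according to the identified breakpoint positions. Note that the identified breakpoint positions used in step S106 include both the breakpoint positions identified in step S104 and the breakpoint positions recorded during the preprocessing of step S102. In addition, the position before the first character and the position after the last character of the text to be segmented may also be recorded as breakpoint positions.
In a specific example, the text to be segmented is “这里有个小鸭子”. After position markers are inserted it becomes “0这1里2有3个4小5鸭6子7”. In step S102, the characters “鸭子” are determined and the breakpoint positions “5, 7” are recorded; positions 5 and 7 are also recorded as the position of the determined characters.
Next, in step S104, the text is segmented according to positions 5 and 7, yielding one sub-text “0这1里2有3个4小5”. Within the sub-text, traversal starts from start position 0 and reaches “这里”; the first likelihood value that a breakpoint exists and the second likelihood value that no breakpoint exists between “这” and “里” are calculated, and it is determined that there is no breakpoint within “这里”. Traversal then reaches “里有”; the first likelihood value and the second likelihood value between “里” and “有” are calculated, it is determined that a breakpoint exists within “里有”, and breakpoint position 2 is recorded. Traversal continues from position 2 to “有个”; the first likelihood value and the second likelihood value between “有” and “个” are calculated, it is determined that a breakpoint exists within “有个”, and breakpoint position 3 is recorded. Traversal continues from position 3 to “个小”; the first likelihood value and the second likelihood value between “个” and “小” are calculated, it is determined that a breakpoint exists within “个小”, and breakpoint position 4 is recorded. Traversal then continues from position 4 and reaches position 5, where it ends. The breakpoint positions finally obtained are “2, 3, 4”.
In addition, the position “0” before the first character and the position “7” after the last character of the text to be segmented are recorded as breakpoint positions.
Breakpoint positions thus come from three different sources, and these sources can overlap, as with position 7 above. The recorded breakpoint positions may therefore be de-duplicated, giving the final breakpoint positions 0, 2, 3, 4, 5, 7.
In another embodiment, in step S104, when breakpoints are identified and recorded for the sub-text “0这1里2有3个4小5”, on reaching position 2 and determining that a breakpoint exists, the breakpoint positions 0 and 2 may be recorded; on reaching position 3 and determining that a breakpoint exists, the breakpoint positions 2 and 3 may be recorded. That is, whenever a breakpoint is reached, both the traversal start position and the breakpoint position are recorded. This yields the breakpoint positions 0, 2, 2, 3, 3, 4, 4, 5, which are de-duplicated to 0, 2, 3, 4, 5. These are then merged and de-duplicated with the breakpoint positions obtained in step S102 and the first and last positions of the text to be segmented, to obtain the final breakpoint positions.
In step S106 above, the text to be segmented is segmented at the recorded breakpoint positions. Specifically, a separator is inserted at each recorded breakpoint position, thereby segmenting the text. For example, the segmentation result is: |这里|有|个|小|鸭子|.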
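The merging, de-duplication and separator insertion of step S106 can be sketched as follows. The function names and the default separator are illustrative assumptions; positions follow the convention of the example, where position 0 precedes the first character and the last position follows the final character.

```python
from typing import Iterable, List


def segment(text: str, breakpoints: Iterable[int]) -> List[str]:
    """Merge and de-duplicate breakpoint positions, then cut the text at each one."""
    cuts = sorted(
        {p for p in breakpoints if 0 <= p <= len(text)} | {0, len(text)}
    )
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]


def render(words: List[str], sep: str = "|") -> str:
    """Insert the separator at every breakpoint, including both ends."""
    return sep + sep.join(words) + sep


# Breakpoints from the three sources in the example: the traversal (recorded as
# start/breakpoint pairs), the stop-word match, and the two ends of the text.
from_traversal = [0, 2, 2, 3, 3, 4, 4, 5]
from_stop_words = [5, 7]
from_ends = [0, 7]

words = segment("这里有个小鸭子", from_traversal + from_stop_words + from_ends)
line = render(words)
```

Using a set for the merge makes the order and multiplicity of the recorded positions irrelevant, so either recording style described above leads to the same cuts.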
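In this embodiment, any breakpoint symbol may be used to split the text to be segmented. No complex parameters need to be calculated during segmentation, so the implementation is simple and segmentation is efficient.
In summary, the text segmentation method embodiments provided above have at least the following technical effects:
(1) Complex parameters such as left and right adjacency entropy and adjacency variety need not be calculated, which can improve segmentation efficiency.
(2) There is no limit on the length of the segmented words; words of any length can be obtained.
(3) The first likelihood value that a breakpoint exists and the second likelihood value that no breakpoint exists between two adjacent characters are calculated in different ways, which can improve the accuracy of breakpoint detection and in turn the accuracy of segmentation.
(4) The text segmentation method is highly transferable and general and can be applied in various scenarios, for example in new-word recognition, where new words are discovered by counting word frequencies on top of the segmentation.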
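FIG. 4 is a schematic structural diagram of a text segmentation apparatus according to an embodiment of the present application. As shown in FIG. 4, the apparatus includes:
A preprocessing unit 41, configured to obtain text to be segmented and preprocess the text to be segmented to obtain processed text.
A breakpoint identification unit 42, configured to calculate a breakpoint likelihood value between two adjacent characters in the processed text and identify breakpoint positions in the processed text according to the breakpoint likelihood value.
A segmentation processing unit 43, configured to segment the text to be segmented according to the identified breakpoint positions.
Optionally, the breakpoint identification unit 42 is specifically configured to calculate a first likelihood value that a breakpoint exists between two adjacent characters in the processed text.
Optionally, the breakpoint identification unit 42 is specifically configured to calculate a second likelihood value that no breakpoint exists between two adjacent characters in the processed text.
Optionally, the breakpoint identification unit 42 is specifically configured to calculate a first likelihood value that a breakpoint exists between two adjacent characters in the processed text, and to calculate a second likelihood value that no breakpoint exists between two adjacent characters in the processed text.
Optionally, the preprocessing unit 41 is specifically configured to: match a pre-built stop-word lexicon against the text to be segmented to determine characters that are in the text to be segmented and in the stop-word lexicon; and identify breakpoint positions in the text to be segmented according to the determined characters, taking the text to be segmented after the identification as the processed text.
Optionally, the apparatus further includes a position recording unit 44, configured to: after the processed text is obtained, record the position of the determined characters in the text to be segmented, so that the determined characters are skipped during breakpoint position identification.
Optionally, the breakpoint identification unit 42 is specifically configured to: take the start position of the processed text as the traversal start position and perform a first traversal process;
if a breakpoint position is identified, take the breakpoint position as the traversal start position and repeat the first traversal process; if a recorded position is reached, continue traversing from that position until the next recorded position is reached, take the next recorded position as the traversal start position, and repeat the first traversal process.
Optionally, the first traversal process includes: starting from the traversal start position, traversing each pair of adjacent characters; calculating a first likelihood value that a breakpoint exists between each traversed pair of adjacent characters, and/or calculating a second likelihood value that no breakpoint exists between each traversed pair of adjacent characters; and identifying, according to the calculation result, whether a breakpoint exists between each traversed pair of adjacent characters, until a breakpoint position is identified, a recorded position is reached, or the end position of the processed text is reached.
Optionally, the breakpoint identification unit 42 is specifically configured to:
segment the processed text according to the start position and end position of the processed text and the recorded positions, to obtain multiple sub-texts;
for each sub-text, take the start position of the sub-text as the traversal start position and perform a second traversal process;
if a breakpoint position is identified, take the breakpoint position as the traversal start position and repeat the second traversal process.
Optionally, the second traversal process includes: starting from the traversal start position, traversing each pair of adjacent characters; calculating a first likelihood value that a breakpoint exists between each traversed pair of adjacent characters, and/or calculating a second likelihood value that no breakpoint exists between each traversed pair of adjacent characters; and identifying, according to the calculation result, whether a breakpoint exists between each traversed pair of adjacent characters, until a breakpoint position is identified or the end position of the sub-text is reached.
Optionally, the breakpoint identification unit 42 is specifically configured to:
in the processed text, calculate the first likelihood value that a breakpoint exists between two adjacent characters; if the first likelihood value is less than or equal to a first preset threshold, determine that no breakpoint exists between the two adjacent characters; if the first likelihood value is greater than the first preset threshold, calculate the second likelihood value that no breakpoint exists between the two adjacent characters; if the second likelihood value is less than a second preset threshold, determine that a breakpoint exists between the two adjacent characters; if the second likelihood value is greater than or equal to the second preset threshold, determine that no breakpoint exists between the two adjacent characters;
or,
in the processed text, calculate the second likelihood value that no breakpoint exists between two adjacent characters; if the second likelihood value is greater than or equal to the second preset threshold, determine that no breakpoint exists between the two adjacent characters; if the second likelihood value is less than the second preset threshold, calculate the first likelihood value that a breakpoint exists between the two adjacent characters; if the first likelihood value is greater than the first preset threshold, determine that a breakpoint exists between the two adjacent characters; if the first likelihood value is less than or equal to the first preset threshold, determine that no breakpoint exists between the two adjacent characters.
Optionally, the breakpoint identification unit 42 is further specifically configured to:
obtain a first occurrence count of each of the two adjacent characters in a preset text corpus, and obtain a second occurrence count of the two adjacent characters occurring adjacently in the preset text corpus;
calculate the first likelihood value that a breakpoint exists between the two adjacent characters according to the first occurrence count of each of the two adjacent characters and the second occurrence count.
Optionally, the breakpoint identification unit 42 is further specifically configured to:
remove one of the two adjacent characters from the processed text to obtain a first text, and remove the other of the two adjacent characters from the processed text to obtain a second text;
calculate the second likelihood value that no breakpoint exists between the two adjacent characters according to the processed text, the first text, and the second text.
Optionally, the breakpoint identification unit 42 is further specifically configured to:
multiply the first occurrence counts of the two adjacent characters to obtain a product, and calculate the ratio of the product to the second occurrence count;
determine, according to the ratio, the first likelihood value that a breakpoint exists between the two adjacent characters.
Optionally, the breakpoint identification unit 42 is further specifically configured to:
determine a first distance between the processed text and the first text, and determine a second distance between the processed text and the second text;
take the average of the first distance and the second distance as the second likelihood value that no breakpoint exists between the two adjacent characters.
Note that the text segmentation apparatus in this embodiment can implement each process of the foregoing text segmentation method embodiments and achieve the same effects and functions, which are not repeated here.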
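The division of work among units 41, 42 and 43 can be pictured as three pluggable callables wired in sequence. The class and field names below are hypothetical and the stub functions exist only to make the sketch runnable; they do not reflect how the apparatus of FIG. 4 is actually built.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TextSegmentationDevice:
    """Illustrative composition of the three units described for FIG. 4."""

    preprocess: Callable[[str], str]                  # role of preprocessing unit 41
    identify_breakpoints: Callable[[str], List[int]]  # role of breakpoint identification unit 42
    split_at: Callable[[str, List[int]], List[str]]   # role of segmentation processing unit 43

    def run(self, text: str) -> List[str]:
        processed = self.preprocess(text)
        positions = self.identify_breakpoints(processed)
        # Segmentation is applied to the original text to be segmented.
        return self.split_at(text, positions)


def split_at(text: str, positions: List[int]) -> List[str]:
    cuts = sorted(set(positions) | {0, len(text)})
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]


# Stub units with fixed behaviour, for demonstration only.
device = TextSegmentationDevice(
    preprocess=lambda t: t,
    identify_breakpoints=lambda t: [2, 3, 4, 5],
    split_at=split_at,
)
result = device.run("这里有个小鸭子")
```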
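An embodiment of the present application further provides a computer device configured to perform the text segmentation method described above. FIG. 5 is a schematic structural diagram of the computer device provided by an embodiment of the present application. As shown in FIG. 5, computer devices may differ considerably depending on configuration or performance, and may include one or more processors 1001 and a memory 1002; the memory 1002 may store one or more application programs or data. The memory 1002 may be transient storage or persistent storage. An application program stored in the memory 1002 may include one or more modules (not shown), and each module may include a series of computer-executable instructions for the computer device. Further, the processor 1001 may be configured to communicate with the memory 1002 and to execute, on the computer device, the series of computer-executable instructions in the memory 1002. The computer device may also include one or more power supplies 1003, one or more wired or wireless network interfaces 1004, one or more input/output interfaces 1005, one or more keyboards 1006, and so on.
In a specific embodiment, the computer device includes: a processor; and a memory arranged to store computer-executable instructions, the computer-executable instructions being configured to be executed by the processor to implement the following flow:
obtaining text to be segmented, and preprocessing the text to be segmented to obtain processed text;
performing breakpoint position identification in the processed text, the breakpoint position identification including:
calculating a first likelihood value that a breakpoint exists between two adjacent characters in the processed text, and/or calculating a second likelihood value that no breakpoint exists between two adjacent characters in the processed text, and identifying breakpoint positions in the processed text according to the calculation result;
segmenting the text to be segmented according to the identified breakpoint positions.
Note that the computer device in this embodiment can implement each process of the foregoing text segmentation method embodiments and achieve the same effects and functions, which are not repeated here.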
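An embodiment of the present application further provides a storage medium for storing computer-executable instructions.
In a specific embodiment, the storage medium may be a USB flash drive, an optical disc, a hard disk, or the like. When executed by a processor, the computer-executable instructions stored in the storage medium can implement the following flow:
obtaining text to be segmented, and preprocessing the text to be segmented to obtain processed text;
performing breakpoint position identification in the processed text, the breakpoint position identification including:
calculating a first likelihood value that a breakpoint exists between two adjacent characters in the processed text, and/or calculating a second likelihood value that no breakpoint exists between two adjacent characters in the processed text, and identifying breakpoint positions in the processed text according to the calculation result;
segmenting the text to be segmented according to the identified breakpoint positions.
Note that the storage medium in this embodiment can implement each process of the foregoing text segmentation method embodiments and achieve the same effects and functions, which are not repeated here.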
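Specific embodiments of the present application have been described above. Other embodiments are within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some implementations, multitasking and parallel processing are also possible or may be advantageous.
Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the embodiments of the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-readable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (Central Processing Unit, CPU), input/output interfaces, network interfaces, and memory.
Memory may include non-persistent memory in computer-readable media, random access memory (Random Access Memory, RAM), and/or non-volatile memory, such as read-only memory (Read-Only Memory, ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include persistent and non-persistent, removable and non-removable media, which may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (Phase-change RAM, PRAM), static random access memory (Static RAM, SRAM), dynamic random access memory (Dynamic RAM, DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (Electrically Erasable Programmable ROM, EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms “include”, “comprise”, and any other variants thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase “including a ...” does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The embodiments of the present application may be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types. One or more embodiments of the present application may also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules may be located in both local and remote computer storage media, including storage devices.
The embodiments in the present application are described in a progressive manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the system embodiments are basically similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding parts of the method embodiments.
The above are merely embodiments of this document and are not intended to limit it. Various modifications and changes to this document are possible for those skilled in the art. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of this document shall fall within the scope of its claims.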

Claims (18)

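  1. A text segmentation method, comprising:
    obtaining text to be segmented, and preprocessing the text to be segmented to obtain processed text;
    calculating a breakpoint likelihood value between two adjacent characters in the processed text, and identifying breakpoint positions in the processed text according to the breakpoint likelihood value;
    segmenting the text to be segmented according to the identified breakpoint positions.
  2. The method according to claim 1, wherein calculating the breakpoint likelihood value between two adjacent characters in the processed text comprises:
    calculating a first likelihood value that a breakpoint exists between two adjacent characters in the processed text.
  3. The method according to claim 1, wherein calculating the breakpoint likelihood value between two adjacent characters in the processed text comprises:
    calculating a second likelihood value that no breakpoint exists between two adjacent characters in the processed text.
  4. The method according to claim 1, wherein calculating the breakpoint likelihood value between two adjacent characters in the processed text comprises:
    calculating a first likelihood value that a breakpoint exists between two adjacent characters in the processed text; and
    calculating a second likelihood value that no breakpoint exists between two adjacent characters in the processed text.
  5. The method according to claim 1, wherein preprocessing the text to be segmented to obtain the processed text comprises:
    matching a pre-built stop-word lexicon against the text to be segmented to determine characters that are in the text to be segmented and in the stop-word lexicon;
    identifying breakpoint positions in the text to be segmented according to the determined characters, and taking the text to be segmented after the identification as the processed text.
  6. The method according to claim 1, wherein, after the processed text is obtained, the method further comprises:
    recording the position of the determined characters in the text to be segmented, so that the determined characters are skipped during the breakpoint position identification.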
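  7. The method according to claim 6, wherein calculating the breakpoint likelihood value between two adjacent characters in the processed text and identifying breakpoint positions in the processed text according to the breakpoint likelihood value comprises:
    taking the start position of the processed text as the traversal start position and performing a first traversal process;
    if a breakpoint position is identified, taking the breakpoint position as the traversal start position and repeating the first traversal process, until the processed text has been fully processed;
    if a recorded position is reached, continuing to traverse from that position until the next recorded position is reached, taking the next recorded position as the traversal start position, and repeating the first traversal process, until the processed text has been fully processed.
  8. The method according to claim 7, wherein the first traversal process comprises: starting from the traversal start position, traversing each pair of adjacent characters; calculating a first likelihood value that a breakpoint exists between each traversed pair of adjacent characters, and/or calculating a second likelihood value that no breakpoint exists between each traversed pair of adjacent characters; and identifying, according to the calculation result, whether a breakpoint exists between each traversed pair of adjacent characters, until a breakpoint position is identified, a recorded position is reached, or the end position of the processed text is reached.
  9. The method according to claim 7, wherein calculating the breakpoint likelihood value between two adjacent characters in the processed text and identifying breakpoint positions in the processed text according to the breakpoint likelihood value comprises:
    segmenting the processed text according to the start position and end position of the processed text and the recorded positions, to obtain multiple sub-texts;
    for each sub-text, taking the start position of the sub-text as the traversal start position and performing a second traversal process;
    if a breakpoint position is identified, taking the breakpoint position as the traversal start position and repeating the second traversal process, until the corresponding sub-text has been fully traversed.
  10. The method according to claim 9, wherein the second traversal process comprises: starting from the traversal start position, traversing each pair of adjacent characters; calculating a first likelihood value that a breakpoint exists between each traversed pair of adjacent characters, and/or calculating a second likelihood value that no breakpoint exists between each traversed pair of adjacent characters; and identifying, according to the calculation result, whether a breakpoint exists between each traversed pair of adjacent characters, until a breakpoint position is identified or the end position of the sub-text is reached.
  11. The method according to claim 1, wherein calculating the breakpoint likelihood value between two adjacent characters in the processed text and identifying breakpoint positions in the processed text according to the breakpoint likelihood value comprises:
    in the processed text, calculating a first likelihood value that a breakpoint exists between two adjacent characters; if the first likelihood value is less than or equal to a first preset threshold, determining that no breakpoint exists between the two adjacent characters; if the first likelihood value is greater than the first preset threshold, calculating a second likelihood value that no breakpoint exists between the two adjacent characters; if the second likelihood value is less than a second preset threshold, determining that a breakpoint exists between the two adjacent characters; if the second likelihood value is greater than or equal to the second preset threshold, determining that no breakpoint exists between the two adjacent characters;
    or,
    in the processed text, calculating a second likelihood value that no breakpoint exists between two adjacent characters; if the second likelihood value is greater than or equal to a second preset threshold, determining that no breakpoint exists between the two adjacent characters; if the second likelihood value is less than the second preset threshold, calculating a first likelihood value that a breakpoint exists between the two adjacent characters; if the first likelihood value is greater than a first preset threshold, determining that a breakpoint exists between the two adjacent characters; if the first likelihood value is less than or equal to the first preset threshold, determining that no breakpoint exists between the two adjacent characters.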
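  12. The method according to claim 2, wherein calculating the first likelihood value that a breakpoint exists between two adjacent characters comprises:
    obtaining a first occurrence count of each of the two adjacent characters in a preset text corpus, and obtaining a second occurrence count of the two adjacent characters occurring adjacently in the preset text corpus;
    calculating the first likelihood value that a breakpoint exists between the two adjacent characters according to the first occurrence count of each of the two adjacent characters and the second occurrence count.
  13. The method according to claim 3, wherein calculating the second likelihood value that no breakpoint exists between two adjacent characters comprises:
    removing one of the two adjacent characters from the processed text to obtain a first text, and removing the other of the two adjacent characters from the processed text to obtain a second text;
    calculating the second likelihood value that no breakpoint exists between the two adjacent characters according to the processed text, the first text, and the second text.
  14. The method according to claim 12, wherein calculating the first likelihood value that a breakpoint exists between the two adjacent characters according to the first occurrence count of each of the two adjacent characters and the second occurrence count comprises:
    multiplying the first occurrence counts of the two adjacent characters to obtain a product, and calculating the ratio of the product to the second occurrence count;
    determining, according to the ratio, the first likelihood value that a breakpoint exists between the two adjacent characters.
  15. The method according to claim 13, wherein calculating the second likelihood value that no breakpoint exists between the two adjacent characters according to the processed text, the first text, and the second text comprises:
    determining a first distance between the processed text and the first text, and determining a second distance between the processed text and the second text;
    taking the average of the first distance and the second distance as the second likelihood value that no breakpoint exists between the two adjacent characters.
  16. A text segmentation apparatus, comprising:
    a preprocessing unit, configured to obtain text to be segmented and preprocess the text to be segmented to obtain processed text;
    a breakpoint identification unit, configured to calculate a breakpoint likelihood value between two adjacent characters in the processed text and identify breakpoint positions in the processed text according to the breakpoint likelihood value;
    a segmentation processing unit, configured to segment the text to be segmented according to the identified breakpoint positions.
  17. A computer device, comprising:
    a processor; and
    a memory arranged to store computer-executable instructions, the computer-executable instructions being configured to be executed by the processor and being used to perform the steps of the method according to any one of claims 1 to 15.
  18. A storage medium, the storage medium being used to store computer-executable instructions, the computer-executable instructions causing a computer to perform the steps of the method according to any one of claims 1 to 15.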
PCT/CN2023/100021 2022-07-07 2023-06-13 文本分词方法、装置、计算机设备及存储介质 WO2024007827A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP23834601.9A EP4379599A1 (en) 2022-07-07 2023-06-13 Word segmentation method and apparatus for text, and computer device and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210795690.9 2022-07-07
CN202210795690.9A CN117408248A (zh) 2022-07-07 2022-07-07 文本分词方法、装置、计算机设备及存储介质

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/585,952 Continuation-In-Part US20240193353A1 (en) 2022-07-07 2024-02-23 Text segmentation method, computer device and storage medium

Publications (1)

Publication Number Publication Date
WO2024007827A1 true WO2024007827A1 (zh) 2024-01-11

Family

ID=89454167

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/100021 WO2024007827A1 (zh) 2022-07-07 2023-06-13 文本分词方法、装置、计算机设备及存储介质

Country Status (3)

Country Link
EP (1) EP4379599A1 (zh)
CN (1) CN117408248A (zh)
WO (1) WO2024007827A1 (zh)

Also Published As

Publication number Publication date
EP4379599A1 (en) 2024-06-05
CN117408248A (zh) 2024-01-16

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23834601

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2023834601

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2023834601

Country of ref document: EP

Effective date: 20240229