WO2020232881A1 - Text word segmentation method and apparatus - Google Patents
- Publication number
- WO2020232881A1 (application PCT/CN2019/103069)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- word segmentation
- text
- segmentation result
- processed
- entries
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Definitions
- This application relates to the technical field of natural language processing, and in particular to a method and device for text segmentation.
- voice recognition refers to the decoding of voice signals into text information
- natural language processing refers to semantic analysis based on text information to obtain the user's request intention, so as to meet the user's functional needs.
- Chinese word segmentation is an important step in natural speech understanding, and its accuracy directly affects the performance of human-computer interaction products.
- word segmentation refers to the segmentation of sentences into individual words, which is the process of recombining consecutive sentences into word sequences according to certain specifications.
- take Chinese word segmentation technology as an example.
- the goal of word segmentation technology is to segment a sentence into individual Chinese words.
- the terminal converts the above voice information to obtain the text to be processed. Then, the terminal combines the character string in the text to be processed with a preset dictionary library according to a certain strategy. If an entry is found in the preset dictionary library, it means that the matching is successful. At this time, the entry is obtained, and then the word segmentation result of the text to be processed can be obtained.
- the segmentation results are not accurate enough due to the roughness and randomness of the segmentation process.
- the inaccuracy of the word segmentation results referred to here means the following: in the process of segmenting the text to be processed according to a certain strategy, there are multiple word segmentation methods, and different methods can produce different word segmentation results. In an ideal state, only one of these multiple word segmentation results is the best.
- for example, the entries collected in the preset dictionary library include: South, Southern City, City, City, Nanjing. In this case, the word segmentation result of the text to be processed can include: South City/City/Nanjing; it can also include: South/City/Nanjing, and the best word segmentation result in an ideal state should be: South/City/Nanjing.
- the embodiments of the present application provide a text word segmentation method and device, which can improve the accuracy of the terminal for word segmentation of the text to be processed.
- an embodiment of the present application provides a method for text segmentation, which includes:
- an embodiment of the present application provides a text word segmentation device.
- the text word segmentation device includes a unit for executing the method of the first aspect.
- the text word segmentation device includes:
- the obtaining unit is used to obtain the text to be processed
- the first word segmentation unit is configured to segment the to-be-processed text in the first direction according to the word segmentation strategy for string matching to obtain the first word segmentation result;
- the second word segmentation unit is configured to segment the to-be-processed text in the second direction according to the word segmentation strategy for string matching to obtain a second word segmentation result;
- the output unit is configured to output the first word segmentation result or the second word segmentation result when the first word segmentation result is consistent with the second word segmentation result.
- an embodiment of the present application provides another terminal, including a processor configured to call stored program instructions to execute the method of the first aspect.
- an embodiment of the present application provides a computer-readable storage medium that stores a computer program.
- the computer program includes program instructions that, when executed by a processor, cause the processor to execute the method of the first aspect described above.
- the terminal performs two word segmentation operations on the text to be processed instead of performing rough word segmentation on it, which avoids the randomness of the rough word segmentation process in the prior art and improves the accuracy of the terminal's word segmentation of the text to be processed.
- FIG. 1 is a schematic flowchart of a method for text segmentation provided by an embodiment of the present application
- FIG. 2 is a schematic flowchart of a text word segmentation method provided by another embodiment of the present application.
- FIG. 3A is a schematic diagram of multiple individual characters obtained after splitting the text to be processed according to an embodiment of the present application
- FIG. 3B is a schematic diagram of a directed acyclic graph provided by an embodiment of the present application.
- FIG. 3C is a schematic diagram of another directed acyclic graph provided by an embodiment of the present application.
- FIG. 4A is a schematic block diagram of a text word segmentation device provided by an embodiment of the present application.
- FIG. 4B is a schematic block diagram of another text word segmentation device provided by an embodiment of the present application.
- FIG. 5 is a schematic block diagram of a terminal according to another embodiment of the present application.
- the terminals described in the embodiments of the present application include but are not limited to other portable devices such as mobile phones, laptop computers, or tablet computers with touch-sensitive surfaces (for example, touch screen displays and/or touch pads). It should also be understood that, in some embodiments, the device is not a portable communication device, but a desktop computer with a touch-sensitive surface (e.g., touch screen display and/or touch pad).
- the terminal including a display and a touch-sensitive surface is described.
- the terminal may include one or more other physical user interface devices such as a physical keyboard, mouse, and/or joystick.
- the terminal supports various applications, such as one or more of the following: drawing applications, presentation applications, word processing applications, website creation applications, disk burning applications, spreadsheet applications, game applications, telephone applications, video conferencing applications, email applications, instant messaging applications, exercise support applications, photo management applications, digital camera applications, web browsing applications, digital music player applications, and/or digital video player applications.
- Various application programs that can be executed on the terminal can use at least one common physical user interface device such as a touch-sensitive surface.
- one or more functions of the touch-sensitive surface and the corresponding information displayed on the terminal can be adjusted and/or changed between applications and/or within corresponding applications.
- the common physical architecture of the terminal (for example, a touch-sensitive surface)
- the terminal obtains the to-be-processed text according to the voice signal of the speaking user.
- the terminal first obtains the voice signal of the speaking user, and then converts the obtained voice signal of the speaking user into text information, and obtains the text to be processed from the text information.
- the terminal may use voice recognition technology to convert the voice signal of the speaking user into text information, and then obtain the text to be processed from the text information.
- the terminal may directly receive the text information corresponding to the user's voice signal from the voice recognition device, and obtain the text to be processed from the text information.
- the speaking users involved here may include: users who speak and emit voice signals in a simultaneous translation scene, and/or users who generate voice signals through a terminal; for example, the voice signal of the speaking user is received through a microphone or another voice collection device.
- the terminal may obtain the text to be processed according to the text input by the user.
- for example, text entered by users in instant messaging, office documents, and other scenarios.
- the text to be processed may be "Beijing college students drink imported red wine", or "Nanjing, a southern city”, etc., which is not specifically limited in the embodiment of the application.
- Step S102: Perform word segmentation on the to-be-processed text along the first direction according to the word segmentation strategy for string matching to obtain a first word segmentation result.
- segmenting the text to be processed along the first direction according to the word segmentation strategy for string matching to obtain the first word segmentation result includes:
- the first character is taken as the current character, and the entry consisting of the current character and the M characters adjacent to it is matched against the entries in a preset dictionary library to obtain the entry beginning with the current character, thereby obtaining the first word segmentation result; where M is greater than or equal to 1 and less than or equal to Q, and Q is the number of characters in the text to be processed.
- the preset expression forms of the dictionary library include but are not limited to those shown in Table 1:
- the weight corresponding to an entry represents the probability of the entry in a specific application scenario, and the greater the weight, the greater the probability of the entry. Then, in the process of determining the word segmentation result of the text to be processed, in the case where the word segmentation result has multiple manifestations, the term with a high weight is selected as the word segmentation result.
- the terminal when determining the word segmentation result of the text "Beijing University students drinking imported red wine” to be processed, the terminal preferably uses "Beijing" as the word segmentation result.
- the preset dictionary library contains as much as possible all the entries that may appear in a specific application scenario. Through this implementation, it is possible to avoid the occurrence of unmatched word segmentation results.
- the entries in the preset dictionary library are arranged in the order of weight.
- the terminal can sort the entries in the preset dictionary library in order of weight. For example, in one possible implementation, the terminal arranges the entries in the preset dictionary library in descending order of weight; in another possible implementation, the terminal arranges the entries in the preset dictionary library in ascending order of weight.
- Table 2 is an expression form of a preset dictionary library provided by an embodiment of this application, wherein the entries in the preset dictionary library are arranged in descending order of weight.
- the weight corresponding to an entry represents the probability of the entry in a specific application scenario. The greater the weight, the greater the probability of the entry.
- the terminal uses the first character as the current character, and matches the entry composed of the current character and the adjacent M characters against the entries in the preset dictionary library whose weight is greater than the preset weight, to obtain the entry beginning with the current character. Take the text "Beijing University Students Drinking Imported Red Wine" as an example: for the first character "North", the preset dictionary library contains two entries starting with the character "North", including "Beijing".
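- As a minimal sketch of the weight-ordered dictionary and the preset-weight filter described above, the lookup might be written as follows. The Chinese strings, the weights, and the names `sort_by_weight` and `candidates_from` are illustrative assumptions, because the tables of the preset dictionary library are not reproduced in this text.

```python
# Hypothetical dictionary: entry -> weight (a larger weight means a more probable entry).
# The entries and weights are assumptions for illustration, not the patent's tables.
DICTIONARY = {"北京": 9, "北京大学": 5, "大学生": 8, "生": 1, "喝": 5}


def sort_by_weight(dictionary, descending=True):
    """Return (entry, weight) pairs ordered by weight."""
    return sorted(dictionary.items(), key=lambda item: item[1], reverse=descending)


def candidates_from(text, start, dictionary, preset_weight):
    """Entries heavier than the preset weight that begin at position `start` of `text`."""
    found = []
    for entry, weight in sort_by_weight(dictionary):
        if weight <= preset_weight:
            break  # entries are in descending order, so the rest are too light
        if text.startswith(entry, start):
            found.append((entry, weight))
    return found


text = "北京大学生喝进口红酒"  # assumed original of the example sentence
above_threshold = candidates_from(text, 0, DICTIONARY, preset_weight=3)
```

- Because the entries are scanned in descending order of weight, the loop can stop at the first entry that does not exceed the preset weight, which is one way the ordering could reduce unnecessary matching.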
- the setting of the preset weight is diversified.
- the setting of the preset weight may be different or the same, which is not specifically limited in the embodiment of the present application.
- the entries included in the aforementioned preset dictionary library are different, so that the blindness of the terminal in the matching process can be reduced.
- the expression form of the preset dictionary library can be as shown in Table 3:
- the first direction may be from left to right or from right to left, which is not specifically limited in the embodiment of the present application.
- the first direction is from left to right as an example for description.
- the terminal determines that the first character of the to-be-processed text "Beijing University Students Drinking Imported Red Wine" is "North", and uses the character "North" as the current character.
- each character in the text to be processed can be used as the current character, and the first segmentation result of the text to be processed can be obtained by repeating the above operations (for example, grouping words, matching).
- the terminal uses the word segmentation method described above to segment the text "Beijing college students drinking imported red wine" in the first direction
- the first word segmentation results obtained are: "Beijing", "college students", "drink", "red wine".
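- The left-to-right pass can be sketched as below. This is only one possible reading: the text does not fix how a single entry is chosen when several entries begin with the current character, so the sketch assumes the heaviest entry wins and ties go to the longer entry. The dictionary contents, the weights, and the name `segment_forward` are hypothetical.

```python
# Hypothetical dictionary: entry -> weight (larger = more probable).
# Entries and weights are illustrative assumptions, not the patent's Table 1.
DICTIONARY = {
    "北京": 9, "北京大学": 5, "大学生": 8, "生": 1, "喝": 5,
    "进口": 7, "进": 2, "口红": 3, "红酒": 8, "酒": 2,
}


def segment_forward(text, dictionary=DICTIONARY):
    """Scan left to right, taking the heaviest entry that starts at the current character."""
    max_len = max(len(entry) for entry in dictionary)
    result = []
    pos = 0
    while pos < len(text):
        # Group the current character with its following characters and keep dictionary hits.
        options = [
            text[pos:pos + size]
            for size in range(1, min(max_len, len(text) - pos) + 1)
            if text[pos:pos + size] in dictionary
        ]
        # Assumed rule: highest weight first, longer entry on a tie;
        # a character with no matching entry becomes a single-character token.
        best = max(options, key=lambda e: (dictionary[e], len(e))) if options else text[pos]
        result.append(best)
        pos += len(best)
    return result


first_result = segment_forward("北京大学生喝进口红酒")
```

- With these assumed weights, `first_result` is `["北京", "大学生", "喝", "进口", "红酒"]`; a plain longest-match rule would instead start with "北京大学", which illustrates why the selection rule matters.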
- Step S104: Perform word segmentation on the to-be-processed text in a second direction according to the word segmentation strategy for string matching to obtain a second word segmentation result.
- segmenting the text to be processed in the second direction according to the word segmentation strategy for string matching to obtain the second word segmentation result includes:
- the first character is taken as the current character, and the entry consisting of the current character and the N characters adjacent to it is matched against the entries in a preset dictionary library to obtain the entry beginning with the current character, thereby obtaining the second word segmentation result; where N is greater than or equal to 1 and less than or equal to Q, and Q is the number of characters in the text to be processed.
- the second direction involved here may be the same as the first direction or opposite to the first direction.
- the same word segmentation operation is performed twice for the text to be processed, which can avoid randomness in determining the word segmentation result during the word segmentation process.
- the randomness is reflected in the uncertainty in the terminal matching process when the preset dictionary library contains multiple entries with a certain character as the current character.
- the terminal can obtain the word segmentation result of the to-be-processed text again according to the dynamic programming algorithm.
- the accuracy of the word segmentation result is better than that of the word segmentation result obtained when the first direction is the same as the second direction.
- the following will take the second direction as the opposite direction of the first direction as an example for specific explanation:
- the first direction is from left to right, at this time, the second direction is from right to left.
- the terminal determines that the first character of the to-be-processed text "Beijing university students drink imported red wine" is "wine", and uses "wine" as the current character.
- each character in the text to be processed can be used as the current character, and the second segmentation result of the text to be processed can be obtained by repeating the above operations (for example, word grouping, matching).
- the second word segmentation results obtained can be: "Beijing", "college students", "drink", "red wine".
- the second word segmentation result can also be: "Peking University", "Sheng", "Drink", "Red Wine".
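- A right-to-left pass can be sketched in the same spirit. As before, the dictionary, its weights, the tie-breaking rule (heaviest entry, then longest), and the name `segment_backward` are assumptions made for illustration; the text itself does not specify them.

```python
# Hypothetical dictionary: entry -> weight (larger = more probable); illustrative only.
DICTIONARY = {
    "北京": 9, "北京大学": 5, "大学生": 8, "生": 1, "喝": 5,
    "进口": 7, "进": 2, "口红": 3, "红酒": 8, "酒": 2,
}


def segment_backward(text, dictionary=DICTIONARY):
    """Scan right to left, taking the heaviest entry that ends at the current character."""
    max_len = max(len(entry) for entry in dictionary)
    result = []
    end = len(text)
    while end > 0:
        # Group the current (last unconsumed) character with the characters before it.
        options = [
            text[end - size:end]
            for size in range(1, min(max_len, end) + 1)
            if text[end - size:end] in dictionary
        ]
        # Assumed rule: highest weight first, longer entry on a tie;
        # a character with no matching entry becomes a single-character token.
        best = max(options, key=lambda e: (dictionary[e], len(e))) if options else text[end - 1]
        result.append(best)
        end -= len(best)
    result.reverse()  # entries were collected from the end of the text
    return result


text = "北京大学生喝进口红酒"  # assumed original of the example sentence
second_result = segment_backward(text)

# Dropping one entry from the dictionary changes the outcome of the same scan.
reduced = {k: v for k, v in DICTIONARY.items() if k != "大学生"}
alternative_result = segment_backward(text, reduced)
```

- With the full hypothetical dictionary the scan yields 北京/大学生/喝/进口/红酒, while the reduced dictionary yields 北京大学/生/喝/进口/红酒, mirroring the two possible second results mentioned above.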
- Step S106: If the first word segmentation result is consistent with the second word segmentation result, output the first word segmentation result or the second word segmentation result.
- the terminal may determine whether the first word segmentation result is consistent with the second word segmentation result by comparing one by one.
- the terminal performs word segmentation along the first direction on the to-be-processed text "Beijing college students drink imported red wine", and the first word segmentation results obtained are: "Beijing", "college students", "drink", "red wine".
- the terminal performs word segmentation along the second direction on the to-be-processed text "Beijing University Students Drinking Imported Red Wine", and the second word segmentation results obtained are: "Beijing", "College Student", "Drink", and "Red Wine".
- the terminal uses a one-by-one comparison to determine that the first word segmentation result is consistent with the second word segmentation result. In this case, the terminal can output either the first word segmentation result or the second word segmentation result.
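- The one-by-one comparison of step S106 can be sketched as follows. The function names are hypothetical, and returning `None` to signal "not consistent" is an assumption of this sketch rather than something the text prescribes.

```python
from typing import List, Optional


def consistent(first: List[str], second: List[str]) -> bool:
    """Compare two word segmentation results entry by entry."""
    if len(first) != len(second):
        return False
    return all(a == b for a, b in zip(first, second))


def output_if_consistent(first: List[str], second: List[str]) -> Optional[List[str]]:
    """Return the agreed result, or None to signal that the results differ."""
    return first if consistent(first, second) else None


agreed = output_if_consistent(
    ["Beijing", "college students", "drink", "red wine"],
    ["Beijing", "college students", "drink", "red wine"],
)
disputed = output_if_consistent(
    ["Beijing", "college students", "drink", "red wine"],
    ["Peking University", "Sheng", "drink", "red wine"],
)
```

- Here `agreed` holds the shared result, while `disputed` is `None`, which a caller could treat as the cue to fall back to another segmentation method.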
- the terminal outputs the word segmentation result, which means that the terminal can better understand the sentence meaning of the speaking user.
- the word segmentation result may be output in the following forms: the terminal displays the word segmentation result of the text to be processed on the display screen, or the terminal outputs the word segmentation result of the text to be processed by voice broadcast. During the voice broadcast, there is a pause between each word segmentation result so that users can better understand the word segmentation result.
- the terminal can better determine the economic status of the speaking user based on the word segmentation result (for example, the speaking user can repay the arrears, the speaking user cannot repay the arrears, etc.), and after obtaining the user's economic status, the collector can make a reasonable decision according to it to improve the collection effect.
- the terminal performs two word segmentation operations on the text to be processed instead of performing rough word segmentation on it, which avoids the randomness of rough word segmentation in the prior art and improves the accuracy of the terminal's word segmentation of the text to be processed.
- the number of texts to be processed is often more than one; in most cases there are several.
- the terminal may segment the second text to be processed based on the word segmentation result of the first text to be processed; that is, the terminal combines the context to segment the text to be processed, so as to improve the accuracy of word segmentation of the text to be processed by the terminal.
- the terminal can combine a deep learning algorithm to determine the word segmentation result of the second text to be processed.
- the terminal combining the deep learning algorithm to determine the word segmentation result of the second to-be-processed text may include: training the model according to the first word segmentation result to obtain a trained deep learning model; then, processing the second text to be processed with the trained deep learning model to obtain the second word segmentation result.
- the deep learning algorithm may include, but is not limited to, a deep learning neural network model (deep neural network, DNN), a long short-term memory network model (LSTM, Long Short-Term Memory), and so on.
- DNN deep learning neural network
- LSTM Long Short-Term Memory
- the LSTM model uses input gates, output gates, forget gates, and cell structures to control the learning and forgetting of historical information, making the model suitable for processing long-sequence problems.
- the model determines the word segmentation result of the next text to be processed based on the word segmentation results of N historical texts to be processed in the same application scenario.
- N is a positive integer greater than 0, and the entries determined by matching in the N consecutive historical texts to be processed are also applicable to the matching of entries in the next text to be processed. This implementation can improve the accuracy and efficiency of word segmentation.
- in addition to step S106, the terminal may also perform step S108.
- the following describes in detail how the embodiment of this application implements word segmentation for the text to be processed in conjunction with the text segmentation method shown in FIG. 2.
- step S108 is elaborated below:
- Step S108: If the first word segmentation result is inconsistent with the second word segmentation result, perform word segmentation on the text to be processed through a dynamic programming algorithm to obtain a third word segmentation result.
- the terminal performs word segmentation along the first direction on the to-be-processed text "Beijing college students drink imported red wine", and the first word segmentation results obtained are: "Beijing", "college students", "drink", "red wine";
- the terminal performs word segmentation on the processing text "Beijing University Students Drinking Imported Red Wine” in the second direction, and the second word segmentation results obtained are: “Peking University”, “Sheng”, “Drink”, “Red Wine” .
- the terminal uses a one-by-one comparison method to determine that the first word segmentation result is inconsistent with the second word segmentation result. In this case, it means that there is an ambiguous field.
- the terminal uses a dynamic programming algorithm to perform word segmentation on the above to-be-processed text "Beijing college students drink imported red wine" to obtain the third word segmentation result.
- the segmentation of the text to be processed by a dynamic programming algorithm to obtain the third segmentation result includes:
- a directed acyclic graph is constructed according to the relevance of adjacent characters in the multiple individual characters; wherein the directed acyclic graph includes multiple paths, and each of the multiple paths includes entries and the weights corresponding to the entries;
- the terminal splits the to-be-processed text to obtain multiple individual characters as shown in FIG. 3A, where each character can represent a node.
- the terminal constructs a directed acyclic graph according to the relevance of adjacent characters among the multiple individual characters.
- the relevance of adjacent characters mentioned here means that two adjacent characters can form an entry.
- the entries that can be composed from the character "North" are: "Beijing", "Peking University", and "Beijing University Student".
- the directed acyclic graph constructed by the terminal on the text to be processed by the dynamic programming algorithm may be as shown in FIG. 3B.
- the directed acyclic graph includes multiple paths as shown below, and each path includes an entry and the corresponding weight of the entry:
- Path 1: Beijing (4) - college students (5) - drink (5) - import (4) - red wine (6);
- Path 2: Peking University (4) - Health (6) - Drink (5) - Import (4) - Red Wine (6);
- Path 3: Beijing (4) - college students (5) - drink (5) - enter (2) - lipstick (8) - wine (2).
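- The graph construction and the listed paths can be sketched as below. The weights are the ones shown for the three paths; the Chinese strings are an assumed reconstruction of the entries behind the English glosses, and the sketch uses text positions rather than characters as graph nodes, which is a simplification of FIG. 3A and FIG. 3B. The names `build_dag` and `all_paths` are hypothetical.

```python
# Entry -> weight, taken from the three paths listed above; the Chinese strings
# are an assumed reconstruction of the entries behind the English glosses.
WEIGHTS = {
    "北京": 4, "大学生": 5, "喝": 5, "进口": 4, "红酒": 6,
    "北京大学": 4, "生": 6, "进": 2, "口红": 8, "酒": 2,
}


def build_dag(text, weights):
    """Map each start index to the (end index, entry) edges that leave it."""
    dag = {start: [] for start in range(len(text))}
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            if text[start:end] in weights:
                dag[start].append((end, text[start:end]))
    return dag


def all_paths(dag, length, start=0):
    """Yield every sequence of entries that covers the text from `start` to the end."""
    if start == length:
        yield []
        return
    for end, entry in dag[start]:
        for rest in all_paths(dag, length, end):
            yield [entry] + rest


text = "北京大学生喝进口红酒"
dag = build_dag(text, WEIGHTS)
# Path (joined with "-") -> sum of the weights of its entries.
paths = {"-".join(p): sum(WEIGHTS[e] for e in p) for p in all_paths(dag, len(text))}
```

- Exhaustive enumeration over this assumed dictionary gives weight sums of 24, 25, and 26 for the three listed paths, plus a fourth combination (北京大学-生-喝-进-口红-酒, sum 27) that the list above does not mention.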
- after obtaining multiple paths, the terminal determines the weight sum of all entries on each path.
- the terminal determines that the sum of the weights of all the entries on path 1 is 24 (4 + 5 + 5 + 4 + 6), that the sum on path 2 is 25, and that the sum on path 3 is 26.
- after the terminal determines the weight sum of all the entries on each path in the directed acyclic graph, the terminal determines the entries on the path with the smallest weight sum as the third word segmentation result.
- the terminal sequentially compares the weight sum of path 1 with the weight sums of path 2 and path 3, and determines that the weight sum of path 1 is the minimum of the three. In this case, the terminal determines the entries on path 1 as the third word segmentation result; that is, the third word segmentation result of the terminal for the above-mentioned text to be processed "Beijing university students drink imported red wine" is: "Beijing", "University students", "Drink", "Red Wine".
- the terminal determines the word segmentation result of the text to be processed through the dynamic programming algorithm and the minimum path principle, which can avoid ambiguous fields and thus improve the accuracy of the terminal's word segmentation of the text to be processed.
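- The minimum path principle can be sketched as a dynamic program over text positions, which finds the smallest weight sum without enumerating every path. This is a sketch under stated assumptions, not the patented implementation: the Chinese strings are an assumed reconstruction of the example entries, the weights are those quoted for the paths, and `min_weight_segmentation` is a hypothetical name.

```python
# Entry -> weight as quoted for the example paths; Chinese strings are assumed.
WEIGHTS = {
    "北京": 4, "大学生": 5, "喝": 5, "进口": 4, "红酒": 6,
    "北京大学": 4, "生": 6, "进": 2, "口红": 8, "酒": 2,
}


def min_weight_segmentation(text, weights):
    """Return (entries, weight sum) of the covering path with the smallest weight sum."""
    inf = float("inf")
    best = [inf] * (len(text) + 1)   # best[i]: smallest weight sum covering text[:i]
    back = [None] * (len(text) + 1)  # back[i]: start index of the last entry in that cover
    best[0] = 0
    for end in range(1, len(text) + 1):
        for start in range(end):
            entry = text[start:end]
            if entry in weights and best[start] + weights[entry] < best[end]:
                best[end] = best[start] + weights[entry]
                back[end] = start
    if best[-1] == inf:
        raise ValueError("text cannot be covered by dictionary entries")
    # Walk the back-pointers from the end of the text to recover the entries.
    entries, end = [], len(text)
    while end > 0:
        entries.append(text[back[end]:end])
        end = back[end]
    return entries[::-1], best[-1]


third_result, weight_sum = min_weight_segmentation("北京大学生喝进口红酒", WEIGHTS)
```

- For the example sentence this selects 北京/大学生/喝/进口/红酒 with a weight sum of 24, matching path 1.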
- the first word segmentation result is all the entries on the first path in the directed acyclic graph obtained after the text to be processed is segmented by the dynamic programming algorithm
- the second word segmentation result is all the entries on the second path in the directed acyclic graph
- the segmentation of the text to be processed by the dynamic programming algorithm to obtain the third segmentation result includes:
- the terminal uses a dynamic programming algorithm to construct a directed acyclic graph of the text to be processed, as shown in Figure 3C.
- the terminal determines that the first word segmentation result is all the entries on path 1 in the directed acyclic graph, and the second word segmentation result is all the entries on path 2 in the directed acyclic graph. In this case, the terminal determines the third word segmentation result from the first word segmentation result and the second word segmentation result, which can improve the word segmentation efficiency of the terminal for the text to be processed.
- the terminal calculates the weight sum of path 1 in FIG. 3C as: 24; the terminal calculates the weight sum of path 2 in FIG. 3C as: 25.
- the terminal judges that the weight sum of the first word segmentation result is less than the weight sum of the second word segmentation result, and at this time, the terminal outputs the first word segmentation result. That is, the terminal determines that the word segmentation result of the text to be processed is: "Beijing”, “College Student”, “Drink”, and "Red Wine”.
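- When the two candidate results already correspond to two paths of the graph, only their weight sums need to be compared. A minimal sketch follows, using the English glosses and the weights quoted for paths 1 and 2; the function name is hypothetical, and keeping the first result on a tie is an assumption, since the text only covers the "less than" and "greater than" cases.

```python
# Weights quoted for the entries on paths 1 and 2 (English glosses used as keys).
WEIGHTS = {
    "Beijing": 4, "college students": 5, "drink": 5, "import": 4, "red wine": 6,
    "Peking University": 4, "Sheng": 6,
}


def pick_by_weight_sum(first, second, weights):
    """Return (result, weight sum) for whichever candidate has the smaller weight sum."""
    first_sum = sum(weights[entry] for entry in first)
    second_sum = sum(weights[entry] for entry in second)
    # Assumption: a tie keeps the first result; the text does not define this case.
    return (first, first_sum) if first_sum <= second_sum else (second, second_sum)


path_1 = ["Beijing", "college students", "drink", "import", "red wine"]
path_2 = ["Peking University", "Sheng", "drink", "import", "red wine"]
chosen, chosen_sum = pick_by_weight_sum(path_1, path_2, WEIGHTS)
```

- This skips the construction of every other path, which is the efficiency gain the passage refers to.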
- the terminal's word segmentation efficiency for the text to be processed can also be improved.
- the text segmentation device 40 may include: an acquisition unit 400, a first segmentation unit 402, a second segmentation unit 404, and an output unit 406;
- the obtaining unit 400 is used to obtain the text to be processed
- the first word segmentation unit 402 is configured to segment the to-be-processed text in the first direction according to the word segmentation strategy for string matching to obtain the first word segmentation result;
- the second word segmentation unit 404 is configured to segment the to-be-processed text in the second direction according to the word segmentation strategy for string matching to obtain a second word segmentation result;
- the output unit 406 is configured to output the first word segmentation result or the second word segmentation result when the first word segmentation result is consistent with the second word segmentation result.
- the text word segmentation device 40 further includes: a third word segmentation unit 408;
- the third word segmentation unit 408 is configured to perform word segmentation on the to-be-processed text through a dynamic programming algorithm when the first word segmentation result is inconsistent with the second word segmentation result to obtain a third word segmentation result.
- the third word segmentation unit 408 includes: a segmentation unit, a construction unit, a first determination unit, and a second determination unit; wherein,
- the splitting unit is configured to split the to-be-processed text to obtain multiple individual characters
- the construction unit is configured to construct a directed acyclic graph according to the relevance of adjacent characters in the plurality of individual characters; wherein the directed acyclic graph includes multiple paths, and the multiple paths Each path in includes an entry and the weight corresponding to the entry;
- the first determining unit is configured to determine the weight sum of all entries on each path in the directed acyclic graph
- the second determining unit is configured to determine the entries on the path with the smallest weight sum as the third word segmentation result.
- the first word segmentation result is all the entries on the first path in the directed acyclic graph obtained after the text to be processed is segmented by a dynamic programming algorithm
- the second word segmentation result is all the entries on the second path in the directed acyclic graph
- the third word segmentation unit 408 includes: a third determination unit and a fourth determination unit; wherein,
- the third determining unit is configured to determine the weight sum of all entries on the first path and the weight sum of all entries on the second path respectively;
- the fourth determining unit is configured to determine the first word segmentation result as the third word segmentation result when the weight sum of all entries on the first path is less than the weight sum of all entries on the second path
- the fourth determining unit is further configured to determine the second word segmentation result as the third word segmentation result when the weight sum of all entries on the first path is greater than the weight sum of all entries on the second path.
- the first word segmentation unit 402 includes: a fifth determination unit and a matching unit;
- the fifth determining unit is configured to determine the first character of the text to be processed according to the first direction
- the matching unit is configured to use the first character as the current character, and match the entry consisting of the current character and the adjacent M characters against the entries in a preset dictionary library to obtain the entry beginning with the current character, thereby obtaining the first word segmentation result; where M is greater than or equal to 1 and less than or equal to Q, and Q is the number of characters in the text to be processed.
- the entries in the preset dictionary library are arranged in the order of weight; the matching unit is specifically configured to:
- the first character is taken as the current character, and the entry consisting of the current character and the adjacent M characters is matched against the entries in the preset dictionary library whose weight is greater than the preset weight to obtain the entry beginning with the current character, thereby obtaining the first word segmentation result.
- the second direction is the opposite direction of the first direction.
- the number of the text to be processed is at least two, and the at least two texts to be processed belong to the same application scenario, wherein the at least two texts to be processed include a first text to be processed and a second text to be processed.
- the text word segmentation device 40 may further include:
- the processing unit 4010 is configured to, after obtaining the word segmentation result of the first text to be processed, determine the word segmentation result of the second text to be processed according to the word segmentation result of the first text to be processed.
- the terminal performs two word segmentation operations on the text to be processed rather than a single rough segmentation, which avoids the randomness of the rough segmentation process in the prior art and improves the accuracy with which the terminal segments the text to be processed.
- FIG. 5 shows a schematic structural diagram of a terminal provided in an embodiment of the present application.
- the terminal 50 may include a processor 501, a memory 504, and a communication module 505.
- the processor 501, the memory 504, and the communication module 505 may be connected to each other through a bus 506.
- the memory 504 may be a high-speed random access memory (RAM) or a non-volatile memory, such as at least one disk memory.
- the memory 504 may also be at least one storage system located far away from the aforementioned processor 501.
- the memory 504 is used to store application program codes, which may include an operating system, a network communication module, a user interface module, and a data processing program.
- the communication module 505 is used to exchange information with external devices; the processor 501 is configured to call the program code to perform the following steps:
- the processor 501 may also be used for:
- a dynamic programming algorithm is used to segment the text to be processed to obtain a third word segmentation result.
- the processor 501 performs word segmentation on the to-be-processed text through a dynamic programming algorithm to obtain the third word segmentation result, which may include:
- a directed acyclic graph is constructed according to the relevance of adjacent characters among the multiple individual characters; wherein the directed acyclic graph includes multiple paths, and each of the multiple paths includes entries and the weights corresponding to those entries;
- the first word segmentation result is all the entries on the first path in the directed acyclic graph after the text to be processed is segmented by the dynamic programming algorithm;
- the second word segmentation result is all the entries on the second path in the directed acyclic graph;
- the processor 501 performs word segmentation on the to-be-processed text through a dynamic programming algorithm to obtain a third word segmentation result, which may include:
- the processor 501 performs word segmentation on the to-be-processed text in the first direction according to the word segmentation strategy for string matching to obtain the first word segmentation result, which may include:
- the first character is taken as the current character, and the entry formed by the current character and its M adjacent characters is matched against the entries in a preset dictionary library, so as to obtain the entry beginning with the current character and thereby the first word segmentation result; where M is greater than or equal to 1 and less than or equal to Q, and Q is the number of characters in the text to be processed.
- the entries in the preset dictionary library are arranged in order of weight; the processor 501 taking the first character as the current character and matching the entry formed by the current character and its M adjacent characters against the entries in the preset dictionary library, so as to obtain the entry beginning with the current character and thereby the first word segmentation result, includes:
- the first character is taken as the current character, and the entry formed by the current character and its M adjacent characters is matched against the entries in the preset dictionary library whose weight is greater than the preset weight, so as to obtain the entry beginning with the current character and thereby the first word segmentation result.
- the second direction is the opposite direction of the first direction.
- the number of texts to be processed is at least two, and the at least two texts to be processed belong to the same application scenario, wherein the at least two texts to be processed include a first text to be processed and a second text to be processed.
- the processor 501 may also be used for:
- the word segmentation result of the second text to be processed is determined according to the word segmentation result of the first text to be processed.
- execution steps of the processor in the terminal 50 in the embodiment of the present application can refer to the specific implementation of the terminal operation in the embodiments of FIG. 1 to FIG. 2 in the foregoing method embodiments, and details are not described herein again.
- the terminal 50 may include a mobile phone, a tablet computer, a personal digital assistant (PDA), a mobile Internet device (MID), a smart wearable device (such as a smart watch or smart bracelet), or any other device that can be used by users, which is not specifically limited in the embodiments of this application.
- the embodiment of the present application also provides a computer-readable storage medium for storing computer software instructions used by the terminal shown in FIG. 1 to FIG. 2, which includes a program for executing the above method embodiment. By executing the stored program, accurate word segmentation of the text to be processed can be achieved.
- An embodiment of the present application also provides a computer program, the computer program includes program instructions that, when executed by a processor, cause the processor to perform the method of the first aspect (FIG. 1 to FIG. 2).
- the embodiments of the present application can be provided as methods, systems, or computer program products. Therefore, the present application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, optical storage, etc.) containing computer-usable program codes.
- These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
- These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps is executed on the computer or other programmable equipment to produce computer-implemented processing, and the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
Disclosed in the embodiments of the present application are a text word segmentation method and apparatus, the method comprising: acquiring a text to be processed; performing word segmentation of the text to be processed along a first direction on the basis of a character string matching word segmentation policy to obtain a first word segmentation result; performing word segmentation of the text to be processed along a second direction on the basis of the character string matching word segmentation policy to obtain a second word segmentation result; and, if the first word segmentation result is consistent with the second word segmentation result, outputting the first word segmentation result or the second word segmentation result. By means of the present application, word segmentation accuracy for a text to be processed can be improved.
Description
This application claims priority to Chinese patent application No. 201910423046.7, filed with the Chinese Patent Office on May 20, 2019 and entitled "A text word segmentation method and device", the entire contents of which are incorporated herein by reference.
This application relates to the technical field of natural language processing, and in particular to a text word segmentation method and device.
In an era when voice interaction products are widespread, speech recognition and natural language processing each play an important role. Speech recognition decodes a speech signal into text; natural language processing performs semantic analysis on that text to obtain the user's request intent and so meet the user's functional needs. Chinese word segmentation is an important step in natural language understanding, and its accuracy directly affects the performance of human-computer interaction products.
Word segmentation is the splitting of a sentence into individual words, that is, the process of recombining a continuous sentence into a sequence of words according to a given specification. In Chinese word segmentation, for example, the goal is to split a sentence into individual Chinese words.
In the prior art, after the terminal obtains the user's voice information, it converts that information into the text to be processed. The terminal then matches the character strings in the text to be processed against the entries in a preset dictionary library according to a certain strategy. If an entry is found in the preset dictionary library, the match succeeds; the entry is taken, and the word segmentation result of the text to be processed is obtained in this way. In practice, however, this segmentation process is coarse and random, so the resulting segmentation is not accurate enough.
Specifically, the inaccuracy referred to here is the following: when the text to be processed is segmented according to a certain strategy, several segmentations are possible and different segmentation methods can produce different results, whereas ideally exactly one of these results is the best. Take the text to be processed "南方城市南京" ("the southern city Nanjing") as an example. If the entries collected in the preset dictionary library include 南方 (south), 南方城 (southern city), 市 (municipality), 城市 (city) and 南京 (Nanjing), the terminal's segmentation result may be 南方城/市/南京, or it may be 南方/城市/南京. The ideal result is 南方/城市/南京.
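The ambiguity in this example can be made concrete with a short sketch. The Python below is an illustration only and is not part of the application: the function name `all_segmentations` is hypothetical, and it simply enumerates every way of splitting the text into entries of the example dictionary.

```python
def all_segmentations(text, dictionary):
    """Return every way to split `text` into entries found in `dictionary`."""
    if not text:
        return [[]]  # one way to segment the empty string: no entries
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in dictionary:
            # Keep `head` and segment the remainder recursively.
            for rest in all_segmentations(text[end:], dictionary):
                results.append([head] + rest)
    return results


dictionary = {"南方", "南方城", "市", "城市", "南京"}
for candidate in all_segmentations("南方城市南京", dictionary):
    print("/".join(candidate))
```

For the example dictionary this prints 南方/城市/南京 and 南方城/市/南京, the two candidate results discussed above.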
How to determine the best word segmentation result among several candidates, so as to improve segmentation accuracy for the text to be processed, is therefore an actively studied technical problem.
Summary of the invention
The embodiments of the present application provide a text word segmentation method and device that can improve the accuracy with which a terminal segments the text to be processed.
In a first aspect, an embodiment of the present application provides a text word segmentation method, which includes:
obtaining the text to be processed;
performing word segmentation on the text to be processed along a first direction according to a string-matching word segmentation strategy to obtain a first word segmentation result;
performing word segmentation on the text to be processed along a second direction according to the string-matching word segmentation strategy to obtain a second word segmentation result;
if the first word segmentation result is consistent with the second word segmentation result, outputting the first word segmentation result or the second word segmentation result.
In a second aspect, an embodiment of the present application provides a text word segmentation device, which includes units for executing the method of the first aspect. Specifically, the text word segmentation device includes:
an obtaining unit, configured to obtain the text to be processed;
a first word segmentation unit, configured to perform word segmentation on the text to be processed along a first direction according to a string-matching word segmentation strategy to obtain a first word segmentation result;
a second word segmentation unit, configured to perform word segmentation on the text to be processed along a second direction according to the string-matching word segmentation strategy to obtain a second word segmentation result;
an output unit, configured to output the first word segmentation result or the second word segmentation result when the first word segmentation result is consistent with the second word segmentation result.
In a third aspect, an embodiment of the present application provides a terminal, including a processor configured to call stored program instructions to execute the method of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium that stores a computer program, the computer program including program instructions that, when executed by a processor, cause the processor to execute the method of the first aspect.
By implementing the embodiments of the present application, the terminal performs two word segmentation operations on the text to be processed rather than a single rough segmentation. This avoids the randomness of the rough segmentation process in the prior art and improves the accuracy with which the terminal segments the text to be processed.
FIG. 1 is a schematic flowchart of a text word segmentation method provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of a text word segmentation method provided by another embodiment of the present application;
FIG. 3A is a schematic diagram of multiple individual characters obtained by splitting the text to be processed, provided by an embodiment of the present application;
FIG. 3B is a schematic diagram of a directed acyclic graph provided by an embodiment of the present application;
FIG. 3C is a schematic diagram of another directed acyclic graph provided by an embodiment of the present application;
FIG. 4A is a schematic block diagram of a text word segmentation device provided by an embodiment of the present application;
FIG. 4B is a schematic block diagram of another text word segmentation device provided by an embodiment of the present application;
FIG. 5 is a schematic block diagram of a terminal provided by another embodiment of the present application.
The embodiments of the present application are described below with reference to the drawings.
In specific implementations, the terminals described in the embodiments of the present application include, but are not limited to, portable devices such as mobile phones, laptop computers, or tablet computers with touch-sensitive surfaces (for example, touch screen displays and/or touch pads). It should also be understood that, in some embodiments, the device is not a portable communication device but a desktop computer with a touch-sensitive surface (for example, a touch screen display and/or touch pad).
The following discussion describes a terminal that includes a display and a touch-sensitive surface. It should be understood, however, that the terminal may include one or more other physical user interface devices such as a physical keyboard, mouse, and/or joystick.
The terminal supports various applications, such as one or more of the following: a drawing application, a presentation application, a word processing application, a website creation application, a disk burning application, a spreadsheet application, a game application, a telephone application, a video conferencing application, an email application, an instant messaging application, an exercise support application, a photo management application, a digital camera application, a digital video camera application, a web browsing application, a digital music player application, and/or a digital video player application.
The various applications that can be executed on the terminal can use at least one common physical user interface device such as the touch-sensitive surface. One or more functions of the touch-sensitive surface, and the corresponding information displayed on the terminal, can be adjusted and/or changed between applications and/or within a given application. In this way, the common physical architecture of the terminal (for example, the touch-sensitive surface) can support various applications with user interfaces that are intuitive and transparent to the user.
With reference to the schematic flowchart of the text word segmentation method shown in FIG. 1, the following describes how the embodiments of the present application achieve accurate word segmentation of text. The method may include, but is not limited to, the following steps:
Step S100: Obtain the text to be processed.
In one implementation, the terminal obtains the text to be processed from the voice signal of a speaking user. In this case, the terminal first obtains the voice signal of the speaking user, then converts it into text information and obtains the text to be processed from that text information. For example, the terminal may use speech recognition technology to convert the voice signal of the speaking user into text information and then obtain the text to be processed from it.
In another implementation, the terminal may receive the text information corresponding to the user's voice signal directly from a speech recognition device and obtain the text to be processed from that text information.
In practical applications, the speaking user referred to here may include a user who speaks and produces a voice signal in a simultaneous interpretation scenario, and/or a user who produces a voice signal through a terminal; for example, the voice signal of the speaking user is received through a microphone or another voice collection device.
In another implementation of the present application, the terminal may obtain the text to be processed from text entered by the user, for example text entered in instant messaging, office documents, and similar scenarios.
For example, the text to be processed may be "北京大学生喝进口红酒" ("Beijing college students drink imported red wine") or "南方城市南京" ("the southern city Nanjing"); the embodiments of the present application place no specific limit on this.
Step S102: Perform word segmentation on the text to be processed along the first direction according to the string-matching word segmentation strategy to obtain a first word segmentation result.
In a specific implementation, performing word segmentation on the text to be processed along the first direction according to the string-matching word segmentation strategy to obtain the first word segmentation result includes:
determining the first character of the text to be processed according to the first direction;
taking the first character as the current character and matching the entry formed by the current character and its M adjacent characters against the entries in a preset dictionary library, so as to obtain the entry beginning with the current character and thereby the first word segmentation result; where M is greater than or equal to 1 and less than or equal to Q, and Q is the number of characters in the text to be processed.
Taking the application scenario of college students' daily life as an example, the preset dictionary library may take, but is not limited to, the form shown in Table 1:
Table 1: Preset dictionary library
Entry | Weight |
北京 (Beijing) | 4 |
北京大学 (Peking University) | 1 |
大学生 (college students) | 5 |
进口 (imported) | 4 |
红酒 (red wine) | 6 |
Note that, in specific implementations, the weight of an entry represents the probability that the entry occurs in the specific application scenario: the larger the weight, the more likely the entry is to occur. Therefore, when the word segmentation result of the text to be processed can take several forms, entries with larger weights are preferred.
Taking the entries 北京 and 北京大学 as an example, when determining the word segmentation result of the text to be processed "北京大学生喝进口红酒", the terminal prefers 北京 as the entry in the result.
In the embodiments of the present application, the preset dictionary library contains, as far as possible, every entry that may occur in a given application scenario. This avoids cases in which no accurate word segmentation result can be matched.
In one embodiment, the entries in the preset dictionary library are arranged in order of weight.
In practical applications, when the preset dictionary library takes the form shown in Table 1, the terminal can sort its entries by weight. In one possible implementation, the terminal arranges the entries in descending order of weight; in another possible implementation, it arranges them in ascending order of weight. For ease of explanation, Table 2 shows one form of the preset dictionary library provided by an embodiment of this application, in which the entries are arranged in descending order of weight.
Table 2: Preset dictionary library (entries in descending order of weight)
Entry | Weight |
红酒 (red wine) | 6 |
大学生 (college students) | 5 |
北京 (Beijing) | 4 |
进口 (imported) | 4 |
北京大学 (Peking University) | 1 |
As noted above, the weight of an entry represents the probability that the entry occurs in the specific application scenario; the larger the weight, the more likely the entry. In this case, the terminal takes the first character as the current character and matches the entry formed by the current character and its M adjacent characters against those entries in the preset dictionary library whose weight is greater than a preset weight, so as to obtain the entry beginning with the current character. Take the text to be processed "北京大学生喝进口红酒" as an example. For the first character 北, the preset dictionary library contains two entries beginning with that character, 北京 and 北京大学, with weights 4 and 1 respectively. Assuming the preset weight is 3, when determining the entry for the character 北 the terminal matches directly against the entry 北京 in the preset dictionary library, rather than matching 北京大学 first and then 北京 (or 北京 first and then 北京大学). Because the terminal matches directly against the entries whose weight exceeds the preset weight, it determines the word segmentation result of the text to be processed in the shortest time, which improves segmentation efficiency.
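The weight filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's implementation: the helper names `entries_above` and `candidates_for` are hypothetical, and the weights are this sketch's reading of Table 1 (北京 = 4 and 北京大学 = 1 are stated in the text).

```python
from itertools import takewhile

# Entry -> weight, as read from Table 1 (an assumption of this sketch).
TABLE_1 = {"北京": 4, "北京大学": 1, "大学生": 5, "进口": 4, "红酒": 6}


def entries_above(weights, preset_weight):
    """Entries whose weight exceeds `preset_weight`, highest weight first."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    # `ranked` is in descending order, so the scan can stop at the first
    # entry whose weight does not exceed the preset weight.
    kept = takewhile(lambda item: item[1] > preset_weight, ranked)
    return [entry for entry, _ in kept]


def candidates_for(current_char, weights, preset_weight):
    """Entries above the preset weight that begin with `current_char`."""
    return [entry for entry in entries_above(weights, preset_weight)
            if entry.startswith(current_char)]


print(candidates_for("北", TABLE_1, 3))
```

With a preset weight of 3, only 北京 remains as a candidate for the character 北, so 北京大学 is never compared. Keeping the entries sorted by weight is what allows the scan to stop early.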
In the embodiments of the present application, the preset weight can be set in various ways; for example, it may differ between application scenarios or be the same, and the embodiments of the present application place no specific limit on this.
Note also that the entries included in the preset dictionary library differ between application scenarios, which reduces blind matching by the terminal.
For example, in a debt collection scenario, the preset dictionary library may take the form shown in Table 3:
Table 3: Preset dictionary library for the debt collection scenario
Entry | Weight |
贷款 (loan) | 6 |
借款 (borrowing) | 3 |
金额 (amount) | 2 |
欠 (owe) | 4 |
期限 (term) | 1 |
For example, the first direction may be from left to right or from right to left; the embodiments of the present application place no specific limit on this. For ease of explanation, the embodiments of the present application are described with the first direction being from left to right.
In this case, the terminal determines that the first character of the text to be processed "北京大学生喝进口红酒" is 北 and takes the character 北 as the current character. The terminal combines the current character with its M adjacent characters (for example, M=1) to form a word (for example, 北京), obtaining an entry, and then checks whether that entry exists in the preset dictionary library. If it does, the entry is determined to be part of the word segmentation result. In practical applications, every character in the text to be processed can serve as the current character, and repeating the above operations (for example, forming a word and matching) yields the first word segmentation result of the text to be processed. For example, after the terminal segments the text to be processed "北京大学生喝进口红酒" along the first direction using the method described above, the first word segmentation result obtained is: 北京, 大学生, 喝, 红酒.
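The scan along a direction can be sketched in Python as follows. This is an illustration under assumptions that the text does not spell out: the function name `segment` is hypothetical, the candidate with the largest weight is chosen among the entries beginning at the current character (following the stated preference for larger weights), and a character that begins no entry is emitted on its own.

```python
# Entry -> weight, as read from Table 1 (an assumption of this sketch).
TABLE_1 = {"北京": 4, "北京大学": 1, "大学生": 5, "进口": 4, "红酒": 6}


def segment(text, weights, direction="ltr"):
    """Split `text` into dictionary entries, scanning in `direction`.

    `direction` is "ltr" (left to right) or "rtl" (right to left).
    """
    if direction == "rtl":
        # Scanning right to left is equivalent to scanning the reversed
        # text against a dictionary of reversed entries.
        mirrored = {entry[::-1]: w for entry, w in weights.items()}
        tokens = segment(text[::-1], mirrored, "ltr")
        return [token[::-1] for token in reversed(tokens)]

    longest = max((len(entry) for entry in weights), default=1)
    tokens, i = [], 0
    while i < len(text):
        limit = min(longest, len(text) - i)
        # Every dictionary entry that begins at the current character.
        candidates = [text[i:i + n] for n in range(1, limit + 1)
                      if text[i:i + n] in weights]
        # Prefer the candidate with the largest weight; otherwise fall back
        # to the single current character.
        token = max(candidates, key=weights.get) if candidates else text[i]
        tokens.append(token)
        i += len(token)
    return tokens


print("/".join(segment("北京大学生喝进口红酒", TABLE_1)))
```

With the Table 1 weights this sketch yields 北京/大学生/喝/进口/红酒 in both directions; it also emits 进口, which is an entry in Table 1.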
Step S104: Segment the text to be processed along a second direction according to the string-matching segmentation strategy to obtain a second word segmentation result.
In a specific implementation, segmenting the text to be processed along the second direction according to the string-matching segmentation strategy to obtain the second word segmentation result includes:
determining the first character of the text to be processed according to the second direction;
taking the first character as the current character, and matching the entry formed by the current character and the M characters adjacent to it against the entries in the preset dictionary library, so as to obtain the entries that begin with the current character and thereby the second word segmentation result; where M is greater than or equal to 1 and less than or equal to Q, and Q is the number of characters in the text to be processed.
In a specific implementation, the second direction may be the same as the first direction or opposite to it.
In one embodiment, the second direction is the same as the first direction. The same segmentation operation is then performed twice on the text to be processed, which avoids randomness in determining the segmentation result. The randomness arises from uncertainty in the terminal's matching process when the preset dictionary library contains several entries that begin with the same current character. When the two segmentation results the terminal obtains for a text by the method described in this application are inconsistent, the terminal can segment the text again using a dynamic programming algorithm.
In a preferred implementation, the second direction is the reverse of the first direction. This amounts to one backtracking pass over the text to be processed, and the resulting segmentation is more accurate than when the two directions are the same. The case in which the second direction is the reverse of the first is described in detail below:
As stated above, the first direction is left to right, so the second direction is right to left.
In this case, the terminal determines that the first character of the text to be processed, "北京大学生喝进口红酒", is "酒", and takes "酒" as the current character. The terminal combines the current character with the M characters adjacent to it (for example, M=1) to form a candidate entry (for example, "红酒", "red wine"), and then checks whether the preset dictionary library contains that entry. If it does, the entry is taken as a segmentation result. In practice, each character of the text to be processed can serve as the current character in turn, and repeating the operations above (combining characters, then matching) yields the second word segmentation result of the text. For example, after the terminal segments "北京大学生喝进口红酒" along the second direction in this way, the second word segmentation result may be: "北京" (Beijing), "大学生" (college students), "喝" (drink), "进口" (imported), "红酒" (red wine). As another example, the second word segmentation result may instead be: "北京大学" (Peking University), "生" (student), "喝" (drink), "进口" (imported), "红酒" (red wine).
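A right-to-left pass can be sketched in the same spirit. As before, this is only an illustration under stated assumptions: `backward_match`, `max_extra`, the shortest-hit-first rule, the single-character fallback, and the dictionary contents are hypothetical and not taken from the application.

```python
def backward_match(text, dictionary, max_extra=4):
    """Segment text right to left by string matching (illustrative sketch).

    The current character is the last unsegmented character. It is combined
    with the M characters to its left (M = 1, 2, ..., max_extra) and the
    first combination found in the dictionary is accepted. If nothing
    matches, the current character is emitted alone.
    """
    tokens = []
    end = len(text)
    while end > 0:
        match_start = end - 1  # fallback: the current character by itself
        for m in range(1, max_extra + 1):
            start = end - 1 - m
            if start < 0:
                break
            if text[start:end] in dictionary:
                match_start = start
                break
        tokens.append(text[match_start:end])
        end = match_start
    tokens.reverse()  # tokens were collected from the end of the text
    return tokens


# Hypothetical dictionary library, chosen only for this example.
DICTIONARY = {"北京", "北京大学", "大学生", "进口", "口红", "红酒"}

print(backward_match("北京大学生喝进口红酒", DICTIONARY))
# ['北京', '大学生', '喝', '进口', '红酒']
```

With a different dictionary (for instance one that lacks "大学生"), the same pass would produce "北京大学" followed by "生", which corresponds to the alternative second result mentioned in the text.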
Step S106: If the first word segmentation result is consistent with the second word segmentation result, output the first word segmentation result or the second word segmentation result.
In a specific implementation, the terminal may compare the two results entry by entry to determine whether the first word segmentation result is consistent with the second.
For example, the terminal segments the text to be processed, "北京大学生喝进口红酒", along the first direction according to the string-matching segmentation strategy, and the first word segmentation result is: "北京", "大学生", "喝", "进口", "红酒". The terminal then segments the same text along the second direction according to the same strategy, and the second word segmentation result is: "北京", "大学生", "喝", "进口", "红酒". After obtaining the two results, the terminal compares them entry by entry and determines that they are consistent. In this case, the terminal may output either the first or the second word segmentation result.
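The entry-by-entry comparison can be sketched as below. The function name `results_agree` is hypothetical, and treating results of different lengths as inconsistent is an assumption made for the sketch.

```python
def results_agree(first, second):
    """Return True when two segmentation results match entry by entry."""
    if len(first) != len(second):
        return False
    # Compare the entries at the same position, one pair at a time.
    return all(a == b for a, b in zip(first, second))


first_result = ["北京", "大学生", "喝", "进口", "红酒"]
second_result = ["北京", "大学生", "喝", "进口", "红酒"]
print(results_agree(first_result, second_result))
```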
It will be understood that once the terminal has obtained the correct segmentation result for the text to be processed and output it, the terminal can better understand the meaning of the user's utterance.
In the embodiments of this application, the segmentation result may be output in either of the following forms: the terminal shows the segmentation result of the text to be processed on its display, or the terminal announces the segmentation result by voice. During a voice announcement, there is a pause between successive entries so that the user can follow the segmentation result more easily.
Taking a debt collection scenario as an example, the terminal can use the segmentation result to judge the speaker's financial situation more reliably (for example, whether the speaker can or cannot repay the debt). Once the collector knows the user's financial situation, the collector can make reasonable decisions based on it and so improve the collection outcome.
By implementing the embodiments of this application, the terminal performs two segmentation passes on the text to be processed rather than a single rough pass. This avoids the randomness present in rough segmentation in the prior art and improves the terminal's segmentation accuracy for the text to be processed.
It should be noted that, in practice, a single application scenario usually involves more than one text to be processed. When there are multiple texts to be processed, for example a first text to be processed and a second text to be processed, the terminal may segment the second text based on the segmentation result of the first text. In other words, the terminal uses the surrounding context to segment the text, which improves its segmentation accuracy. In a specific implementation, the terminal may use a deep learning algorithm to determine the segmentation result of the second text to be processed. Specifically, this may include: training a model on the segmentation result of the first text to be processed to obtain a trained deep learning model; and then processing the second text to be processed with the trained model to obtain its segmentation result.
In the embodiments of this application, the deep learning algorithm may include, but is not limited to, a deep neural network (DNN) model, a long short-term memory (LSTM) network model, and the like.
Taking the LSTM model as an example: an LSTM model uses an input gate, an output gate, a forget gate, and a cell structure to control how historical information is learned and forgotten, which makes the model suited to long-sequence problems. In the embodiments of this application, the model determines the segmentation result of the next text to be processed from the segmentation results of N historical texts to be processed in the same application scenario, where N is a positive integer. Entries already identified by matching in the N consecutive historical texts can also be used when matching entries in the next text to be processed, which improves both the accuracy and the efficiency of segmentation.
Optionally, after performing step S106, the terminal may also perform step S108. How the embodiments of this application segment the text to be processed is described below with reference to the text segmentation method shown in FIG. 2, beginning with a detailed description of step S108:
Step S108: If the first word segmentation result is inconsistent with the second word segmentation result, segment the text to be processed using a dynamic programming algorithm to obtain a third word segmentation result.
For example, the terminal segments the text to be processed, "北京大学生喝进口红酒", along the first direction according to the string-matching segmentation strategy, and the first word segmentation result is: "北京", "大学生", "喝", "进口", "红酒". The terminal then segments the same text along the second direction according to the same strategy, and the second word segmentation result is: "北京大学", "生", "喝", "进口", "红酒". Comparing the two results entry by entry, the terminal determines that they are inconsistent, which indicates that an ambiguous field is present. The terminal then segments "北京大学生喝进口红酒" using a dynamic programming algorithm to obtain a third word segmentation result.
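The control flow of steps S102 to S108 can be summarized in a short sketch. The three segmenters are passed in as functions so that the sketch stays independent of any particular matching or dynamic programming implementation; the name `segment_text` and its parameters are hypothetical.

```python
def segment_text(text, first_pass, second_pass, fallback):
    """Run two matching passes and fall back only when they disagree.

    first_pass, second_pass: functions returning a list of entries for text.
    fallback: function used when the two results are inconsistent, for
    example a dynamic-programming segmenter.
    """
    first_result = first_pass(text)
    second_result = second_pass(text)
    if first_result == second_result:
        # Consistent: either result may be output.
        return first_result
    # Inconsistent: an ambiguous field is present, so use the fallback.
    return fallback(text)


# Toy stand-ins for the real segmenters, used only to show the branching.
agreeing = lambda text: ["北京", "大学生", "喝", "进口", "红酒"]
differing = lambda text: ["北京大学", "生", "喝", "进口", "红酒"]
resolver = lambda text: ["北京", "大学生", "喝", "进口", "红酒"]

print(segment_text("北京大学生喝进口红酒", agreeing, differing, resolver))
```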
In a specific implementation, segmenting the text to be processed using the dynamic programming algorithm to obtain the third word segmentation result includes:
splitting the text to be processed into multiple individual characters;
constructing a directed acyclic graph according to the relevance of adjacent characters among the individual characters, where the directed acyclic graph includes multiple paths, and each path includes entries and the weight corresponding to each entry;
determining, for each path in the directed acyclic graph, the sum of the weights of all entries on that path;
taking the entries on the path with the smallest weight sum as the third word segmentation result.
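The steps above can be sketched as a shortest-path computation over character positions. This is a minimal sketch under assumptions: the function name `min_weight_segmentation` is hypothetical, the graph is represented implicitly (an edge from position i to position j exists when `text[i:j]` is a weighted entry), and the `WEIGHTS` table simply collects the entry weights used in the worked example of this description.

```python
import math


def min_weight_segmentation(text, weights):
    """Return (entries, total) for the path with the smallest weight sum."""
    n = len(text)
    best = [math.inf] * (n + 1)  # best[i]: smallest weight sum covering text[:i]
    back = [None] * (n + 1)      # back[i]: start position of the last entry
    best[0] = 0
    longest = max(len(entry) for entry in weights)
    # Positions are visited left to right, which is a topological order
    # of the directed acyclic graph.
    for start in range(n):
        if best[start] == math.inf:
            continue  # no path reaches this position
        for end in range(start + 1, min(n, start + longest) + 1):
            entry = text[start:end]
            if entry in weights and best[start] + weights[entry] < best[end]:
                best[end] = best[start] + weights[entry]
                back[end] = start
    if best[n] == math.inf:
        raise ValueError("no path of dictionary entries covers the whole text")
    entries = []
    end = n
    while end > 0:  # walk the back-pointers to recover the path
        start = back[end]
        entries.append(text[start:end])
        end = start
    entries.reverse()
    return entries, best[n]


# Entry weights taken from the worked example in this description.
WEIGHTS = {"北京": 4, "北京大学": 4, "大学生": 5, "生": 6, "喝": 5,
           "进口": 4, "进": 2, "口红": 8, "红酒": 6, "酒": 2}

print(min_weight_segmentation("北京大学生喝进口红酒", WEIGHTS))
# (['北京', '大学生', '喝', '进口', '红酒'], 24)
```

The dynamic programming form avoids enumerating every path explicitly; each position keeps only the cheapest way of reaching it.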
Taking the text to be processed "北京大学生喝进口红酒" as an example, the terminal splits the text into the individual characters shown in FIG. 3A, where each character can represent one node.
The terminal then constructs a directed acyclic graph according to the relevance of adjacent characters among these individual characters. Here, the relevance of adjacent characters means that neighboring characters can be combined into entries. Taking the character "北" as an example, the entries that can be formed starting from "北" are: "北京" (Beijing), "北京大学" (Peking University), and "北京大学生" (Beijing college students).
For example, the directed acyclic graph that the terminal constructs for the text to be processed using the dynamic programming algorithm may be as shown in Table 3B. As shown in Table 3B, the directed acyclic graph includes the following paths, each of which includes entries and the weight corresponding to each entry:
Path 1 includes the entries: 北京 (4) -- 大学生 (5) -- 喝 (5) -- 进口 (4) -- 红酒 (6);
Path 2 includes the entries: 北京大学 (4) -- 生 (6) -- 喝 (5) -- 进口 (4) -- 红酒 (6);
Path 3 includes the entries: 北京 (4) -- 大学生 (5) -- 喝 (5) -- 进 (2) -- 口红 (8) -- 酒 (2).
After obtaining the paths, the terminal determines the sum of the weights of all entries on each path.
Taking path 1 as an example, the terminal determines that the weight sum of all entries on path 1 is 4+5+5+4+6=24.
Using the same calculation, the terminal determines that the weight sum of all entries on path 2 is 4+6+5+4+6=25, and that the weight sum of all entries on path 3 is 4+5+5+2+8+2=26.
After determining the weight sum of all entries on each path in the directed acyclic graph, the terminal takes the entries on the path with the smallest weight sum as the third word segmentation result.
For example, the terminal compares the weight sum of path 1 with those of path 2 and path 3 in turn and determines that the weight sum of path 1 is the smallest of the three. The terminal therefore takes the entries on path 1 as the third word segmentation result. That is, the third word segmentation result for the text to be processed "北京大学生喝进口红酒" is: "北京", "大学生", "喝", "进口", "红酒".
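The arithmetic of this example can be checked with a few lines of code. The entries and weights are those listed for paths 1 to 3; the variable names and the dictionary layout are chosen only for illustration.

```python
# Each path is a list of (entry, weight) pairs from the worked example.
PATHS = {
    "path 1": [("北京", 4), ("大学生", 5), ("喝", 5), ("进口", 4), ("红酒", 6)],
    "path 2": [("北京大学", 4), ("生", 6), ("喝", 5), ("进口", 4), ("红酒", 6)],
    "path 3": [("北京", 4), ("大学生", 5), ("喝", 5), ("进", 2), ("口红", 8), ("酒", 2)],
}

# Weight sum of all entries on each path.
totals = {name: sum(weight for _, weight in path) for name, path in PATHS.items()}
print(totals)  # {'path 1': 24, 'path 2': 25, 'path 3': 26}

# The path with the smallest weight sum supplies the third result.
best_path = min(totals, key=totals.get)
third_result = [entry for entry, _ in PATHS[best_path]]
print(best_path, third_result)
```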
By implementing the embodiments of this application, when the first word segmentation result is inconsistent with the second, which indicates that an ambiguous field is present, the terminal determines the segmentation result of the text to be processed using the dynamic programming algorithm and the minimum-weight path rule. This resolves the ambiguous field and improves the terminal's segmentation accuracy for the text to be processed.
In practice, the first word segmentation result may correspond to all the entries on a first path in the directed acyclic graph obtained when the text to be processed is segmented by the dynamic programming algorithm, and the second word segmentation result may correspond to all the entries on a second path in that directed acyclic graph.
In that case, segmenting the text to be processed using the dynamic programming algorithm to obtain the third word segmentation result includes:
determining the weight sum of all entries on the first path and the weight sum of all entries on the second path;
if the weight sum of all entries on the first path is less than the weight sum of all entries on the second path, taking the first word segmentation result as the third word segmentation result;
otherwise, taking the second word segmentation result as the third word segmentation result.
For example, the directed acyclic graph that the terminal constructs for the text to be processed using the dynamic programming algorithm is shown in FIG. 3C. The terminal determines that the first word segmentation result consists of all the entries on path 1 in the graph and that the second word segmentation result consists of all the entries on path 2. The terminal then only has to choose the third word segmentation result from the first and second results, which improves its segmentation efficiency for the text to be processed.
For example, the terminal calculates that the weight sum of path 1 in FIG. 3C is 24 and that the weight sum of path 2 in FIG. 3C is 25.
The terminal determines that the weight sum of the first word segmentation result is less than that of the second, and therefore outputs the first word segmentation result. That is, the terminal determines that the segmentation result of the text to be processed is: "北京", "大学生", "喝", "进口", "红酒".
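Choosing between the two existing results by weight sum can be sketched as follows. The function name `pick_by_weight` and the `ENTRY_WEIGHTS` table are hypothetical; the weights are those of the worked example, and a tie is resolved in favor of the second result because the description states "otherwise".

```python
def pick_by_weight(first_result, second_result, weights):
    """Return the result whose entries have the smaller weight sum."""
    first_total = sum(weights[entry] for entry in first_result)
    second_total = sum(weights[entry] for entry in second_result)
    # Strictly smaller first total selects the first result; otherwise the second.
    return first_result if first_total < second_total else second_result


# Entry weights taken from the worked example in this description.
ENTRY_WEIGHTS = {"北京": 4, "北京大学": 4, "大学生": 5, "生": 6,
                 "喝": 5, "进口": 4, "红酒": 6}

first = ["北京", "大学生", "喝", "进口", "红酒"]   # weight sum 24
second = ["北京大学", "生", "喝", "进口", "红酒"]  # weight sum 25
print(pick_by_weight(first, second, ENTRY_WEIGHTS))
```

Only two sums are computed here, which is why this variant is cheaper than searching every path in the graph.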
By implementing this application, the terminal's segmentation efficiency for the text to be processed can be improved along with its segmentation accuracy.
To facilitate implementation of the above solutions of the embodiments of this application, this application also provides a corresponding text segmentation apparatus, which is described in detail below with reference to the accompanying drawings:
FIG. 4A is a schematic structural diagram of a text segmentation apparatus provided by an embodiment of this application. The text segmentation apparatus 40 may include: an acquisition unit 400, a first word segmentation unit 402, a second word segmentation unit 404, and an output unit 406;
the acquisition unit 400 is configured to acquire the text to be processed;
the first word segmentation unit 402 is configured to segment the text to be processed along a first direction according to a string-matching segmentation strategy to obtain a first word segmentation result;
the second word segmentation unit 404 is configured to segment the text to be processed along a second direction according to the string-matching segmentation strategy to obtain a second word segmentation result;
the output unit 406 is configured to output the first word segmentation result or the second word segmentation result when the first word segmentation result is consistent with the second word segmentation result.
Optionally, as shown in FIG. 4B, the text segmentation apparatus 40 further includes a third word segmentation unit 408;
the third word segmentation unit 408 is configured to segment the text to be processed using a dynamic programming algorithm to obtain a third word segmentation result when the first word segmentation result is inconsistent with the second word segmentation result.
The third word segmentation unit 408 includes: a splitting unit, a construction unit, a first determination unit, and a second determination unit; where
the splitting unit is configured to split the text to be processed into multiple individual characters;
the construction unit is configured to construct a directed acyclic graph according to the relevance of adjacent characters among the individual characters, where the directed acyclic graph includes multiple paths, and each path includes entries and the weight corresponding to each entry;
the first determination unit is configured to determine, for each path in the directed acyclic graph, the sum of the weights of all entries on that path;
the second determination unit is configured to take the entries on the path with the smallest weight sum as the third word segmentation result.
Optionally, the first word segmentation result corresponds to all the entries on a first path in the directed acyclic graph obtained when the text to be processed is segmented by the dynamic programming algorithm, and the second word segmentation result corresponds to all the entries on a second path in that directed acyclic graph;
the third word segmentation unit 408 includes: a third determination unit and a fourth determination unit; where
the third determination unit is configured to determine the weight sum of all entries on the first path and the weight sum of all entries on the second path;
the fourth determination unit is configured to take the first word segmentation result as the third word segmentation result when the weight sum of all entries on the first path is less than the weight sum of all entries on the second path;
the fourth determination unit is further configured to take the second word segmentation result as the third word segmentation result when the weight sum of all entries on the first path is greater than the weight sum of all entries on the second path.
The first word segmentation unit 402 includes: a fifth determination unit and a matching unit;
the fifth determination unit is configured to determine the first character of the text to be processed according to the first direction;
the matching unit is configured to take the first character as the current character, and to match the entry formed by the current character and the M characters adjacent to it against the entries in the preset dictionary library, so as to obtain the entries that begin with the current character and thereby the first word segmentation result; where M is greater than or equal to 1 and less than or equal to Q, and Q is the number of characters in the text to be processed.
Optionally, the entries in the preset dictionary library are arranged in order of weight, and the matching unit is specifically configured to:
take the first character as the current character, and match the entry formed by the current character and the M characters adjacent to it against those entries in the preset dictionary library whose weight is greater than a preset weight, so as to obtain the entries that begin with the current character and thereby the first word segmentation result.
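Restricting the match to entries above a preset weight can be sketched as below. This is an illustration under assumptions: the function name `entries_above`, the descending sort order, and the sample weights are hypothetical, and the description does not specify how the weights are assigned.

```python
from itertools import takewhile


def entries_above(weighted_dictionary, preset_weight):
    """Return the entries whose weight exceeds preset_weight, heaviest first."""
    # Arrange the entries in descending order of weight.
    ordered = sorted(weighted_dictionary.items(),
                     key=lambda item: item[1], reverse=True)
    # Because the list is ordered, scanning can stop at the first entry
    # that does not exceed the preset weight.
    kept = takewhile(lambda item: item[1] > preset_weight, ordered)
    return [entry for entry, _ in kept]


# Hypothetical weighted dictionary library.
WEIGHTED_DICTIONARY = {"北京": 4, "大学生": 5, "红酒": 6, "进": 2, "酒": 2}

print(entries_above(WEIGHTED_DICTIONARY, 3))
```

Keeping the dictionary ordered by weight means the candidate set for matching can be cut off early instead of scanning every entry.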
Optionally, the second direction is the reverse of the first direction.
Optionally, there are at least two texts to be processed, and the at least two texts belong to the same application scenario. The at least two texts to be processed include a first text to be processed and a second text to be processed, and the text segmentation apparatus 40 may further include:
a processing unit 4010, configured to determine, after the segmentation result of the first text to be processed is obtained, the segmentation result of the second text to be processed according to the segmentation result of the first text to be processed.
By implementing the embodiments of this application, the terminal performs two segmentation passes on the text to be processed rather than a single rough pass. This avoids the randomness present in rough segmentation in the prior art and improves the terminal's segmentation accuracy for the text to be processed.
To facilitate implementation of the above solutions of the embodiments of this application, this application also provides a corresponding terminal, which is described in detail below with reference to the accompanying drawings:
FIG. 5 is a schematic structural diagram of a terminal provided by an embodiment of this application. The terminal 50 may include a processor 501, a memory 504, and a communication module 505, which may be connected to one another through a bus 506. The memory 504 may be a high-speed random access memory (RAM) or a non-volatile memory, such as at least one disk memory. Optionally, the memory 504 may also be at least one storage system located remotely from the processor 501. The memory 504 is used to store application program code, which may include an operating system, a network communication module, a user interface module, and a data processing program. The communication module 505 is used to exchange information with external devices. The processor 501 is configured to call the program code and perform the following steps:
acquiring the text to be processed;
segmenting the text to be processed along a first direction according to a string-matching segmentation strategy to obtain a first word segmentation result;
segmenting the text to be processed along a second direction according to the string-matching segmentation strategy to obtain a second word segmentation result;
if the first word segmentation result is consistent with the second word segmentation result, outputting the first word segmentation result or the second word segmentation result.
The processor 501 may also be configured to:
if the first word segmentation result is inconsistent with the second word segmentation result, segment the text to be processed using a dynamic programming algorithm to obtain a third word segmentation result.
The processor 501 segmenting the text to be processed using the dynamic programming algorithm to obtain the third word segmentation result may include:
splitting the text to be processed into multiple individual characters;
constructing a directed acyclic graph according to the relevance of adjacent characters among the individual characters, where the directed acyclic graph includes multiple paths, and each path includes entries and the weight corresponding to each entry;
determining, for each path in the directed acyclic graph, the sum of the weights of all entries on that path;
taking the entries on the path with the smallest weight sum as the third word segmentation result.
Here, the first word segmentation result is all the entries on a first path in the directed acyclic graph obtained after the text to be processed is segmented by the dynamic programming algorithm, and the second word segmentation result is all the entries on a second path in the directed acyclic graph;
The processor 501 performing word segmentation on the text to be processed through the dynamic programming algorithm to obtain the third word segmentation result may include:
Determine the weight sum of all entries on the first path and the weight sum of all entries on the second path, respectively;
If the weight sum of all entries on the first path is less than the weight sum of all entries on the second path, determine the first word segmentation result as the third word segmentation result;
Otherwise, determine the second word segmentation result as the third word segmentation result.
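This two-path variant only has to compare two sums. A minimal sketch, assuming a hypothetical `weights` table that holds a weight for every entry on both paths:

```python
from typing import Dict, List


def pick_by_weight_sum(first: List[str],
                       second: List[str],
                       weights: Dict[str, float]) -> List[str]:
    """Choose between two candidate segmentations by their weight sums."""
    first_sum = sum(weights[w] for w in first)
    second_sum = sum(weights[w] for w in second)
    # Strictly smaller sum selects the first result; otherwise (including a tie,
    # following the "otherwise" branch of the description) the second is used.
    return first if first_sum < second_sum else second
```

Because both candidates are already available from the string-matching passes, this variant avoids enumerating every path in the graph.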
The processor 501 performing word segmentation on the text to be processed along the first direction according to the string-matching word segmentation strategy to obtain the first word segmentation result may include:
Determine the first character of the text to be processed according to the first direction;
Take the first character as the current character, and match the entries formed by the current character and the M characters adjacent to it against the entries in a preset dictionary library, so as to obtain the entries beginning with the current character and thereby the first word segmentation result, where M is greater than or equal to 1 and less than or equal to Q, and Q is the number of characters in the text to be processed.
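One common way to realize this kind of matching is forward maximum matching. The sketch below assumes that strategy (longest candidate first); the text itself only says the matching is done "in a matching manner", so this is an illustrative choice. The names `forward_match`, `dictionary` and `max_len` are hypothetical, and ASCII letters stand in for characters.

```python
from typing import List, Set


def forward_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Segment text from the first character onward, longest dictionary entry first."""
    words, i = [], 0
    while i < len(text):
        # Candidates: the current character plus up to max_len - 1 following characters.
        for j in range(min(len(text), i + max_len), i, -1):
            # Accept a dictionary entry, or fall back to the single current character.
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j  # the character after the match becomes the current character
                break
    return words
```

With the hypothetical dictionary `{"ab", "abc", "cd", "d"}`, the text `"abcd"` is segmented as `abc / d`.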
Here, the entries in the preset dictionary library are arranged in order of weight; the processor 501 taking the first character as the current character and matching the entries formed by the current character and the M characters adjacent to it against the entries in the preset dictionary library, so as to obtain the entries beginning with the current character and thereby the first word segmentation result, includes:
Take the first character as the current character, and match the entries formed by the current character and the M characters adjacent to it against those entries in the preset dictionary library whose weight is greater than a preset weight, so as to obtain the entries beginning with the current character and thereby the first word segmentation result.
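Keeping the dictionary ordered by weight means the subset above the threshold is a prefix of the list, so the scan can stop at the first entry that fails the test. A minimal sketch, assuming descending weight order and a hypothetical list of `(entry, weight)` pairs:

```python
from itertools import takewhile
from typing import List, Set, Tuple


def entries_above(sorted_entries: List[Tuple[str, float]],
                  preset_weight: float) -> Set[str]:
    """Return the entries whose weight exceeds preset_weight.

    Assumes sorted_entries is ordered by descending weight, so scanning
    can stop at the first entry that does not exceed the threshold.
    """
    above = takewhile(lambda pair: pair[1] > preset_weight, sorted_entries)
    return {entry for entry, _ in above}
```

The resulting set can then be used as the dictionary for the matching pass, which shrinks the number of entries each candidate is compared against.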
Here, the second direction is the opposite of the first direction.
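A pass in the opposite direction can be sketched as backward maximum matching, which starts from the last character and works toward the start of the text. As before, longest-match-first is an assumed strategy, and `backward_match`, `dictionary` and `max_len` are hypothetical names.

```python
from typing import List, Set


def backward_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Segment text from the last character backward, longest dictionary entry first."""
    words, j = [], len(text)
    while j > 0:
        # Candidates end at position j and start up to max_len characters earlier.
        for i in range(max(0, j - max_len), j):
            # Accept a dictionary entry, or fall back to the single last character.
            if text[i:j] in dictionary or i == j - 1:
                words.append(text[i:j])
                j = i
                break
    words.reverse()  # entries were collected right to left
    return words
```

With the hypothetical dictionary `{"ab", "abc", "cd", "d"}`, the text `"abcd"` is segmented as `ab / cd` in this direction, whereas a longest-match pass from the start of the text would yield `abc / d`. A disagreement of this kind is what triggers the dynamic programming step.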
Here, there are at least two texts to be processed, and the at least two texts to be processed belong to the same application scenario, where the at least two texts to be processed include a first text to be processed and a second text to be processed, and the processor 501 may further be configured to:
After the word segmentation result of the first text to be processed is obtained, determine the word segmentation result of the second text to be processed according to the word segmentation result of the first text to be processed.
It should be noted that, for the steps executed by the processor of the terminal 50 in this embodiment of the application, reference may be made to the specific implementation of the terminal in the method embodiments of FIG. 1 to FIG. 2 above; details are not repeated here.
In a specific implementation, the terminal 50 may be any device usable by a user, such as a mobile phone, a tablet computer, a personal digital assistant (PDA), a mobile Internet device (MID), or a smart wearable device (such as a smart watch or smart bracelet); the embodiments of this application do not specifically limit this.
An embodiment of this application further provides a computer-readable storage medium for storing the computer software instructions used by the terminal shown in FIG. 1 to FIG. 2, including a program for executing the above method embodiments. By executing the stored program, accurate word segmentation of the text to be processed can be achieved.
An embodiment of this application further provides a computer program, the computer program including program instructions that, when executed by a processor, cause the processor to perform the method of the first aspect (FIG. 1 to FIG. 2).
Those skilled in the art should understand that the embodiments of this application may be provided as a method, a system, or a computer program product. Accordingly, this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, this application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage and optical storage) containing computer-usable program code.
This application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to the embodiments of this application. It should be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus that implements the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more processes of the flowcharts and/or one or more blocks of the block diagrams.
Obviously, those skilled in the art can make various changes and modifications to this application without departing from its spirit and scope. Thus, if these modifications and variations fall within the scope of the claims of this application and their technical equivalents, this application is also intended to cover them.
Claims (20)
- A text word segmentation method, characterized by comprising: acquiring text to be processed; performing word segmentation on the text to be processed along a first direction according to a string-matching word segmentation strategy to obtain a first word segmentation result; performing word segmentation on the text to be processed along a second direction according to the string-matching word segmentation strategy to obtain a second word segmentation result; and if the first word segmentation result is consistent with the second word segmentation result, outputting the first word segmentation result or the second word segmentation result.
- The method according to claim 1, characterized in that the method further comprises: if the first word segmentation result is inconsistent with the second word segmentation result, performing word segmentation on the text to be processed through a dynamic programming algorithm to obtain a third word segmentation result.
- The method according to claim 2, characterized in that performing word segmentation on the text to be processed through the dynamic programming algorithm to obtain the third word segmentation result comprises: splitting the text to be processed to obtain multiple individual characters; constructing a directed acyclic graph according to the relevance of adjacent characters among the multiple individual characters, wherein the directed acyclic graph includes multiple paths, and each of the multiple paths includes entries and the weight corresponding to each entry; determining the weight sum of all entries on each path in the directed acyclic graph; and determining the entries on the path with the smallest weight sum as the third word segmentation result.
- The method according to claim 2, characterized in that the first word segmentation result is all the entries on a first path in a directed acyclic graph obtained after the text to be processed is segmented by the dynamic programming algorithm, and the second word segmentation result is all the entries on a second path in the directed acyclic graph; and performing word segmentation on the text to be processed through the dynamic programming algorithm to obtain the third word segmentation result comprises: determining the weight sum of all entries on the first path and the weight sum of all entries on the second path, respectively; and if the weight sum of all entries on the first path is less than the weight sum of all entries on the second path, determining the first word segmentation result as the third word segmentation result.
- The method according to claim 4, characterized in that the method further comprises: if the weight sum of all entries on the first path is greater than the weight sum of all entries on the second path, determining the second word segmentation result as the third word segmentation result.
- The method according to claim 1, characterized in that performing word segmentation on the text to be processed along the first direction according to the string-matching word segmentation strategy to obtain the first word segmentation result comprises: determining the first character of the text to be processed according to the first direction; and taking the first character as the current character, and matching the entries formed by the current character and the M characters adjacent to it against the entries in a preset dictionary library, so as to obtain the entries beginning with the current character and thereby the first word segmentation result, wherein M is greater than or equal to 1 and less than or equal to Q, and Q is the number of characters in the text to be processed.
- The method according to claim 6, characterized in that the entries in the preset dictionary library are arranged in order of weight; and taking the first character as the current character and matching the entries formed by the current character and the M characters adjacent to it against the entries in the preset dictionary library, so as to obtain the entries beginning with the current character and thereby the first word segmentation result, comprises: taking the first character as the current character, and matching the entries formed by the current character and the M characters adjacent to it against those entries in the preset dictionary library whose weight is greater than a preset weight, so as to obtain the entries beginning with the current character and thereby the first word segmentation result.
- The method according to any one of claims 1-7, characterized in that the second direction is the opposite of the first direction.
- The method according to claim 1, characterized in that there are at least two texts to be processed, and the at least two texts to be processed belong to the same application scenario, wherein the at least two texts to be processed include a first text to be processed and a second text to be processed, and the method further comprises: after the word segmentation result of the first text to be processed is obtained, determining the word segmentation result of the second text to be processed according to the word segmentation result of the first text to be processed.
- A text word segmentation apparatus, characterized by comprising: an acquiring unit, configured to acquire text to be processed; a first word segmentation unit, configured to perform word segmentation on the text to be processed along a first direction according to a string-matching word segmentation strategy to obtain a first word segmentation result; a second word segmentation unit, configured to perform word segmentation on the text to be processed along a second direction according to the string-matching word segmentation strategy to obtain a second word segmentation result; and an output unit, configured to output the first word segmentation result or the second word segmentation result when the first word segmentation result is consistent with the second word segmentation result.
- A terminal, characterized by comprising a processor configured to call stored program instructions, wherein: the processor is configured to acquire text to be processed; the processor is further configured to perform word segmentation on the text to be processed along a first direction according to a string-matching word segmentation strategy to obtain a first word segmentation result; the processor is further configured to perform word segmentation on the text to be processed along a second direction according to the string-matching word segmentation strategy to obtain a second word segmentation result; and the processor is further configured to output the first word segmentation result or the second word segmentation result when the first word segmentation result is consistent with the second word segmentation result.
- The terminal according to claim 11, characterized in that the processor is further configured to: when the first word segmentation result is inconsistent with the second word segmentation result, perform word segmentation on the text to be processed through a dynamic programming algorithm to obtain a third word segmentation result.
- The terminal according to claim 12, characterized in that the processor is specifically configured to: split the text to be processed to obtain multiple individual characters; construct a directed acyclic graph according to the relevance of adjacent characters among the multiple individual characters, wherein the directed acyclic graph includes multiple paths, and each of the multiple paths includes entries and the weight corresponding to each entry; determine the weight sum of all entries on each path in the directed acyclic graph; and determine the entries on the path with the smallest weight sum as the third word segmentation result.
- The terminal according to claim 12, characterized in that the first word segmentation result is all the entries on a first path in a directed acyclic graph obtained after the text to be processed is segmented by the dynamic programming algorithm, and the second word segmentation result is all the entries on a second path in the directed acyclic graph; and the processor is further specifically configured to: determine the weight sum of all entries on the first path and the weight sum of all entries on the second path, respectively; and when the weight sum of all entries on the first path is less than the weight sum of all entries on the second path, determine the first word segmentation result as the third word segmentation result.
- The terminal according to claim 14, characterized in that the processor is further configured to: when the weight sum of all entries on the first path is greater than the weight sum of all entries on the second path, determine the second word segmentation result as the third word segmentation result.
- The terminal according to claim 11, characterized in that the processor is specifically configured to: determine the first character of the text to be processed according to the first direction; and take the first character as the current character, and match the entries formed by the current character and the M characters adjacent to it against the entries in a preset dictionary library, so as to obtain the entries beginning with the current character and thereby the first word segmentation result, wherein M is greater than or equal to 1 and less than or equal to Q, and Q is the number of characters in the text to be processed.
- The terminal according to claim 16, characterized in that the entries in the preset dictionary library are arranged in order of weight; and the processor is specifically configured to: take the first character as the current character, and match the entries formed by the current character and the M characters adjacent to it against those entries in the preset dictionary library whose weight is greater than a preset weight, so as to obtain the entries beginning with the current character and thereby the first word segmentation result.
- The terminal according to any one of claims 11-17, characterized in that the second direction is the opposite of the first direction.
- The terminal according to claim 11, characterized in that there are at least two texts to be processed, and the at least two texts to be processed belong to the same application scenario, wherein the at least two texts to be processed include a first text to be processed and a second text to be processed, and the processor is further configured to: after the word segmentation result of the first text to be processed is obtained, determine the word segmentation result of the second text to be processed according to the word segmentation result of the first text to be processed.
- A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program including program instructions that, when executed by a processor, cause the processor to perform the method according to any one of claims 1-9.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910423046.7A CN110222335A (en) | 2019-05-20 | 2019-05-20 | A kind of text segmenting method and device |
CN201910423046.7 | 2019-05-20 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2020232881A1 (en) | 2020-11-26 |
Family
ID=67821456
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2019/103069 WO2020232881A1 (en) | 2019-05-20 | 2019-08-28 | Text word segmentation method and apparatus |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN110222335A (en) |
WO (1) | WO2020232881A1 (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111274805B (en) * | 2020-01-19 | 2020-11-20 | 上海众言网络科技有限公司 | Method and device for processing suspected words |
CN111523317B (en) * | 2020-03-09 | 2023-04-07 | 平安科技(深圳)有限公司 | Voice quality inspection method and device, electronic equipment and medium |
CN112765963B (en) * | 2020-12-31 | 2024-08-06 | 北京锐安科技有限公司 | Sentence word segmentation method, sentence word segmentation device, computer equipment and storage medium |
CN114065757A (en) * | 2021-11-11 | 2022-02-18 | 东方财富信息股份有限公司 | Word segmentation method, device, system and equipment |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102915299A (en) * | 2012-10-23 | 2013-02-06 | 海信集团有限公司 | Word segmentation method and device |
CN103646018A (en) * | 2013-12-20 | 2014-03-19 | 大连大学 | Chinese word segmentation method based on hash table dictionary structure |
CN105893353A (en) * | 2016-04-20 | 2016-08-24 | 广东万丈金数信息技术股份有限公司 | Word segmentation method and word segmentation system |
CN105975454A (en) * | 2016-04-21 | 2016-09-28 | 广州精点计算机科技有限公司 | Chinese word segmentation method and device of webpage text |
US20190014071A1 (en) * | 2016-10-13 | 2019-01-10 | Tencent Technology (Shenzhen) Company Limited | Network information identification method and apparatus |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107680585B (en) * | 2017-08-23 | 2020-10-02 | 海信集团有限公司 | Chinese word segmentation method, Chinese word segmentation device and terminal |
2019
- 2019-05-20 CN CN201910423046.7A patent/CN110222335A/en active Pending
- 2019-08-28 WO PCT/CN2019/103069 patent/WO2020232881A1/en active Application Filing
Also Published As
Publication number | Publication date |
---|---|
CN110222335A (en) | 2019-09-10 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 19929640; Country of ref document: EP; Kind code of ref document: A1 |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | 122 | Ep: PCT application non-entry in European phase | Ref document number: 19929640; Country of ref document: EP; Kind code of ref document: A1 |