CN115796176A - Word segmentation processing method, computer device, storage medium, and computer program product - Google Patents

Publication number
CN115796176A
Authority
CN
China
Prior art keywords
word
word segmentation
segmentation result
field
text
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211478975.6A
Other languages
Chinese (zh)
Inventor
董文
崔路男
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Music Entertainment Technology Shenzhen Co Ltd
Original Assignee
Tencent Music Entertainment Technology Shenzhen Co Ltd

Abstract

The present application relates to a word segmentation processing method, a computer device, a storage medium, and a computer program product. The method comprises: performing word segmentation processing on a text to be segmented according to each of at least two pre-constructed dictionaries, to obtain a word segmentation result of the text to be segmented under each dictionary, wherein the at least two dictionaries belong to different lexicon fields; for each word segmentation result, determining sub-fields of the word segmentation result, and performing field disambiguation processing on the sub-fields to obtain a disambiguation field of the word segmentation result; for each word segmentation result, performing semantic disambiguation processing on the disambiguation field to obtain a target field of the word segmentation result; and fusing the target fields of the word segmentation results to obtain a target word segmentation result of the text to be segmented. The method can improve word segmentation accuracy.

Description

Word segmentation processing method, computer device, storage medium, and computer program product
Technical Field
The present application relates to the field of word segmentation technologies, and in particular, to a word segmentation processing method, a computer device, a storage medium, and a computer program product.
Background
ASCII (American Standard Code for Information Interchange) encoding is mainly used to represent modern English and other Western European languages. In the music field, ASCII-encoded text is characterized by highly varied names and by many English connecting words and short phrases. ASCII-encoded text is usually delimited by spaces; when the spaces are missing, a computer device must perform word segmentation on the text to recover correctly delimited ASCII-encoded text.
Traditional word segmentation methods usually take the longest substring obtained by matching as the word segmentation result. Such methods cannot accurately segment short words and short phrases, so their word segmentation accuracy is low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a word segmentation processing method, a computer device, a computer-readable storage medium, and a computer program product capable of improving the word segmentation accuracy.
In a first aspect, the present application provides a word segmentation processing method. The method comprises the following steps:
performing word segmentation processing on a text to be segmented according to each of at least two pre-constructed dictionaries, to obtain a word segmentation result of the text to be segmented under each dictionary, wherein the at least two dictionaries belong to different lexicon fields;
for each word segmentation result, determining sub-fields of the word segmentation result, and performing field disambiguation processing on the sub-fields to obtain a disambiguation field of the word segmentation result;
for each word segmentation result, performing semantic disambiguation processing on the disambiguation field of the word segmentation result to obtain a target field of the word segmentation result;
and fusing the target fields of the word segmentation results to obtain a target word segmentation result of the text to be segmented.
In one embodiment, performing word segmentation processing on the text to be segmented according to each of the at least two pre-constructed dictionaries, to obtain the word segmentation result of the text to be segmented under each dictionary, includes:
performing forward word segmentation processing and backward word segmentation processing on the text to be segmented according to each of the at least two dictionaries, to obtain a forward word segmentation result and a backward word segmentation result of the text to be segmented under each dictionary;
performing indentation processing, according to a preset indentation length, on the forward word segmentation result and the backward word segmentation result under each dictionary, to obtain indentation words corresponding to the forward word segmentation result and indentation words corresponding to the backward word segmentation result;
searching the at least two dictionaries for target indentation words that match the indentation words;
and updating the forward word segmentation result and the backward word segmentation result under each dictionary according to the target indentation words, to obtain the word segmentation result of the text to be segmented under each dictionary.
In one embodiment, performing indentation processing, according to the preset indentation length, on the forward word segmentation result and the backward word segmentation result under each dictionary, to obtain the indentation words corresponding to the forward word segmentation result and the indentation words corresponding to the backward word segmentation result, includes:
for each word to be verified in the forward word segmentation result and the backward word segmentation result, taking the difference between the word length of the word to be verified and the preset indentation length as a minimum word length, and taking the word length of the word to be verified as a maximum word length, to obtain a word length range of the word to be verified;
for each word length within the word length range of the word to be verified, performing indentation processing on the word to be verified in the forward word segmentation result and in the backward word segmentation result, to obtain the indentation words of the forward word segmentation result and of the backward word segmentation result at that word length.
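The word length range and the resulting indentation words can be sketched as follows. This is a minimal illustration under stated assumptions, not the method as claimed: the function name `indentation_words`, the choice to shorten a forward-segmented word from its right end and a backward-segmented word from its left end, and the floor of one character on the minimum word length are all assumptions that the text does not specify.

```python
def indentation_words(word, indent_length, direction="forward"):
    """Return {word_length: indentation_word} for every length in the range.

    Minimum length = len(word) - indent_length (assumed floor of 1),
    maximum length = len(word), as described in the embodiment.
    """
    min_len = max(1, len(word) - indent_length)
    result = {}
    for length in range(min_len, len(word) + 1):
        if direction == "forward":
            # Assumption: a forward-segmented word keeps its leading characters.
            result[length] = word[:length]
        else:
            # Assumption: a backward-segmented word keeps its trailing characters.
            result[length] = word[len(word) - length:]
    return result


# A hypothetical over-long match from a forward pass, indentation length 2.
print(indentation_words("rhythmsof", 2))
# {7: 'rhythms', 8: 'rhythmso', 9: 'rhythmsof'}
```

Each indentation word produced this way is a candidate to be looked up in the dictionaries in the next step.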
In one embodiment, searching the at least two dictionaries for the target indentation words that match the indentation words comprises:
searching the at least two dictionaries for words that match the indentation words at each word length, to obtain at least one candidate word matching the indentation words;
and screening out, from the at least one candidate word and according to the word score of each candidate word, a word that meets a preset word condition as the target indentation word matching the indentation words.
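A minimal sketch of the lookup and screening step, under stated assumptions: each dictionary is modelled as a mapping from word to its stored word frequency, the word score is taken to be that frequency, and the preset word condition is taken to be "highest score at or above a threshold". The names `pick_target_indentation_word` and `min_score` are hypothetical; the text does not define the score or the condition.

```python
def pick_target_indentation_word(indentation_words, dictionaries, min_score=1):
    """Look up each indentation word in every dictionary and keep the best one.

    Returns None when no indentation word is found or none reaches min_score.
    """
    candidates = {}
    for word in indentation_words:
        for dictionary in dictionaries:
            score = dictionary.get(word)
            if score is not None:
                # Keep the best score seen for this word across dictionaries.
                candidates[word] = max(score, candidates.get(word, 0))

    eligible = {w: s for w, s in candidates.items() if s >= min_score}
    if not eligible:
        return None
    # Assumed tie-break: prefer the longer word when scores are equal.
    return max(eligible, key=lambda w: (eligible[w], len(w)))


conventional = {"the": 500, "then": 40}   # hypothetical frequencies
music = {"then": 5}
print(pick_target_indentation_word(["the", "then", "thenx"], [conventional, music]))
# the
```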
In one embodiment, for each of the word segmentation results, determining a subfield of the word segmentation result, and performing field disambiguation on the subfield of the word segmentation result to obtain a disambiguated field of the word segmentation result, including:
for each word segmentation result, performing field segmentation processing on the word segmentation result to obtain sub-fields of the word segmentation result;
for each word segmentation result, inputting the sub-field of the word segmentation result into a word segmentation evaluation model to obtain the field score of the sub-field of the word segmentation result;
and, for each word segmentation result, screening out, from the sub-fields of the word segmentation result and according to their field scores, the sub-fields that meet a preset field condition as the disambiguation fields of the word segmentation result.
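The scoring and screening of sub-fields can be illustrated with a stand-in scorer. The text does not describe the word segmentation evaluation model, so the sketch below substitutes a unigram log-probability with add-one smoothing and treats "highest field score" as the preset field condition; both choices, and the names `field_score` and `pick_disambiguation_field`, are assumptions made only for illustration.

```python
import math


def field_score(field, word_freq):
    """Stand-in field score: unigram log-probability with add-one smoothing."""
    total = sum(word_freq.values()) + len(word_freq)
    return sum(math.log((word_freq.get(word, 0) + 1) / total) for word in field)


def pick_disambiguation_field(subfields, word_freq):
    """Keep the competing sub-field with the highest field score."""
    return max(subfields, key=lambda field: field_score(field, word_freq))


# Hypothetical frequencies; the two sub-fields cover the same characters.
word_freq = {"now": 300, "here": 200, "no": 50, "where": 20}
forward_subfield = ["now", "here"]
backward_subfield = ["no", "where"]
print(pick_disambiguation_field([forward_subfield, backward_subfield], word_freq))
# ['now', 'here']
```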
In one embodiment, for each word segmentation result, performing field segmentation processing on the word segmentation result to obtain sub-fields of the word segmentation result, including:
for each word segmentation result, performing field segmentation processing on the forward word segmentation result and the backward word segmentation result in the word segmentation result, according to the first and last characters of the forward word segmentation result and the first and last characters of the backward word segmentation result, to obtain sub-fields of the forward word segmentation result and sub-fields of the backward word segmentation result; wherein corresponding sub-fields of the forward word segmentation result and the backward word segmentation result have the same first character and the same last character;
and using the sub-field in the forward word segmentation result and the sub-field in the backward word segmentation result as the sub-field of the word segmentation result.
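One way to obtain corresponding sub-fields that start and end on the same characters is to cut both results at the character offsets where they share a word boundary. The sketch below takes that reading, which is an assumption: the text only states the first-character and last-character property. It also merges neighbouring spans on which both results already agree; the name `split_into_subfields` is hypothetical.

```python
from itertools import accumulate


def split_into_subfields(forward, backward):
    """Pair up sub-fields of two segmentations of the same text.

    Both results are cut at the character offsets where they share a word
    boundary, so paired sub-fields cover exactly the same characters.
    """
    if "".join(forward) != "".join(backward):
        raise ValueError("both results must segment the same text")
    shared = set(accumulate(map(len, forward))) & set(accumulate(map(len, backward)))

    def group(words):
        fields, current, offset = [], [], 0
        for word in words:
            current.append(word)
            offset += len(word)
            if offset in shared:      # boundary present in both results
                fields.append(current)
                current = []
        return fields

    pairs = []
    for f_field, b_field in zip(group(forward), group(backward)):
        agree = f_field == b_field
        if pairs and agree and pairs[-1][0] == pairs[-1][1]:
            # Extend the previous agreeing span instead of opening a new one.
            pairs[-1][0].extend(f_field)
            pairs[-1][1].extend(b_field)
        else:
            pairs.append((list(f_field), list(b_field)))
    return pairs


print(split_into_subfields(["go", "to", "now", "here"], ["go", "to", "no", "where"]))
# [(['go', 'to'], ['go', 'to']), (['now', 'here'], ['no', 'where'])]
```

Only the pairs whose two members differ need to be disambiguated; the agreeing spans can be carried forward unchanged.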
In one embodiment, for each of the word segmentation results, performing semantic disambiguation on a disambiguation field of the word segmentation result to obtain a target field of the word segmentation result, including:
for each word segmentation result, carrying out field merging processing on a disambiguation field of the word segmentation result and a context word of the disambiguation field to obtain a merged text of the disambiguation field;
for each word segmentation result, carrying out field merging processing on a candidate field of the word segmentation result and a context word of the candidate field to obtain a merged text of the candidate field; wherein the candidate field is a subfield of subfields of the word segmentation result other than the disambiguation field;
inputting the merged text of the disambiguation field and the merged text of the candidate field into a word segmentation evaluation model, respectively, to obtain a text score of the merged text of the disambiguation field and a text score of the merged text of the candidate field;
and, for each word segmentation result, screening out, from the candidate field and the disambiguation field, a target field that meets a preset text score condition, according to the text score of the merged text of the disambiguation field and the text score of the merged text of the candidate field.
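The merge-then-score idea can be sketched as below. The word segmentation evaluation model is not specified in the text, so the scorer is passed in as a function, and the example uses a toy scorer (mean log word frequency) purely as a placeholder. The names `merge_with_context`, `choose_target_field`, and `toy_text_score`, and the "highest text score wins" condition, are assumptions.

```python
import math


def merge_with_context(field, left_context, right_context):
    """Field merging: surround a field with its context words."""
    return [*left_context, *field, *right_context]


def choose_target_field(fields, left_context, right_context, score_text):
    """Score the merged text of every field and return the best-scoring field."""
    return max(
        fields,
        key=lambda field: score_text(
            merge_with_context(field, left_context, right_context)
        ),
    )


def toy_text_score(words, word_freq):
    """Placeholder for the evaluation model: mean log word frequency."""
    return sum(math.log(word_freq.get(w, 0) + 1) for w in words) / len(words)


word_freq = {"go": 100, "now": 300, "here": 200, "no": 5, "where": 5}  # hypothetical
disambiguation_field = ["now", "here"]
candidate_field = ["no", "where"]
target = choose_target_field(
    [disambiguation_field, candidate_field],
    left_context=["go"],
    right_context=[],
    score_text=lambda words: toy_text_score(words, word_freq),
)
print(target)  # ['now', 'here']
```

Passing the scorer as a parameter keeps the sketch independent of whatever model is actually used to produce text scores.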
In one embodiment, fusing the target fields of the word segmentation results to obtain the target word segmentation result of the text to be segmented includes:
splicing the target fields of each word segmentation result to obtain a spliced text of the target fields, the spliced text being the updated word segmentation result of the text to be segmented under the corresponding dictionary;
and fusing the updated word segmentation results under the dictionaries to obtain the target word segmentation result of the text to be segmented.
In one embodiment, fusing the updated word segmentation results under the dictionaries to obtain the target word segmentation result of the text to be segmented includes:
determining sub-fields of the updated word segmentation results under the dictionaries, and performing field disambiguation processing on the sub-fields of the updated word segmentation results to obtain disambiguation fields of the text to be segmented;
performing semantic disambiguation processing on the disambiguation fields of the text to be segmented to obtain target fields of the text to be segmented;
and splicing the target fields of the text to be segmented to obtain the target word segmentation result of the text to be segmented.
In a second aspect, the present application also provides a computer device. The computer device comprises a memory and a processor, the memory storing a computer program, and the processor implementing the following steps when executing the computer program:
performing word segmentation processing on a text to be segmented according to each of at least two pre-constructed dictionaries, to obtain a word segmentation result of the text to be segmented under each dictionary, wherein the at least two dictionaries belong to different lexicon fields;
for each word segmentation result, determining sub-fields of the word segmentation result, and performing field disambiguation processing on the sub-fields to obtain a disambiguation field of the word segmentation result;
for each word segmentation result, performing semantic disambiguation processing on the disambiguation field of the word segmentation result to obtain a target field of the word segmentation result;
and fusing the target fields of the word segmentation results to obtain a target word segmentation result of the text to be segmented.
In a third aspect, the present application also provides a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the following steps:
performing word segmentation processing on a text to be segmented according to each of at least two pre-constructed dictionaries, to obtain a word segmentation result of the text to be segmented under each dictionary, wherein the at least two dictionaries belong to different lexicon fields;
for each word segmentation result, determining sub-fields of the word segmentation result, and performing field disambiguation processing on the sub-fields to obtain a disambiguation field of the word segmentation result;
for each word segmentation result, performing semantic disambiguation processing on the disambiguation field of the word segmentation result to obtain a target field of the word segmentation result;
and fusing the target fields of the word segmentation results to obtain a target word segmentation result of the text to be segmented.
In a fourth aspect, the present application further provides a computer program product. The computer program product comprises a computer program which, when executed by a processor, implements the following steps:
performing word segmentation processing on a text to be segmented according to each of at least two pre-constructed dictionaries, to obtain a word segmentation result of the text to be segmented under each dictionary, wherein the at least two dictionaries belong to different lexicon fields;
for each word segmentation result, determining sub-fields of the word segmentation result, and performing field disambiguation processing on the sub-fields to obtain a disambiguation field of the word segmentation result;
for each word segmentation result, performing semantic disambiguation processing on the disambiguation field of the word segmentation result to obtain a target field of the word segmentation result;
and fusing the target fields of the word segmentation results to obtain a target word segmentation result of the text to be segmented.
According to the word segmentation processing method, computer device, storage medium, and computer program product described above, word segmentation processing is performed on the text to be segmented according to each of at least two pre-constructed dictionaries, to obtain a word segmentation result of the text to be segmented under each dictionary; segmenting the text with dictionaries from several different lexicon fields improves the word segmentation accuracy for short words and short phrases. Then, for each word segmentation result, sub-fields are determined and field disambiguation processing is performed on them to obtain a disambiguation field; disambiguating the sub-fields within each word segmentation result prevents a segmentation error in one sub-field from causing further errors in subsequent sub-fields, which further improves the word segmentation accuracy of the text to be segmented. Next, for each word segmentation result, semantic disambiguation processing is performed on the disambiguation field to obtain a target field; this resolves the ambiguity that arises in the word segmentation result because field disambiguation ignores the context information of the text to be segmented, and greatly improves the word segmentation accuracy. Finally, the target fields of the word segmentation results are fused to obtain the target word segmentation result of the text to be segmented, so that the word segmentation results of the text under the multiple dictionaries are reasonably fused and the word segmentation accuracy is greatly improved.
Drawings
FIG. 1 is a diagram of an application environment of a method for word segmentation processing in one embodiment;
FIG. 2 is a flow diagram illustrating a method for word segmentation processing according to one embodiment;
FIG. 3 is a flowchart illustrating the step of obtaining the segmentation result of the text to be segmented in each dictionary in one embodiment;
FIG. 4 is a flowchart illustrating a method for word segmentation processing according to another embodiment;
FIG. 5 is a flowchart illustrating a method for word segmentation processing in accordance with an alternative embodiment;
FIG. 6 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The word segmentation processing method provided by the embodiments of the present application can be applied in the application environment shown in fig. 1. The terminal 101 communicates with the server 102 via a network. The terminal 101 can receive a text to be segmented that is input or designated by a user; the server 102 may broadly refer to a backend system that provides services related to word segmentation processing. A data storage system may store the data that the server 102 needs to process. The data storage system may be integrated on the server 102, or may be located on the cloud or on another network server. The terminal 101 may be, but is not limited to, a personal computer, notebook computer, smart phone, tablet computer, internet of things device, or portable wearable device; the internet of things device may be a smart speaker, smart television, smart air conditioner, smart in-vehicle device, or the like, and the portable wearable device may be a smart watch, smart bracelet, head-mounted device, or the like. The server 102 may be implemented as a stand-alone server or as a server cluster composed of multiple servers.
In one embodiment, a user inputs a text to be segmented into the terminal 101, and the terminal 101 sends the text to be segmented to the server 102. The server 102 performs word segmentation processing on the text to be segmented according to each of at least two pre-constructed dictionaries, to obtain a word segmentation result of the text to be segmented under each dictionary, the at least two dictionaries belonging to different lexicon fields; for each word segmentation result, determines sub-fields of the word segmentation result and performs field disambiguation processing on the sub-fields to obtain a disambiguation field of the word segmentation result; for each word segmentation result, performs semantic disambiguation processing on the disambiguation field to obtain a target field of the word segmentation result; and fuses the target fields of the word segmentation results to obtain a target word segmentation result of the text to be segmented. After obtaining the target word segmentation result, the server 102 may also send it to the terminal 101 for display. In this case, the execution subject of the word segmentation processing method is the server 102.
In one embodiment, the word segmentation processing method can also be implemented by the server 102 alone. For example, the server 102 may obtain a text to be segmented from a background database and obtain the target word segmentation result of the text by executing the word segmentation processing method.
In one embodiment, the word segmentation processing method may also be implemented by the terminal 101 alone. For example, after acquiring a text to be segmented input by a user, the terminal 101 obtains the target word segmentation result of the text by executing the word segmentation processing method.
As can be seen from the above, in the present exemplary embodiment the execution subject of the word segmentation processing method may be the terminal 101 or the server 102; the method may also be applied to a system including the terminal 101 and the server 102 and implemented through interaction between them, which is not limited in this disclosure.
In one embodiment, as shown in fig. 2, a word segmentation processing method is provided, which is described by taking the application of the method to the server in fig. 1 as an example, and includes the following steps:
Step S201, performing word segmentation processing on the text to be segmented according to each of at least two pre-constructed dictionaries, to obtain a word segmentation result of the text to be segmented under each dictionary; the at least two dictionaries belong to different lexicon fields.
The at least two pre-constructed dictionaries may be a conventional word dictionary and a thesaurus dictionary, and may also include dictionaries from other lexicon fields, depending on the word segmentation requirements. The conventional word dictionary contains about 56,000 conventional words from the Brown Corpus, and the thesaurus dictionary contains about 373,000 de-duplicated words from the thesaurus; the word frequency of every word is stored.
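A dictionary that stores a word frequency for every word can be modelled as a simple counter. The sketch below is only an illustration: how the corpus is loaded is left out, and the lowercasing and the filter that keeps purely alphabetic ASCII tokens are assumptions, as is the name `build_frequency_dictionary`.

```python
from collections import Counter


def build_frequency_dictionary(tokens):
    """Map each de-duplicated word to the number of times it occurs.

    Assumption: tokens are lowercased, and only purely alphabetic ASCII
    tokens are kept.
    """
    return Counter(
        token.lower() for token in tokens if token.isascii() and token.isalpha()
    )


# Hypothetical token stream standing in for a corpus.
tokens = ["The", "rhythm", "of", "the", "night", "x1"]
dictionary = build_frequency_dictionary(tokens)
print(dictionary["the"])   # 2
print("x1" in dictionary)  # False
```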
The text to be segmented refers to ASCII (American Standard Code for Information Interchange) encoded text that requires word segmentation, for example text in English or another Western European language. The text to be segmented may also be ASCII-encoded text from the music field.
Specifically, the server constructs at least two dictionaries in different lexicon fields, for example a conventional word dictionary and a library word dictionary containing connecting words and short phrases from multiple music fields, and then obtains the at least two pre-constructed dictionaries; more dictionaries can be added to improve the word segmentation accuracy. The server obtains the text to be segmented and, using each of the at least two dictionaries, performs word segmentation processing followed by indentation processing on it, to obtain the word segmentation result of the text to be segmented under each dictionary. Because the at least two dictionaries cover different lexicon fields, short words from the music field in the text to be segmented can be identified more accurately.
Step S202, for each word segmentation result, determining sub-fields of the word segmentation result, and performing field disambiguation processing on the sub-fields to obtain a disambiguation field of the word segmentation result.
Wherein the sub-field refers to a field obtained by segmenting the segmentation result. The disambiguation field refers to a subfield in which a field score obtained through field disambiguation satisfies a preset field condition.
Specifically, the server performs field segmentation processing on the word segmentation result of the text to be segmented in each dictionary, and the word segmentation result in each dictionary comprises a forward word segmentation result and a backward word segmentation result, so that the server can perform field segmentation processing on the forward word segmentation result and the backward word segmentation result respectively, and further obtain a subfield of the forward word segmentation result and a subfield of the backward word segmentation result. And then the server carries out field disambiguation on the sub-field of the forward word segmentation result and the sub-field of the reverse word segmentation result to obtain a disambiguation field of the word segmentation result.
Step S203, for each word segmentation result, performing semantic disambiguation processing on the disambiguation field of the word segmentation result to obtain a target field of the word segmentation result.
The target field refers to the word segmentation result under each dictionary that is obtained by processing the text to be segmented.
Specifically, for each word segmentation result, the server merges all disambiguation fields of the word segmentation result to obtain a merged text of the disambiguation fields, and determines, from this merged text, the context words of each sub-field of the word segmentation result. The server then merges each sub-field with its context words to obtain a merged text of the sub-field, and performs semantic disambiguation processing on the merged texts of the sub-fields: from the merged texts of the sub-fields of the forward word segmentation result and the merged texts of the sub-fields of the backward word segmentation result, it screens out the merged text whose text semantics are relatively more accurate, and takes the sub-field corresponding to that merged text as the target field of the word segmentation result.
For example, assume that after the word segmentation result of the text to be segmented under the conventional word dictionary is processed in step S202, the first disambiguation field of the word segmentation result is ['fabulous', 'rhythms', 'of'], the second disambiguation field is ['models', 'to'], and the merged text of the first and second disambiguation fields is ['fabulous', 'rhythms', 'of', 'models', 'to']. Take the second disambiguation field as an example, and assume that the word segmentation result under the conventional word dictionary comprises a forward word segmentation result and a backward word segmentation result, where the second sub-field of the forward word segmentation result is ['models', 'to'] and the second sub-field of the backward word segmentation result is ['modest', 'o']. From the merged text of the disambiguation fields, ['fabulous', 'rhythms', 'of', 'models', 'to'], the merged text of the forward sub-field ['models', 'to'] is ['of', 'models', 'to'], and the merged text of the backward sub-field ['modest', 'o'] is ['of', 'modest', 'o']. If the text semantics of the merged text ['of', 'models', 'to'] are more accurate than those of the merged text ['of', 'modest', 'o'], the sub-field ['models', 'to'] is taken as the target field of the word segmentation result.
Step S204, fusing the target fields of the word segmentation results to obtain a target word segmentation result of the text to be segmented.
The target word segmentation result refers to a set of words of the corresponding text to be segmented, which are acquired from at least two dictionaries.
Specifically, after the processing of step S202 and step S203 described above, the server has obtained the target field of each word segmentation result. For each word segmentation result, the server carries out splicing processing on the target field of the word segmentation result to obtain a spliced text of the target field; it can be understood that the spliced text is a word segmentation result of the text to be segmented after being updated in each dictionary; and then the server fuses the updated word segmentation results in each dictionary to obtain a target word segmentation result of the text to be segmented.
For example, to verify the effect of the word segmentation processing method provided by the present disclosure, the word segmentation processing method in the present embodiment is compared with the existing word segmentation method, and the experimental results are shown in tables 1 and 2.
TABLE 1
Text to be segmented: timurakperov
Expected target word segmentation result: timur akperov
Existing word segmentation method: ['ti', 'murak', 'perov']
Word segmentation processing method in this embodiment: ['timur', 'akperov']
TABLE 2
[Table 2 appears as images in the original publication (Figure BDA0003959008630000081 and Figure BDA0003959008630000091) and is not reproduced here.]
As can be seen from Table 1, the word segmentation processing method in this embodiment accurately identifies 'timur', and is therefore more accurate than the existing word segmentation method. As can be seen from Table 2, because the method in this embodiment performs indentation processing after word segmentation processing, it can accurately identify the short word 'The', again with higher accuracy than the existing word segmentation method.
According to the word segmentation processing method described above, word segmentation processing is performed on the text to be segmented according to each of at least two pre-constructed dictionaries, to obtain a word segmentation result of the text to be segmented under each dictionary; segmenting the text with dictionaries from several different lexicon fields improves the word segmentation accuracy for short words and short phrases. Then, for each word segmentation result, sub-fields are determined and field disambiguation processing is performed on them to obtain a disambiguation field; disambiguating the sub-fields within each word segmentation result prevents a segmentation error in one sub-field from causing further errors in subsequent sub-fields, which further improves the word segmentation accuracy of the text to be segmented. Next, for each word segmentation result, semantic disambiguation processing is performed on the disambiguation field to obtain a target field; this resolves the ambiguity that arises in the word segmentation result because field disambiguation ignores the context information of the text to be segmented, and greatly improves the word segmentation accuracy. Finally, the target fields of the word segmentation results are fused to obtain the target word segmentation result of the text to be segmented, so that the word segmentation results of the text under the multiple dictionaries are reasonably fused and the word segmentation accuracy is greatly improved.
In an embodiment, as shown in fig. 3, in the step S201, the word segmentation processing is performed on the text to be word segmented respectively according to at least two pre-constructed dictionaries, so as to obtain the word segmentation result of the text to be word segmented under each dictionary, which specifically includes the following contents:
Step S301, according to at least two dictionaries, forward and reverse word segmentation processing is respectively carried out on the text to be segmented, and a forward word segmentation result and a reverse word segmentation result of the text to be segmented under each dictionary are obtained.
The forward word segmentation result refers to a set of words obtained after the text to be word segmented is subjected to forward word segmentation processing. The reverse word segmentation result refers to a word set obtained after the text to be segmented is subjected to reverse word segmentation processing.
Specifically, the server performs forward and reverse word segmentation processing on the text to be segmented under each dictionary. This may be forward maximum matching processing and reverse maximum matching processing: the text to be segmented is segmented and matched from left to right, using the length of the longest word in each dictionary as the maximum length, to obtain the forward word segmentation result of the text under each dictionary; likewise, the text is segmented and matched from right to left, using the same maximum length, to obtain the reverse word segmentation result of the text under each dictionary. Maximum matching processing refers to segmenting and matching in a given direction (forward or reverse) based on the longest word existing in the dictionary.
It should be noted that maximum matching is influenced by the matching direction and by the dictionary, and it favors longer words. However, the text to be segmented in the present disclosure is ASCII-encoded text, which contains many short words, such as to and the. Short words or musical short sentences in the text to be segmented may therefore be ignored by the forward word segmentation result and the reverse word segmentation result obtained in step S301. To avoid this, the forward word segmentation result and the reverse word segmentation result under each dictionary need to be processed further.
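The forward and reverse maximum matching described above can be sketched as follows. This is a minimal illustration rather than the implementation of the present disclosure: the function names, the single-character fallback for unmatched text, and the toy dictionary TOY_DICT (chosen so that the output reproduces the fabulous rhythms example used later in this description) are all assumptions.

```python
def forward_max_match(text, dictionary):
    """Greedy left-to-right segmentation, trying the longest window first."""
    max_len = max(map(len, dictionary))
    words, i = [], 0
    while i < len(text):
        # Shrink the window until a dictionary word is found; a single
        # character is accepted as a fallback (assumption of this sketch).
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def reverse_max_match(text, dictionary):
    """Greedy right-to-left segmentation, trying the longest window first."""
    max_len = max(map(len, dictionary))
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    return words[::-1]  # restore left-to-right order


# Hypothetical toy dictionary and input, for illustration only.
TOY_DICT = {"fabulous", "rhythms", "rhythm", "ofm", "odes", "to", "sof", "modesto"}
TEXT = "fabulousrhythmsofmodesto"

print(forward_max_match(TEXT, TOY_DICT))  # ['fabulous', 'rhythms', 'ofm', 'odes', 'to']
print(reverse_max_match(TEXT, TOY_DICT))  # ['fabulous', 'rhythm', 'sof', 'modesto']
```

The two directions disagree after 'fabulous', which is the kind of ambiguity the later field disambiguation and semantic disambiguation steps are meant to resolve.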
Step S302, according to the preset indentation length, indentation processing is respectively carried out on the forward word segmentation result and the reverse word segmentation result under each dictionary, and indentation words corresponding to the forward word segmentation result and the indentation words corresponding to the reverse word segmentation result are obtained.
Specifically, the server performs indentation processing on the words to be verified in the forward word segmentation result and the words to be verified in the reverse word segmentation result under each dictionary according to the preset indentation length, to obtain the indentation words corresponding to each word to be verified in the forward word segmentation result and the reverse word segmentation result. A word to be verified is a word in the word segmentation result that needs indentation processing in order to verify whether it has been segmented accurately. The server may perform indentation processing on all words in the forward word segmentation result and the reverse word segmentation result, so as to improve the accuracy of the word segmentation result of the text to be segmented; alternatively, the server may select only some of the words in the forward word segmentation result and the reverse word segmentation result as words to be verified, so as to improve the word segmentation efficiency of the text to be segmented.
Step S303, from at least two dictionaries, a target indentation word matching each indentation word is queried.
Step S304, updating the forward word segmentation result and the reverse word segmentation result under each dictionary according to the target indentation words to obtain the word segmentation result of the text to be segmented under each dictionary.
Specifically, the server inquires target indentation words matched with the indentation words from at least two dictionaries; then, updating the words to be verified in the forward word segmentation result and the words to be verified in the reverse word segmentation result into target indented words, and obtaining an updated forward word segmentation result and an updated reverse word segmentation result in each dictionary; and taking the updated forward word segmentation result and the updated reverse word segmentation result as word segmentation results of the text to be word segmented in each dictionary.
For example, ASCII-encoded text contains many short words, short sentences and long single words, and some of them partially share the same characters, such as modesto and modes to, or the and themself. Suppose the text to be segmented input by a user is modesto, while the text the user expects to query is the lyric short sentence modes to. Because maximum matching processing segments and matches according to the longest word existing in the dictionary, it yields the single word modesto, which differs from the lyric short sentence the user expects to query. After indentation processing is applied to the word modesto, the short sentence modes to is obtained, which matches the lyric short sentence the user expects to query.
In the embodiment, forward and reverse word segmentation processing is respectively performed on the text to be word segmented according to at least two dictionaries to obtain a forward word segmentation result and a reverse word segmentation result of the text to be word segmented in each dictionary, so that word segmentation results of the text to be word segmented in various dictionaries can be obtained, and word segmentation processing can be performed on the text to be word segmented more comprehensively and accurately; according to the preset indentation length, indentation processing is respectively carried out on the forward word segmentation result and the reverse word segmentation result under each dictionary to obtain indentation words corresponding to the forward word segmentation result and indentation words corresponding to the reverse word segmentation result; and then searching a target indented word matched with each indented word from at least two dictionaries, updating a forward word segmentation result and a reverse word segmentation result under each dictionary according to the target indented word to obtain a word segmentation result of the text to be segmented under each dictionary, and verifying the forward word segmentation result and the reverse word segmentation result under each dictionary again through indentation processing.
In an embodiment, in the step S302, according to the preset indentation length, indentation processing is respectively performed on the forward segmentation result and the backward segmentation result in each dictionary to obtain an indentation word corresponding to the forward segmentation result and an indentation word corresponding to the backward segmentation result, which specifically includes the following contents: taking the difference between the word length of the word to be verified in the forward word segmentation result and the reverse word segmentation result and a preset indentation length as a minimum word length, and taking the word length of the word to be verified as a maximum word length to obtain a word length range of the word to be verified; according to the length of each word in the word length range of the word to be verified, indentation processing is respectively carried out on the word to be verified in the forward word segmentation result and the reverse word segmentation result, and indentation words of the forward word segmentation result and the reverse word segmentation result under the length of each word are obtained.
The preset indentation length refers to a preset parameter for determining the indentation range of the word to be verified.
Specifically, the server determines the words to be verified in the forward word segmentation result and the reverse word segmentation result, and obtains the word length of each word to be verified. The difference between the word length of the word to be verified and the preset indentation length is taken as the minimum word length, and the word length of the word to be verified is taken as the maximum word length; the minimum word length and the maximum word length together form the word length range of the word to be verified. For example, assuming that the word to be verified is darks, the maximum word length is 5; assuming that the preset indentation length is 2, the minimum word length is 5-2=3, and the word length range of the word to be verified is [3,5].
The server carries out indentation processing on each word to be verified in the forward word segmentation result and the reverse word segmentation result in sequence according to each word length in the word length range to obtain indentation words of the word to be verified under each word length; wherein, the length of each word in the word length range is an integer. For example, if the word length range is [3,5], the word lengths in the word length range include 3, 4 and 5, and the server performs indentation on the word to be verified according to the word length 3 to obtain at least one indented word, then performs indentation on the word to be verified according to the word length 4 to obtain at least one indented word, and performs indentation on the word to be verified according to the word length 5 to obtain at least one indented word.
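The word length range and the per-length indentation described above can be sketched as follows. The function names are hypothetical, and two details are assumptions of this sketch rather than statements of the present disclosure: indentation is modeled as keeping a prefix of the word, and the minimum word length is clamped to 1.

```python
def word_length_range(word, indent_length):
    """Return (minimum word length, maximum word length) for a word to be verified."""
    max_len = len(word)
    min_len = max(1, max_len - indent_length)  # clamp is an assumption
    return min_len, max_len


def indented_words(word, indent_length):
    """One indented word per integer length in the word length range.

    Indentation is modeled here as prefix truncation (assumption).
    """
    min_len, max_len = word_length_range(word, indent_length)
    return [word[:n] for n in range(min_len, max_len + 1)]


print(word_length_range("darks", 2))  # (3, 5)
print(indented_words("darks", 2))     # ['dar', 'dark', 'darks']
```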
In the embodiment, according to the word length within the word length range of the word to be verified, the word to be verified in the forward word segmentation result and the word to be verified in the reverse word segmentation result are indented to obtain the indented word of the forward word segmentation result and the indented word of the reverse word segmentation result under each word length, so that the indented word of the word to be verified in the forward word segmentation result and the word to be verified in the reverse word segmentation result under each dictionary is realized, short words and short sentences in the text to be segmented are favorably recognized, the defect that the short words and short sentences are easily ignored by maximum matching processing in the conventional technology is overcome, and the word segmentation accuracy of the short words and the short sentences in the text to be segmented is improved.
In one embodiment, the searching for target indentation words matching the respective indentation words from at least two dictionaries comprises: searching words respectively matched with the indentation words under the length of each word from at least two dictionaries to obtain at least one candidate word matched with the indentation words; and according to the word scores of the candidate words, screening out words meeting the preset word condition from at least one candidate word to serve as target indentation words matched with the indentation words.
The target indentation word refers to an indentation word with a word score meeting a preset word condition.
Specifically, the server queries the at least two dictionaries for words matching each indentation word, to obtain at least one candidate word of the word to be verified. For example, assuming that the word to be verified is darks and the preset indentation length is 2, the word length range of darks is [3,5] and the indented words of darks are dar, dark and darks. Because darks was itself obtained from a dictionary, darks is already one of the candidate words; for dar and dark, the server still needs to query whether matching words exist in the at least two dictionaries. If dar does not exist in the dictionaries and dark does, dark is taken as one of the candidate words, and the remaining s is taken as the beginning of the next indented word when the dictionaries are queried again.
When the server obtains exactly one candidate word matching the indentation words, that candidate word is taken as the target indentation word. When the server obtains two or more candidate words, the server evaluates each candidate word, for example by inputting each candidate word into a word evaluation model to obtain a word score of each candidate word, and then screens out, according to the word scores, the word meeting the preset word condition as the target indentation word matching the indentation words. The preset word condition may be that the word has the highest word score among the at least one candidate word; the word screened out in this way is used as the target indentation word corresponding to the word to be verified. For example, the server obtains two candidate words of darks, evaluates them to obtain word scores of 95 and 80, and uses the candidate word with the word score of 95 as the target indentation word.
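The candidate lookup and the score-based screening can be sketched as follows. The function names, the toy dictionaries and the toy score table are hypothetical; in particular, which candidate receives which score is an assumption made only so that the example runs, and the word evaluation model is replaced by a plain lookup table.

```python
def candidate_words(indented, dictionaries):
    """Keep the indented words that occur in at least one dictionary."""
    return [w for w in indented if any(w in d for d in dictionaries)]


def target_indentation_word(candidates, score):
    """Pick the candidate with the highest word score (None if there is none)."""
    if not candidates:
        return None
    return max(candidates, key=score)


# Hypothetical data, for illustration only.
TOY_DICTS = [{"dark", "darks"}, {"dark"}]
TOY_SCORES = {"dark": 95, "darks": 80}  # stand-in for a word evaluation model

candidates = candidate_words(["dar", "dark", "darks"], TOY_DICTS)
print(candidates)                                           # ['dark', 'darks']
print(target_indentation_word(candidates, TOY_SCORES.get))  # dark
```

With a single candidate, max simply returns that candidate, which matches the single-candidate case described above.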
In the embodiment, words respectively matched with indentation words under the length of each word are inquired from at least two dictionaries to obtain at least one candidate word matched with the indentation words; and according to the word scores of the candidate words, words meeting the preset word condition are screened from at least one candidate word and serve as target indented words matched with the indented words, the words to be verified in the forward word segmentation result and the reverse word segmentation result in each dictionary are verified again, the defect that short words are easy to ignore in maximum matching processing is overcome, and therefore the word segmentation accuracy of the short words and the short sentences in the text to be segmented is improved.
In an embodiment, in step S202, for each word segmentation result, determining a subfield of the word segmentation result, and performing field disambiguation on the subfield of the word segmentation result to obtain a disambiguation field of the word segmentation result, which specifically includes the following contents: for each word segmentation result, performing field segmentation processing on the word segmentation result to obtain sub-fields of the word segmentation result; inputting the sub-field of the word segmentation result into a word segmentation evaluation model aiming at each word segmentation result to obtain the field score of the sub-field of the word segmentation result; and aiming at each word segmentation result, screening out the sub-fields meeting the preset field condition from the sub-fields of the word segmentation result according to the field scores of the sub-fields of the word segmentation result, and taking the sub-fields as disambiguation fields of the word segmentation result.
The word segmentation evaluation model refers to a model for evaluating the score of a field. The word segmentation evaluation model may be an n-gram language model, or may be another language model.
Specifically, the server performs field segmentation processing on the segmentation result for each segmentation result to obtain sub-fields of the segmentation result. The server inputs the sub-field of each word segmentation result into a word segmentation evaluation model, and then the word segmentation evaluation model obtains a word group corresponding to the sub-field according to the number of preset words; in the multiple subfields obtained by the field segmentation processing, if the number of words included in at least one subfield is less than 3, the preset number of words may be set to 2, and otherwise, the preset number of words may be set to 3. And determining the word frequency of the word group corresponding to the subfield according to at least two dictionaries. Obtaining a probability function of the subfield according to the word frequency of the word group corresponding to the subfield; then the word segmentation evaluation model obtains the field scores of the sub-fields according to the product processing result of the probability functions of the sub-fields and the preset word number; the word segmentation evaluation model can be represented as follows:
[Formula image BDA0003959008630000131: the field score Score expressed in terms of n, t and v(t)]
wherein Score represents a field Score of the subfield; n represents the number of preset words; t represents the t-th phrase in the subfield; v (t) represents a probability function.
The probability function may be a probability function with a penalty, and the probability function v (t) may be represented as follows:
[Formula image BDA0003959008630000132: the probability function v(t) expressed in terms of p(t)]
wherein p(t) represents the word frequency of the phrase corresponding to the subfield, namely the number of times the phrase appears in the at least two dictionaries. For example, the word segmentation evaluation model may be an n-gram language model, and p(t) may be a bigram frequency or a trigram frequency of the n-gram language model, in which case the preset number of words n is two or three, respectively.
Further, the server screens out, from the sub-fields of the word segmentation result and according to their field scores, the sub-field meeting the preset field condition as the disambiguation field of the word segmentation result. The preset field condition is a field score judgment condition set for the disambiguation field. For example, the preset field condition may be that the field score is the highest: assuming that the field score of the sub-field of the forward word segmentation result in the word segmentation result is 88 and the field score of the corresponding sub-field of the reverse word segmentation result is 92, the sub-field of the reverse word segmentation result is used as the disambiguation field of the word segmentation result.
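The screening of the disambiguation field can be sketched as follows. The rule for choosing the preset number of words follows the description above; the scoring function, however, is only an illustrative stand-in (a product of phrase counts with a fixed penalty for unseen phrases) and is not the formula of the present disclosure, which is given in the formula images. All names and the toy phrase counts are hypothetical.

```python
import math


def preset_word_count(sub_fields):
    """2 if any sub-field has fewer than 3 words, otherwise 3 (as described above)."""
    return 2 if any(len(f) < 3 for f in sub_fields) else 3


def phrases(words, n):
    """Consecutive n-word phrases; a short field yields itself as one phrase."""
    if len(words) <= n:
        return [tuple(words)]
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def field_score(words, phrase_counts, n, penalty=0.01):
    """Stand-in score: product of phrase counts, with a penalty for unseen phrases."""
    return math.prod(phrase_counts.get(p, 0) or penalty for p in phrases(words, n))


def disambiguation_field(sub_fields, phrase_counts):
    """Return the sub-field with the highest field score."""
    n = preset_word_count(sub_fields)
    return max(sub_fields, key=lambda f: field_score(f, phrase_counts, n))


# Hypothetical phrase counts, for illustration only.
TOY_COUNTS = {("modes", "to"): 12, ("modest", "o"): 1}
print(disambiguation_field([["modes", "to"], ["modest", "o"]], TOY_COUNTS))
# ['modes', 'to']
```

Any other scoring function with the same signature could be substituted for field_score without changing the selection step.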
In this embodiment, for each word segmentation result, performing field segmentation processing on the word segmentation result to obtain sub-fields of the word segmentation result; inputting the sub-field of the word segmentation result into a word segmentation evaluation model to obtain the field score of the sub-field of the word segmentation result; and then according to the field scores of the sub-fields of the word segmentation result, the sub-fields meeting the preset field conditions are screened out from the sub-fields of the word segmentation result to be used as disambiguation fields of the word segmentation result, so that the field disambiguation of the sub-fields in the word segmentation result is realized, the word segmentation error can be controlled in the current sub-field, the influence on the word segmentation of the next sub-field due to the word segmentation error of the current sub-field is avoided, and the robustness of word segmentation processing is improved.
In an embodiment, for each word segmentation result, performing field segmentation processing on the word segmentation result to obtain sub-fields of the word segmentation result, which specifically includes the following contents: for each word segmentation result, performing field segmentation processing on the forward word segmentation result and the reverse word segmentation result according to the first character and the last character of the forward word segmentation result and the first character and the last character of the reverse word segmentation result in the word segmentation result to obtain a sub-field in the forward word segmentation result and a sub-field in the reverse word segmentation result; the first character and the tail character of the corresponding sub-field in the forward word segmentation result and the reverse word segmentation result are the same; and using the sub-field in the forward word segmentation result and the sub-field in the backward word segmentation result as the sub-field of the word segmentation result.
Wherein, the first character refers to the first character of the word in the segmentation result, and the tail character refers to the last character of the word in the segmentation result; for example, the first character of the word fabulous is f and the last character is s.
Specifically, for each word segmentation result, the server performs field segmentation processing on the forward word segmentation result and the reverse word segmentation result according to the first and tail characters of the words in the forward word segmentation result and in the reverse word segmentation result. This may proceed as follows. The server takes the word whose first character is the same in the forward and reverse word segmentation results as the start word of the first sub-field, finds the first subsequent words whose tail characters differ between the two results, and takes the word immediately preceding them as the end word of the first sub-field; the server thereby obtains the first sub-field in the forward word segmentation result and in the reverse word segmentation result. The server then takes the word following the end word of the first sub-field as the start word of the second sub-field, finds the first subsequent words whose tail characters are the same in the two results, and takes them as the end words of the second sub-field, thereby obtaining the second sub-field of the forward and reverse word segmentation results. The remaining sub-fields in the forward and reverse word segmentation results are obtained in the same way. As can be seen from the above, each sub-field contains all words from its start word to its end word.
For example, suppose the sentence a user needs to query is fabulous rhythms of modesto, but the text to be segmented input by the user is fabulousrhythmsofmodesto. After the server obtains and processes the text to be segmented, the forward word segmentation result of the text under a thesaurus dictionary is ['fabulous', 'rhythms', 'ofm', 'odes', 'to'] and the reverse word segmentation result is ['fabulous', 'rhythm', 'sof', 'modesto']. The word with the same first character in the forward and reverse word segmentation results is 'fabulous', and the first words with different tail characters are 'rhythms' and 'rhythm', so the first sub-field of the forward word segmentation result is ['fabulous'] and the first sub-field of the reverse word segmentation result is ['fabulous']. Then, starting from 'rhythms' and 'rhythm' respectively, the first words with the same tail character are 'to' and 'modesto', so the second sub-field of the forward word segmentation result is ['rhythms', 'ofm', 'odes', 'to'] and the second sub-field of the reverse word segmentation result is ['rhythm', 'sof', 'modesto'].
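The field segmentation in this example can be sketched as follows. Note that this sketch swaps in a different but related criterion: instead of comparing first and tail characters word by word, it cuts both results at the character offsets where the forward and reverse word segmentation results both have a word boundary. Corresponding sub-fields then cover the same stretch of text, so their first and tail characters necessarily agree. The function names are hypothetical.

```python
def boundaries(words):
    """Cumulative character offsets at which each word ends."""
    ends, pos = [], 0
    for w in words:
        pos += len(w)
        ends.append(pos)
    return ends


def split_subfields(forward, reverse):
    """Cut both segmentations at the word boundaries they have in common."""
    shared = set(boundaries(forward)) & set(boundaries(reverse))

    def cut(words):
        fields, current, pos = [], [], 0
        for w in words:
            current.append(w)
            pos += len(w)
            if pos in shared:  # both results end a word here: close the sub-field
                fields.append(current)
                current = []
        if current:  # only reached if the two results cover different text
            fields.append(current)
        return fields

    return cut(forward), cut(reverse)


fwd = ['fabulous', 'rhythms', 'ofm', 'odes', 'to']
rev = ['fabulous', 'rhythm', 'sof', 'modesto']
print(split_subfields(fwd, rev))
# ([['fabulous'], ['rhythms', 'ofm', 'odes', 'to']],
#  [['fabulous'], ['rhythm', 'sof', 'modesto']])
```

On this example the sketch yields the same two sub-fields per direction as the description above.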
In this embodiment, the sub-field of each word segmentation result is determined according to the first character and the last character of each word in each word segmentation result, so that the first character and the last character of the corresponding sub-field in the forward word segmentation result and the reverse word segmentation result are the same, and a disambiguation field or a target field is subsequently screened out from the corresponding sub-field in the forward word segmentation result and the reverse word segmentation result, thereby improving the robustness of subsequent step processing.
In an embodiment, in step S203, for each participle result, performing semantic disambiguation on the disambiguation field of the participle result to obtain a target field of the participle result, which specifically includes the following contents: for each word segmentation result, carrying out field merging processing on the disambiguation field of the word segmentation result and the context word of the disambiguation field to obtain a merged text of the disambiguation field; for each word segmentation result, carrying out field merging processing on the candidate fields of the word segmentation result and the context words of the candidate fields to obtain merged texts of the candidate fields; wherein the candidate field is a subfield of the participle result except for the disambiguation field; respectively inputting the combined text of the disambiguation field and the combined text of the candidate field into the participle evaluation model to obtain a text score of the combined text of the disambiguation field and a text score of the combined text of the candidate field; and aiming at each word segmentation result, screening the candidate field and the disambiguation field to obtain a target field meeting a preset text score condition according to the text score of the combined text of the disambiguation field and the text score of the combined text of the candidate field.
The word segmentation evaluation model in this embodiment and the word segmentation evaluation model mentioned in the above embodiment may be the same model, implemented by the formula given in the above embodiment, and the preset number of words used in the two cases may be the same or different; of course, the word segmentation evaluation model in this embodiment may also differ from the one mentioned in the above embodiment.
The context words refer to the preceding word(s) and the following word(s) of a field in the merged text.
Specifically, for each word segmentation result, the server performs field merging processing on the disambiguation field of the word segmentation result and the context word of the disambiguation field to obtain a merged text of the disambiguation field. The server takes the subfields of the word segmentation result except the disambiguation field as candidate fields, for example, if the first subfield of the forward word segmentation result in the word segmentation result is the disambiguation field, the first subfield of the reverse word segmentation result in the word segmentation result is the candidate field; the context word of the candidate field may be a context word of a disambiguation field corresponding to the candidate field in the merged text, so that the server performs field merging processing on the candidate field of the participle result and the context word of the candidate field, and may be that the server performs field merging processing on the candidate field and the context word of the disambiguation field corresponding to the candidate field in the merged text to obtain the merged text of the candidate field (hereinafter, the merged text of the candidate field may be simply referred to as the candidate text, and the merged text of the disambiguation field may be simply referred to as the disambiguation text).
The server respectively inputs the disambiguation text and the candidate text into the word segmentation evaluation model, and then the word segmentation evaluation model obtains a word group corresponding to the disambiguation text and a word group corresponding to the candidate text according to the preset number of words; determining the word frequency of the word group of the candidate text and the word frequency of the word group of the disambiguation text according to at least two dictionaries; then obtaining a probability function of the candidate text according to the word frequency of the word group of the candidate text, and obtaining a probability function of the disambiguation combined text according to the word frequency of the word group of the disambiguation text; and obtaining the text score of the candidate text according to the product processing result of the probability function of the candidate text and the preset word number, and obtaining the text score of the disambiguation text according to the product processing result of the probability function of the disambiguation text and the preset word number. And aiming at each word segmentation result, the server screens and obtains a target field meeting a preset text score condition from the candidate field and the disambiguation field according to the text score of the combined text of the disambiguation field and the text score of the combined text of the candidate field. The preset text score condition may be that the text score is the highest, if the text score of the candidate field is the highest, the candidate field is used as the target field of the word segmentation result, and if the text score of the disambiguation field is the highest, the disambiguation field is used as the target field of the word segmentation result.
For example, suppose the sub-fields of the forward and reverse word segmentation results under the general word dictionary are ['modes', 'to'] and ['modest', 'o'] respectively, and that the disambiguation field of the word segmentation result obtained after the above processing is ['modes', 'to'], so that the candidate field of the word segmentation result is ['modest', 'o']. If the context word of the disambiguation field ['modes', 'to'] is ['of'], the merged text of the disambiguation field is ['of', 'modes', 'to'] and the merged text of the candidate field ['modest', 'o'] is ['of', 'modest', 'o']. If the server obtains a text score of 98 for the merged text ['of', 'modes', 'to'] of the disambiguation field and a text score of 80 for the merged text ['of', 'modest', 'o'] of the candidate field, the target field of the word segmentation result is ['modes', 'to']. If instead the server obtains a text score of 88 for ['of', 'modes', 'to'] and a text score of 93 for ['of', 'modest', 'o'], the target field of the word segmentation result is ['modest', 'o'].
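The semantic disambiguation in this example, where modesto is split either as modes + to or as modest + o, can be sketched as follows. The function names are hypothetical, and the word segmentation evaluation model is replaced by lookup tables holding the text scores quoted in the example; keeping the disambiguation field when the two text scores are equal is an assumption of this sketch.

```python
def merged_text(field, before=(), after=()):
    """Merge a field with its preceding and following context words."""
    return list(before) + list(field) + list(after)


def target_field(disambiguation_field, candidate_field, score, before=(), after=()):
    """Return whichever field has the higher text score in context.

    max() returns the first maximal item, so the disambiguation field is
    kept on a tie (assumption of this sketch).
    """
    options = [disambiguation_field, candidate_field]
    return max(options, key=lambda f: score(merged_text(f, before, after)))


# Text scores taken from the two cases in the example above.
CASE_1 = {("of", "modes", "to"): 98, ("of", "modest", "o"): 80}
CASE_2 = {("of", "modes", "to"): 88, ("of", "modest", "o"): 93}

for table in (CASE_1, CASE_2):
    scorer = lambda words, t=table: t[tuple(words)]
    print(target_field(["modes", "to"], ["modest", "o"], scorer, before=["of"]))
# ['modes', 'to']
# ['modest', 'o']
```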
In the embodiment, the combined text of the disambiguation field is obtained by carrying out field combination processing on the disambiguation field of the participle result and the context word of the disambiguation field; carrying out field merging processing on the candidate fields of the word segmentation result and the context words of the candidate fields to obtain merged texts of the candidate fields; respectively inputting the combined text of the disambiguation field and the combined text of the candidate field into the participle evaluation model to obtain a text score of the combined text of the disambiguation field and a text score of the combined text of the candidate field; and then according to the text scores of the combined texts of the disambiguation fields and the text scores of the combined texts of the candidate fields, the target fields meeting the preset text score condition are screened from the candidate fields and the disambiguation fields, so that the problem that semantic ambiguity occurs in word segmentation results due to the fact that the context information of the texts to be segmented is ignored due to field disambiguation can be solved, and the word segmentation accuracy of the texts to be segmented is greatly improved.
In an embodiment, in step S204, the target fields of each word segmentation result are fused to obtain a target word segmentation result of the text to be word segmented, which specifically includes the following contents: splicing the target fields of each word segmentation result to obtain spliced texts of the target fields, wherein the spliced texts are the updated word segmentation results of the texts to be word segmented under each dictionary; and fusing the updated word segmentation results in each dictionary to obtain a target word segmentation result of the text to be segmented.
Specifically, the server splices the target fields of each word segmentation result to obtain the spliced text of the target fields. Because the target fields are determined from the subfields of the forward word segmentation result and the subfields of the reverse word segmentation result through field disambiguation and semantic disambiguation, they can be regarded as the subfields that remain after the word segmentation result in each dictionary has undergone both kinds of disambiguation. The spliced text obtained by splicing the target fields is therefore the updated word segmentation result of the text to be segmented in each dictionary. The server then fuses the updated word segmentation results in each dictionary to obtain the target word segmentation result of the text to be segmented.
In the embodiment, the updated word segmentation result of the text to be word segmented under each dictionary is obtained by splicing the target fields of each word segmentation result; and then the updated word segmentation results in each dictionary are fused to obtain the target word segmentation result of the text to be segmented, so that the reasonable fusion of the updated word segmentation results of the text to be segmented in a plurality of dictionaries is realized, and the word segmentation accuracy of the text to be segmented is improved.
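The splicing step amounts to concatenating the target fields of one dictionary in text order. The sketch below is only an illustration under assumptions: the function name `splice_target_fields` and the target-field values are hypothetical and are not taken from the tables of this application.

```python
from typing import Dict, List


def splice_target_fields(target_fields: List[List[str]]) -> List[str]:
    """Concatenate the target fields, in text order, into one segmentation."""
    spliced: List[str] = []
    for field in target_fields:
        spliced.extend(field)
    return spliced


# Hypothetical target fields per dictionary (illustrative values only).
per_dictionary_targets: Dict[str, List[List[str]]] = {
    "common": [["fabulous", "rhythms", "of"], ["modes", "to"]],
    "song_library": [["fabulous"], ["rhythm", "sof", "modesto"]],
}

# The spliced text under each dictionary is that dictionary's updated result.
updated_results = {
    name: splice_target_fields(fields)
    for name, fields in per_dictionary_targets.items()
}
for name, words in updated_results.items():
    print(name, words)
```

Each updated result still covers the same underlying character string, which is what allows the results of different dictionaries to be compared subfield by subfield in the later fusion step.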
In one embodiment, fusing the updated word segmentation results in each dictionary to obtain the target word segmentation result of the text to be segmented specifically includes the following: determining the subfields of the updated word segmentation results in each dictionary, and performing field disambiguation on the subfields of the updated word segmentation results to obtain the disambiguation fields of the text to be segmented; performing semantic disambiguation on the disambiguation fields of the text to be segmented to obtain the target fields of the text to be segmented; and splicing the target fields of the text to be segmented to obtain the target word segmentation result of the text to be segmented.
Specifically, the server performs field segmentation on each updated word segmentation result according to the first character and the last character of the updated word segmentation results in each dictionary, to obtain the subfields of each updated word segmentation result; corresponding subfields in the updated word segmentation results have the same first character and the same last character. The server then performs field disambiguation on the subfields of each updated word segmentation result. For instance, the server may input the subfields of each updated word segmentation result into the word segmentation evaluation model to obtain their field scores and, according to those field scores, screen out the subfields meeting the preset field condition as the disambiguation fields of the text to be segmented; the subfields corresponding to the disambiguation fields in the remaining updated word segmentation results serve as the candidate fields of the text to be segmented. For example, suppose the second subfield of the updated word segmentation result of the common word dictionary and the first subfield of the updated word segmentation result of the song library word dictionary are determined as the disambiguation fields of the text to be segmented. Then the first subfield of the updated word segmentation result of the common word dictionary is the candidate field corresponding to the first subfield (i.e., a disambiguation field) of the updated word segmentation result of the song library word dictionary, and the second subfield of the updated word segmentation result of the song library word dictionary is the candidate field corresponding to the second subfield (i.e., a disambiguation field) of the updated word segmentation result of the common word dictionary.
Further, the server performs semantic disambiguation on the disambiguation field of the text to be segmented as follows. The server merges the disambiguation field of the text to be segmented with the context word of the disambiguation field to obtain the merged text of the disambiguation field of the text to be segmented. The server likewise merges the candidate field of the text to be segmented with the context word of the candidate field to obtain the merged text of the candidate field of the text to be segmented; the context word of the candidate field may be the context word, in the merged text, of the disambiguation field to which the candidate field corresponds, in which case the candidate field is merged with that context word. The merged text of the disambiguation field and the merged text of the candidate field are each input into the word segmentation evaluation model to obtain the text score of the merged text of the disambiguation field and the text score of the merged text of the candidate field. According to these text scores, the field meeting the preset text score condition is screened from the candidate field and the disambiguation field of the text to be segmented as the target field of the text to be segmented.
And finally, splicing the target fields of the text to be word-segmented to obtain a target word-segmentation result of the text to be word-segmented.
It should be noted that the word segmentation evaluation model mentioned in this embodiment and the word segmentation evaluation model mentioned in the foregoing embodiment may be the same model and implemented by the formula mentioned in the foregoing embodiment, and the number of preset words in the word segmentation evaluation model may be the same or different; of course, the word segmentation evaluation model in the present embodiment may be different from the word segmentation evaluation model mentioned in the above embodiments.
In this embodiment, the subfields of the updated word segmentation results in each dictionary are determined and subjected to field disambiguation to obtain the disambiguation fields of the text to be segmented; the disambiguation fields are subjected to semantic disambiguation to obtain the target fields of the text to be segmented; and the target fields are spliced to obtain the target word segmentation result of the text to be segmented. Field segmentation, field disambiguation, semantic disambiguation and field splicing are thus applied to the updated word segmentation results of multiple dictionaries, so that the target word segmentation result is obtained accurately, and the repeated disambiguation and verification of the updated word segmentation results of the multiple dictionaries improves the word segmentation accuracy of the text to be segmented.
In one embodiment, as shown in fig. 4, another word segmentation processing method is provided, which is described by taking the method as an example applied to a server, and includes the following steps:
Step S401, according to at least two dictionaries, respectively performing forward and reverse word segmentation processing on the text to be segmented to obtain a forward word segmentation result and a reverse word segmentation result of the text to be segmented in each dictionary.
Wherein, the word stock fields that at least two dictionaries belong to are all different.
Step S402, taking the difference value between the word length of the word to be verified in the forward word segmentation result and the reverse word segmentation result and the preset indentation length as the minimum word length, and taking the word length of the word to be verified as the maximum word length to obtain the word length range of the word to be verified.
Step S403, according to each word length within the word length range of the word to be verified, respectively performing indentation processing on the word to be verified in the forward word segmentation result and the reverse word segmentation result to obtain the indented word of the forward word segmentation result and the indented word of the reverse word segmentation result under each word length.
Step S404, searching words respectively matched with the indentation words under each word length from at least two dictionaries to obtain at least one candidate word matched with the indentation words.
Step S405, according to the word scores of the candidate words, screening out words meeting preset word conditions from at least one candidate word as target indentation words matched with the indentation words; and according to the target indented word, updating the forward word segmentation result and the reverse word segmentation result in each dictionary to obtain the word segmentation result of the text to be segmented in each dictionary.
Step S406, for each word segmentation result, according to the first character and the last character of the forward word segmentation result and the first character and the last character of the backward word segmentation result in the word segmentation result, performing field segmentation processing on the forward word segmentation result and the backward word segmentation result to obtain the sub-field in the forward word segmentation result and the sub-field in the backward word segmentation result.
Wherein the first character and the tail character of the corresponding sub-field in the forward word segmentation result and the reverse word segmentation result are the same.
Step S407, taking the sub-field in the forward word segmentation result and the sub-field in the reverse word segmentation result as the sub-fields of the word segmentation result.
Step S408, aiming at each word segmentation result, inputting the sub-fields of the word segmentation results into the word segmentation evaluation model to obtain the field scores of the sub-fields of the word segmentation results.
Step S409, aiming at each word segmentation result, screening out subfields meeting preset field conditions from the subfields of the word segmentation result according to the field scores of the subfields of the word segmentation result, and using the subfields as disambiguation fields of the word segmentation result; and aiming at each word segmentation result, carrying out field combination processing on the disambiguation field of the word segmentation result and the context word of the disambiguation field to obtain a combined text of the disambiguation field.
Step S410, aiming at each word segmentation result, carrying out field merging processing on the candidate field of the word segmentation result and the context word of the candidate field to obtain a merged text of the candidate field; wherein the candidate field is a subfield of the segmentation result other than the disambiguation field.
Step S411, respectively inputting the combined text of the disambiguation field and the combined text of the candidate field into the word segmentation evaluation model to obtain the text score of the combined text of the disambiguation field and the text score of the combined text of the candidate field.
Step S412, aiming at each word segmentation result, according to the text score of the combined text of the disambiguation field and the text score of the combined text of the candidate field, screening the candidate field and the disambiguation field to obtain a target field meeting the preset text score condition.
Step S413, splicing the target field of each word segmentation result to obtain a spliced text of the target field, wherein the spliced text is the updated word segmentation result of the text to be segmented in each dictionary.
Step S414, determining the sub-fields of the updated word segmentation results in each dictionary, and performing field disambiguation on the sub-fields of each updated word segmentation result to obtain the disambiguation fields of the text to be segmented; and performing semantic disambiguation on the disambiguation fields of the text to be segmented to obtain the target fields of the text to be segmented.
Step S415, performing a splicing process on the target field of the text to be word-segmented to obtain a target word-segmentation result of the text to be word-segmented.
The word segmentation processing method can achieve the following beneficial effects. First, word segmentation processing is performed on the text to be segmented according to at least two pre-constructed dictionaries to obtain the word segmentation result of the text to be segmented under each dictionary; segmenting the text with dictionaries from different word stock fields improves the word segmentation accuracy for short words and short sentences. Second, the subfields of each word segmentation result are determined and subjected to field disambiguation to obtain the disambiguation field of the word segmentation result; disambiguating the subfields within each word segmentation result prevents a segmentation error in an earlier subfield from causing continued segmentation errors in subsequent subfields, which further improves the word segmentation accuracy of the text to be segmented. Third, semantic disambiguation is performed on the disambiguation field of each word segmentation result to obtain the target field of the word segmentation result; this addresses the ambiguity that arises in word segmentation results when field disambiguation ignores the context information of the text to be segmented. Finally, the target fields of each word segmentation result are fused to obtain the target word segmentation result of the text to be segmented, so that the word segmentation results of the text to be segmented in multiple dictionaries are reasonably fused and the word segmentation accuracy is improved.
In order to clarify the word segmentation processing method provided by the embodiments of the present disclosure more clearly, the following describes the word segmentation processing method in a specific embodiment. As shown in fig. 5, another word segmentation processing method is provided, which can be applied to the server in fig. 1, and specifically includes the following steps:
(1) Constructing the dictionaries: two dictionaries are constructed, namely a common word dictionary and a song library word dictionary. The common word dictionary contains about 56,000 common words from the Brown Corpus, and the song library word dictionary contains 373,000 de-duplicated words from the song library; the word frequency of every word is saved.
(2) Forward and reverse maximum matching and matching indentation: when the text the user wants to query is "fabulous rhythms of modesto" and the input text to be segmented is the unspaced string fabulousrhythmsofmodesto, the server performs forward maximum matching and reverse maximum matching on the text to be segmented to obtain the forward word segmentation result and the reverse word segmentation result of the text to be segmented in each dictionary. The forward and reverse word segmentation results of the text to be segmented are shown in Table 3.
TABLE 3 Forward and reverse word segmentation results of the text to be segmented
Direction | Common word dictionary | Song library word dictionary
Forward | ['fabulous','rhythms','of','modest','o'] | ['fabulous','rhythms','ofm','odes','to']
Reverse | ['fabulous','rhythms','of','modes','to'] | ['fabulous','rhythm','sof','modesto']
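Forward and reverse maximum matching can be sketched as below. This is a minimal illustration under assumptions: the two tiny word sets `COMMON` and `SONG_LIBRARY` are hypothetical stand-ins for the real dictionaries, chosen so that the unspaced input fabulousrhythmsofmodesto yields the results of Table 3, and falling back to a single character when nothing matches is also an assumption.

```python
from typing import List, Optional, Set


def forward_max_match(text: str, dictionary: Set[str], max_len: Optional[int] = None) -> List[str]:
    """Scan left to right, always taking the longest dictionary word."""
    max_len = max_len or max(len(w) for w in dictionary)
    result, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Assumed fallback: an unmatched single character becomes a word.
            if size == 1 or piece in dictionary:
                result.append(piece)
                i += size
                break
    return result


def reverse_max_match(text: str, dictionary: Set[str], max_len: Optional[int] = None) -> List[str]:
    """Scan right to left, always taking the longest dictionary word."""
    max_len = max_len or max(len(w) for w in dictionary)
    result, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                result.append(piece)
                j -= size
                break
    result.reverse()  # words were collected from the end of the text
    return result


TEXT = "fabulousrhythmsofmodesto"
# Hypothetical toy dictionaries, not the real 56,000 / 373,000 word lists.
COMMON = {"fabulous", "rhythms", "of", "modest", "modes", "to"}
SONG_LIBRARY = {"fabulous", "rhythms", "rhythm", "ofm", "odes", "to", "sof", "modesto"}

for name, words in (("common", COMMON), ("song library", SONG_LIBRARY)):
    print(name, forward_max_match(TEXT, words), reverse_max_match(TEXT, words))
```

The four outputs differ only because of the matching direction and the dictionary contents, which is the point made in the discussion of Table 3.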
As can be seen from Table 3, the forward and reverse word segmentation results of the text to be segmented are affected by two factors at once: the matching direction and the constructed dictionary. If the required word does not appear in the four word segmentation results, the words to be verified in the four word segmentation results are subjected to indentation processing to obtain the indented words of the words to be verified. When the length of the word to be verified is m and the preset indentation length is n, the server checks all words with lengths from m-n to m for matches in the common word dictionary and the song library word dictionary, and takes each retrieved word that matches an indented word as a candidate word. When there is exactly one candidate word, it is taken as the target indented word of the word to be verified; when there is more than one candidate word, the word meeting the preset score condition is screened out from the candidate words according to their scores and taken as the target indented word corresponding to the word to be verified. The words to be verified in the forward and reverse word segmentation results in each dictionary are then updated according to the target indented words.
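The indentation lookup can be sketched as follows. Several details are assumptions made only for illustration: indented words are taken to be prefixes of the word to be verified (suffixes may be more appropriate for reverse matching), each dictionary is modeled as a word-to-frequency mapping, the stored word frequency is used as the word score, and the "preset score condition" is read as "highest score". The frequencies shown are invented.

```python
from typing import Dict, List, Optional


def indented_words(word: str, indent_len: int) -> List[str]:
    """Prefixes of `word` with lengths from m down to m - n (never below 1)."""
    m = len(word)
    shortest = max(1, m - indent_len)
    return [word[:length] for length in range(m, shortest - 1, -1)]


def target_indented_word(
    word: str, indent_len: int, dictionaries: List[Dict[str, int]]
) -> Optional[str]:
    """Return the best-scoring dictionary hit among the indented words."""
    candidates: Dict[str, int] = {}
    for frequencies in dictionaries:
        for piece in indented_words(word, indent_len):
            if piece in frequencies:
                # Keep the highest frequency seen for this candidate word.
                candidates[piece] = max(candidates.get(piece, 0), frequencies[piece])
    if not candidates:
        return None
    # Assumed condition: highest score wins; longer words break ties.
    return max(candidates, key=lambda w: (candidates[w], len(w)))


# Hypothetical word frequencies for the two dictionaries.
common_freq = {"modest": 30, "modes": 12}
song_library_freq = {"modesto": 45}

print(indented_words("modesto", 2))
print(target_indented_word("modesto", 2, [common_freq, song_library_freq]))
```

With a single candidate the `max` call simply returns it, which covers the one-candidate case described above without a separate branch.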
(3) Common field segmentation: all words between the word with the same first character and the word with the same last character in the forward word segmentation result and the reverse word segmentation result are taken as one subfield, giving the subfields of the forward word segmentation result and the subfields of the reverse word segmentation result. Taking the forward and reverse word segmentation results of the common word dictionary and the song library word dictionary in step (2) as an example, the subfields for the common word dictionary are shown in Table 4, and the subfields for the song library word dictionary are shown in Table 5.
TABLE 4 Subfields of the forward and reverse word segmentation results of the common word dictionary
Subfield | Forward word segmentation result | Reverse word segmentation result
First subfield | ['fabulous','rhythms','of'] | ['fabulous','rhythms','of']
Second subfield | ['modes','to'] | ['modest','o']
TABLE 5 Subfields of the forward and reverse word segmentation results of the song library word dictionary
Subfield | Forward word segmentation result | Reverse word segmentation result
First subfield | ['fabulous'] | ['fabulous']
Second subfield | ['rhythms','ofm','odes','to'] | ['rhythm','sof','modesto']
Taking Table 4 as an example, the first subfields of the common word dictionary both have the f of 'fabulous' as the first character and the f of 'of' as the last character, and the second subfields both have m as the first character and o as the last character.
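One way to realize the common field segmentation is to compare the character offsets at which words end in the two segmentations. The sketch below is an interpretation, not the patented procedure: the function name `split_subfields` is hypothetical, consecutive words on which both directions agree are merged into one subfield, and each disputed stretch between two shared boundaries forms its own subfield. The inputs are the forward and reverse results listed in Table 3.

```python
from itertools import accumulate
from typing import List, Tuple

Pair = Tuple[List[str], List[str]]


def split_subfields(forward: List[str], reverse: List[str]) -> List[Pair]:
    """Split two segmentations of the same text into aligned subfield pairs."""
    if "".join(forward) != "".join(reverse):
        raise ValueError("both segmentations must cover the same text")
    f_ends = list(accumulate(len(w) for w in forward))
    r_ends = list(accumulate(len(w) for w in reverse))
    common_ends = sorted(set(f_ends) & set(r_ends))  # shared word boundaries
    pairs: List[Pair] = []
    fi = ri = 0
    for end in common_ends:
        fj, rj = f_ends.index(end) + 1, r_ends.index(end) + 1
        f_seg, r_seg = forward[fi:fj], reverse[ri:rj]
        previous_agrees = bool(pairs) and pairs[-1][0] == pairs[-1][1]
        if f_seg == r_seg and previous_agrees:
            # Extend the current run of words on which both directions agree.
            pairs[-1][0].extend(f_seg)
            pairs[-1][1].extend(r_seg)
        else:
            pairs.append((list(f_seg), list(r_seg)))
        fi, ri = fj, rj
    return pairs


common_pairs = split_subfields(
    ["fabulous", "rhythms", "of", "modest", "o"],
    ["fabulous", "rhythms", "of", "modes", "to"],
)
song_pairs = split_subfields(
    ["fabulous", "rhythms", "ofm", "odes", "to"],
    ["fabulous", "rhythm", "sof", "modesto"],
)
print(common_pairs)
print(song_pairs)
```

Every pair starts and ends at the same character positions in both segmentations, so the two members of a pair necessarily share their first and last characters.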
(4) Field disambiguation: the field scores of the subfields are obtained through an n-gram language model, which can be expressed as follows:
Figure BDA0003959008630000221
wherein Score represents the field score of the subfield; n represents the order of the n-gram language model; t represents the t-th phrase in the field; and v(t) represents a probability function.
The probability function may be a probability function with a penalty, and the probability function v (t) may be represented as follows:
Figure BDA0003959008630000222
wherein p(t) represents the word frequency of the n-gram phrase under the n-gram language model, that is, the number of times the n-gram phrase appears in the common word dictionary and the song library word dictionary; when the minimum number of words in the subfields is less than 3, n is 2; otherwise n is 3.
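The rule for choosing n and the extraction of n-gram phrases can be sketched as below. The two formulas themselves are given only as figures and are not reproduced in this text, so `field_score` is explicitly a stand-in: averaging log counts and applying a fixed penalty to unseen phrases are assumptions, as are the function names, the treatment of subfields shorter than n words as a single phrase, and the toy counts.

```python
import math
from typing import Dict, List, Sequence, Tuple

Phrase = Tuple[str, ...]


def choose_n(subfields: Sequence[Sequence[str]]) -> int:
    """n is 2 if the shortest competing subfield has fewer than 3 words, else 3."""
    return 2 if min(len(field) for field in subfields) < 3 else 3


def ngrams(words: Sequence[str], n: int) -> List[Phrase]:
    """Consecutive n-word phrases; a shorter field counts as one phrase (assumed)."""
    if len(words) < n:
        return [tuple(words)]
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


def field_score(
    words: Sequence[str], n: int, counts: Dict[Phrase, int], penalty: float = -5.0
) -> float:
    """Stand-in score, NOT the formula of this application:
    mean log-count of the phrases, with a fixed penalty for unseen phrases."""
    values = []
    for phrase in ngrams(words, n):
        count = counts.get(phrase, 0)
        values.append(math.log(count) if count > 0 else penalty)
    return sum(values) / len(values)


# Hypothetical phrase counts.
counts = {("modes", "to"): 20, ("modest", "o"): 1}
subfields = [["modes", "to"], ["modest", "o"]]
n = choose_n(subfields)
best = max(subfields, key=lambda field: field_score(field, n, counts))
print(n, best)
```

Any scoring function with the same signature could replace `field_score` without changing how n is chosen or how the best subfield is selected.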
The server obtains the disambiguation field of each word segmentation result according to the field scores of the subfields of each word segmentation result. Taking the forward and reverse word segmentation results of the common word dictionary and the song library word dictionary in step (3) as an example, the disambiguation fields of the two dictionaries are shown in Table 6.
TABLE 6 Disambiguation fields of the common word dictionary and the song library word dictionary
Figure BDA0003959008630000223
(5) Global disambiguation: the field disambiguation of the subfields in step (4) avoids, as far as possible, propagating segmentation errors between subfields, but it also prevents the subfields from obtaining the semantic information of their context, so the merged text of the target fields may be less coherent. To overcome this defect, the server performs semantic disambiguation on the disambiguation fields of the word segmentation results of the common word dictionary and the song library word dictionary, based on the merged texts of those disambiguation fields, to obtain the target fields of the word segmentation results.
(6) Multi-dictionary fusion: after obtaining the target fields of the common word dictionary and the song library word dictionary, the server splices the target fields to obtain spliced texts, and takes the spliced text under each dictionary as the updated word segmentation result of the text to be segmented under that dictionary. Steps (3) to (5) are then performed again on the updated word segmentation results in each dictionary, finally yielding the target word segmentation result of the text to be segmented. The spliced text under the common word dictionary, the spliced text under the song library word dictionary, the target fields of the text to be segmented, and the target word segmentation result are shown in Table 7.
TABLE 7 Spliced texts, target fields, and target word segmentation result
Figure BDA0003959008630000231
In the embodiment, the word segmentation processing of the text to be segmented is realized through dictionaries in various different word bank fields, the problem of continuous word segmentation error of subsequent sub-fields caused by word segmentation error of the previous sub-field is solved, the problem of ambiguity of word segmentation results caused by neglecting context information of the text to be segmented due to field disambiguation is also solved, the updated word segmentation results of the text to be segmented in a plurality of dictionaries are reasonably fused, and the word segmentation accuracy of the text to be segmented is greatly improved.
It should be understood that, although the steps in the flowcharts of the above embodiments are displayed in sequence as indicated by the arrows, they are not necessarily performed in that sequence. Unless explicitly stated otherwise herein, the steps are not restricted to the exact order shown and may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are not necessarily performed sequentially, and may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
In one embodiment, a computer device is provided, which may be a server, and its internal structure diagram may be as shown in fig. 6. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the non-volatile storage medium. The database of the computer device is used for storing data such as word segmentation results, target fields, target word segmentation results and the like. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a method of word segmentation processing.
Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, a computer device is further provided, which includes a memory and a processor, the memory stores a computer program, and the processor implements the steps of the above method embodiments when executing the computer program.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when being executed by a processor, carries out the steps of the above-mentioned method embodiments.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, carries out the steps in the method embodiments described above.
It should be noted that, the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware; the computer program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, databases, or other media used in the embodiments provided herein can include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, Resistive Random Access Memory (ReRAM), Magnetoresistive Random Access Memory (MRAM), Ferroelectric Random Access Memory (FRAM), Phase Change Memory (PCM), graphene memory, and the like. Volatile memory can include Random Access Memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM can take many forms, such as Static Random Access Memory (SRAM) or Dynamic Random Access Memory (DRAM), among others. The databases referred to in the various embodiments provided herein may include at least one of relational and non-relational databases. The non-relational database may include, but is not limited to, a blockchain-based distributed database, and the like. The processors referred to in the embodiments provided herein may be, without limitation, general purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, data processing logic devices based on quantum computing, and the like.
The technical features of the above embodiments can be combined arbitrarily. For the sake of brevity, not all possible combinations of these technical features are described; however, any combination of them should be considered to fall within the scope of the present specification as long as it involves no contradiction.
The above embodiments express only several implementations of the present application, and although their description is relatively specific and detailed, it should not be construed as limiting the scope of the present application. It should be noted that a person skilled in the art can make several variations and modifications without departing from the concept of the present application, all of which fall within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (12)

1. A word segmentation processing method, characterized in that the method comprises:
according to at least two pre-constructed dictionaries, performing word segmentation processing on a text to be segmented respectively to obtain word segmentation results of the text to be segmented in each dictionary; the word stock fields of the at least two dictionaries are different;
determining the sub-field of the word segmentation result aiming at each word segmentation result, and carrying out field disambiguation on the sub-field of the word segmentation result to obtain a disambiguation field of the word segmentation result;
performing semantic disambiguation on disambiguation fields of the word segmentation results aiming at each word segmentation result to obtain target fields of the word segmentation results;
and fusing the target field of each word segmentation result to obtain a target word segmentation result of the text to be word segmented.
2. The method according to claim 1, wherein the obtaining a word segmentation result of the text to be word segmented in each dictionary by performing word segmentation processing on the text to be word segmented respectively according to at least two pre-constructed dictionaries comprises:
according to the at least two dictionaries, performing forward word segmentation and backward word segmentation on the text to be word segmented respectively to obtain a forward word segmentation result and a backward word segmentation result of the text to be word segmented in each dictionary;
according to a preset indentation length, respectively carrying out indentation processing on the forward word segmentation result and the reverse word segmentation result under each dictionary to obtain an indentation word corresponding to the forward word segmentation result and an indentation word corresponding to the reverse word segmentation result;
searching target indentation words matched with the indentation words from the at least two dictionaries;
and updating the forward word segmentation result and the reverse word segmentation result of each dictionary according to the target indented words to obtain the word segmentation result of the text to be segmented in each dictionary.
3. The method according to claim 2, wherein performing, according to the preset indentation length, indentation processing on the forward word segmentation result and the reverse word segmentation result under each dictionary, to obtain the indented words corresponding to the forward word segmentation result and the indented words corresponding to the reverse word segmentation result, comprises:
for a word to be verified in the forward word segmentation result and the reverse word segmentation result, taking the difference between the word length of the word to be verified and the preset indentation length as a minimum word length, and taking the word length of the word to be verified as a maximum word length, to obtain a word length range of the word to be verified; and
for each word length within the word length range of the word to be verified, performing indentation processing on the word to be verified in the forward word segmentation result and the reverse word segmentation result, to obtain the indented words of the forward word segmentation result and the reverse word segmentation result at that word length.
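The word length range of claim 3 can be illustrated with a short sketch. It assumes that indentation shortens a word by dropping characters from one end (the tail by default) and that the minimum word length is clamped to 1; neither detail is stated in the claim. `word_length_range`, `indent_word`, and the `keep` parameter are hypothetical names.

```python
def word_length_range(word, indent_len):
    """Lengths from (len(word) - indent_len) up to len(word), inclusive."""
    min_len = max(1, len(word) - indent_len)  # clamping to 1 is an assumption
    return range(min_len, len(word) + 1)


def indent_word(word, indent_len, keep="head"):
    """Map each word length in the range to the indented word at that length.

    keep="head" drops trailing characters; keep="tail" drops leading ones.
    Which end is dropped is an assumption, not taken from the claim.
    """
    indented = {}
    for n in word_length_range(word, indent_len):
        indented[n] = word[:n] if keep == "head" else word[len(word) - n:]
    return indented


# Word length 4, preset indentation length 2: lengths 2, 3 and 4 are produced.
indented = indent_word("周杰伦的", 2)
```

Here `indented` is `{2: "周杰", 3: "周杰伦", 4: "周杰伦的"}`, so an over-long word produced by the initial segmentation can be checked against the dictionaries at each shorter length.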
4. The method according to claim 3, wherein retrieving, from the at least two dictionaries, the target indented words matching the indented words comprises:
retrieving, from the at least two dictionaries, words matching the indented words at each word length, to obtain at least one candidate word matching the indented words; and
selecting, according to a word score of each candidate word, a word meeting a preset word condition from the at least one candidate word as the target indented word matching the indented words.
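The candidate lookup and score-based selection of claim 4 might be sketched as follows. The claim does not say where word scores come from or what the preset word condition is; this sketch assumes each dictionary stores a score per word, that matching is exact, and that the condition is "highest score at or above a threshold". All names and values are hypothetical.

```python
def find_target_indented_word(indented_by_len, dictionaries, min_score=0.0):
    """Return the best-scoring dictionary word that matches an indented word.

    indented_by_len: {word_length: indented_word}
    dictionaries:    {dictionary_name: {word: score}}  (score source assumed)
    """
    candidates = []
    for length, word in indented_by_len.items():
        for vocab in dictionaries.values():
            if word in vocab:  # exact match is an assumption
                candidates.append((vocab[word], length, word))
    eligible = [c for c in candidates if c[0] >= min_score]
    if not eligible:
        return None
    # Assumed "preset word condition": highest score, longer word on ties.
    return max(eligible)[2]


indented = {2: "周杰", 3: "周杰伦", 4: "周杰伦的"}
dictionaries = {
    "music": {"周杰伦": 0.9},
    "general": {"周杰": 0.1, "周杰伦": 0.6},
}
target = find_target_indented_word(indented, dictionaries)
```

With these sample scores the target indented word is 周杰伦; raising `min_score` above every candidate's score yields `None`, in which case the original segmentation would presumably be left unchanged.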
5. The method according to claim 1, wherein determining, for each word segmentation result, the sub-fields of the word segmentation result, and performing field disambiguation on the sub-fields to obtain a disambiguation field of the word segmentation result, comprises:
for each word segmentation result, performing field segmentation on the word segmentation result to obtain the sub-fields of the word segmentation result;
for each word segmentation result, inputting the sub-fields of the word segmentation result into a word segmentation evaluation model to obtain a field score of each sub-field; and
for each word segmentation result, selecting, according to the field scores of the sub-fields, a sub-field meeting a preset field condition from the sub-fields of the word segmentation result as the disambiguation field of the word segmentation result.
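A minimal sketch of the scoring and selection in claim 5 is given below. The word segmentation evaluation model is not specified in the claims, so a toy scoring function that penalizes single-character words stands in for it, and the preset field condition is assumed to be "highest field score". Both are placeholders, as are the function names.

```python
def toy_field_score(subfield):
    """Stand-in for the evaluation model: fewer single-character words is better."""
    return -sum(1 for word in subfield if len(word) == 1)


def disambiguate_fields(subfields, score_fn=toy_field_score):
    """Keep the sub-field(s) whose field score is highest (assumed condition)."""
    scored = [(score_fn(field), field) for field in subfields]
    best = max(score for score, _ in scored)
    return [field for score, field in scored if score == best]


# Two competing sub-fields covering the same characters.
subfields = [["研究生", "命"], ["研究", "生命"]]
disambiguation_fields = disambiguate_fields(subfields)
```

In practice `score_fn` would wrap whatever model is used; any callable that maps a list of words to a comparable score fits this interface.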
6. The method according to claim 5, wherein performing, for each word segmentation result, field segmentation on the word segmentation result to obtain the sub-fields of the word segmentation result comprises:
for each word segmentation result, performing field segmentation on the forward word segmentation result and the reverse word segmentation result in the word segmentation result according to the first character and the last character of the forward word segmentation result and the first character and the last character of the reverse word segmentation result, to obtain sub-fields of the forward word segmentation result and sub-fields of the reverse word segmentation result, wherein corresponding sub-fields of the forward word segmentation result and the reverse word segmentation result have the same first character and the same last character; and
taking the sub-fields of the forward word segmentation result and the sub-fields of the reverse word segmentation result as the sub-fields of the word segmentation result.
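One way to read the field segmentation of claim 6 is to cut both results at the character positions where the forward and reverse segmentations share a word boundary, so that paired sub-fields cover the same characters and therefore start and end with the same character. The sketch below follows that reading, which is an interpretation rather than the claimed procedure; the helper names are hypothetical.

```python
from itertools import accumulate


def boundaries(words):
    """Character offsets at which a word ends."""
    return set(accumulate(len(word) for word in words))


def split_at(words, cuts):
    """Group consecutive words into sub-fields that end at the given offsets."""
    fields, current, pos = [], [], 0
    for word in words:
        current.append(word)
        pos += len(word)
        if pos in cuts:
            fields.append(current)
            current = []
    return fields


def aligned_subfields(forward, reverse):
    """Pair forward and reverse sub-fields that span the same characters."""
    if "".join(forward) != "".join(reverse):
        raise ValueError("both results must segment the same text")
    cuts = boundaries(forward) & boundaries(reverse)
    return list(zip(split_at(forward, cuts), split_at(reverse, cuts)))


pairs = aligned_subfields(["研究生", "命", "起源"], ["研究", "生命", "起源"])
```

For this input the shared boundaries are after the fourth and sixth characters, giving one ambiguous pair (研究生/命 versus 研究/生命) and one pair on which both directions agree (起源).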
7. The method according to claim 1, wherein performing, for each word segmentation result, semantic disambiguation on the disambiguation field of the word segmentation result to obtain a target field of the word segmentation result comprises:
for each word segmentation result, merging the disambiguation field of the word segmentation result with the context words of the disambiguation field to obtain a merged text of the disambiguation field;
for each word segmentation result, merging a candidate field of the word segmentation result with the context words of the candidate field to obtain a merged text of the candidate field, wherein the candidate field is a sub-field of the word segmentation result other than the disambiguation field;
inputting the merged text of the disambiguation field and the merged text of the candidate field into a word segmentation evaluation model to obtain a text score of the merged text of the disambiguation field and a text score of the merged text of the candidate field; and
for each word segmentation result, selecting, according to the text score of the merged text of the disambiguation field and the text score of the merged text of the candidate field, a target field meeting a preset text score condition from the candidate field and the disambiguation field.
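The merge-then-score flow of claim 7 could look like the following sketch. The evaluation model is again replaced by a toy function, here a lookup of hypothetical adjacent-word pair scores, and the preset text score condition is assumed to be "highest text score, with ties going to the disambiguation field". None of these choices come from the claims.

```python
def merge_with_context(field, left_context, right_context):
    """Merged text: context words before the field, the field, context words after."""
    return list(left_context) + list(field) + list(right_context)


def semantic_disambiguate(disambiguation_field, candidate_fields,
                          left_context, right_context, text_score):
    """Pick the field whose merged text scores highest (assumed condition)."""
    options = [disambiguation_field] + list(candidate_fields)
    # max() returns the first maximal option, so ties favor the disambiguation field.
    return max(
        options,
        key=lambda f: text_score(merge_with_context(f, left_context, right_context)),
    )


# Hypothetical pair scores standing in for the evaluation model.
PAIR_SCORES = {("研究", "生命"): 3, ("生命", "起源"): 5}


def toy_text_score(words):
    """Sum the scores of adjacent word pairs in the merged text."""
    return sum(PAIR_SCORES.get(pair, 0) for pair in zip(words, words[1:]))


target_field = semantic_disambiguate(
    ["研究生", "命"],      # disambiguation field
    [["研究", "生命"]],    # candidate field(s)
    [],                    # context words before the field
    ["起源"],              # context words after the field
    toy_text_score,
)
```

In this example the candidate field wins once context is taken into account, because 生命 followed by 起源 scores well while 命 followed by 起源 does not; this shows how the semantic step can overturn the earlier field-level choice.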
8. The method according to claim 1, wherein fusing the target fields of each word segmentation result to obtain the target word segmentation result of the text to be segmented comprises:
splicing the target fields of each word segmentation result to obtain a spliced text of the target fields, the spliced text being an updated word segmentation result of the text to be segmented under the corresponding dictionary; and
fusing the updated word segmentation results under each dictionary to obtain the target word segmentation result of the text to be segmented.
9. The method according to claim 8, wherein fusing the updated word segmentation results under each dictionary to obtain the target word segmentation result of the text to be segmented comprises:
determining the sub-fields of the updated word segmentation results under each dictionary, and performing field disambiguation on those sub-fields to obtain a disambiguation field of the text to be segmented;
performing semantic disambiguation on the disambiguation field of the text to be segmented to obtain a target field of the text to be segmented; and
splicing the target fields of the text to be segmented to obtain the target word segmentation result of the text to be segmented.
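The splicing and cross-dictionary fusion of claims 8 and 9 can be sketched in simplified form. This sketch assumes that sub-fields are cut at word boundaries shared by all dictionaries' updated results and that the best-scoring sub-field is kept for each span; the semantic disambiguation pass of claim 9 is omitted for brevity, and the scoring function is a toy placeholder for the evaluation model. All names are hypothetical.

```python
from itertools import accumulate


def splice(fields):
    """Concatenate sub-fields (lists of words) into one word list."""
    return [word for field in fields for word in field]


def _cuts(words):
    """Character offsets at which a word ends."""
    return set(accumulate(len(word) for word in words))


def _split(words, cuts):
    """Group consecutive words into sub-fields ending at the given offsets."""
    fields, current, pos = [], [], 0
    for word in words:
        current.append(word)
        pos += len(word)
        if pos in cuts:
            fields.append(current)
            current = []
    return fields


def fuse(updated_results, score_fn):
    """Fuse per-dictionary results by keeping the best sub-field per span."""
    results = list(updated_results.values())
    common = set.intersection(*(_cuts(r) for r in results))
    spans = zip(*(_split(r, common) for r in results))
    return splice(max(options, key=score_fn) for options in spans)


def toy_score(field):
    """Placeholder model: fewer single-character words is better."""
    return -sum(1 for word in field if len(word) == 1)


updated = {
    "general": splice([["研究", "生命"], ["起源"]]),
    "academic": splice([["研究生", "命"], ["起源"]]),
}
target_result = fuse(updated, toy_score)
```

Here `target_result` is `["研究", "生命", "起源"]`: the two dictionaries disagree on the first span, the toy score prefers the variant without a single-character word, and the agreed span is carried over unchanged.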
10. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 9.
11. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 9.
12. A computer program product comprising a computer program which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 9.
CN202211478975.6A 2022-11-23 2022-11-23 Word segmentation processing method, computer device, storage medium, and computer program product Pending CN115796176A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211478975.6A CN115796176A (en) 2022-11-23 2022-11-23 Word segmentation processing method, computer device, storage medium, and computer program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211478975.6A CN115796176A (en) 2022-11-23 2022-11-23 Word segmentation processing method, computer device, storage medium, and computer program product

Publications (1)

Publication Number Publication Date
CN115796176A true CN115796176A (en) 2023-03-14

Family

ID=85440795

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211478975.6A Pending CN115796176A (en) 2022-11-23 2022-11-23 Word segmentation processing method, computer device, storage medium, and computer program product

Country Status (1)

Country Link
CN (1) CN115796176A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116910278A (en) * 2023-09-14 2023-10-20 深圳市智慧城市科技发展集团有限公司 Data dictionary generation method, terminal device and storage medium

Similar Documents

Publication Publication Date Title
US9223779B2 (en) Text segmentation with multiple granularity levels
KR102268875B1 (en) System and method for inputting text into electronic devices
JP5379155B2 (en) CJK name detection
US10102191B2 (en) Propagation of changes in master content to variant content
US8082270B2 (en) Fuzzy search using progressive relaxation of search terms
JPH079655B2 (en) Spelling error detection and correction method and apparatus
JPH10260968A (en) Method for dividing chinese sentence into clases and its application to chinese error check system
CN103733193A (en) Statistical spell checker
CN112800769B (en) Named entity recognition method, named entity recognition device, named entity recognition computer equipment and named entity recognition storage medium
JP7052145B2 (en) Token matching in a large document corpus
TW201822190A (en) Speech recognition system and method thereof, vocabulary establishing method and computer program product
CN113535986B (en) Data fusion method and device applied to medical knowledge graph
WO2013127060A1 (en) Techniques for transliterating input text from a first character set to a second character set
CN115796176A (en) Word segmentation processing method, computer device, storage medium, and computer program product
CN115688779A (en) Address recognition method based on self-supervision deep learning
US9965546B2 (en) Fast substring fulltext search
CN114003685B (en) Word segmentation position index construction method and device, and document retrieval method and device
CN101937450B (en) Method for retrieving items represented by particles from an information database
CN111581344A (en) Interface information auditing method and device, computer equipment and storage medium
CN110795617A (en) Error correction method and related device for search terms
CN112800314B (en) Method, system, storage medium and equipment for search engine query automatic completion
CN114896382A (en) Artificial intelligent question-answering model generation method, question-answering method, device and storage medium
US10565195B2 (en) Records based on bit-shifting
US11281736B1 (en) Search query mapping disambiguation based on user behavior
CN113076740A (en) Synonym mining method and device in government affair service field

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination