WO2021237562A1 - Text template extraction method, electronic device and storage medium - Google Patents

Text template extraction method, electronic device and storage medium

Info

Publication number
WO2021237562A1
Authority
WO
WIPO (PCT)
Prior art keywords
template
matching
sentence
processed
matrix
Prior art date
Application number
PCT/CN2020/092871
Other languages
English (en)
French (fr)
Inventor
汪庆华
Original Assignee
深圳市欢太数字科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市欢太数字科技有限公司
Priority to CN202080099874.0A (CN115803748A)
Priority to PCT/CN2020/092871 (WO2021237562A1)
Publication of WO2021237562A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions

Definitions

  • This application relates to the technical field of data processing, and in particular to a method for extracting a text template, an electronic device, and a storage medium.
  • This application provides a method for extracting a text template, an electronic device, and a storage medium.
  • An embodiment of the present application provides a method for extracting a text template.
  • The text template extraction method includes:
  • when it is determined according to the matching matrix that a matching part exists in the plurality of sentences to be processed, removing the matching part from each sentence to be processed to update the plurality of sentences to be processed, and returning to the step of determining the matching matrix of the plurality of sentences to be processed;
  • when it is determined according to the matching matrix that no matching part exists in the plurality of sentences to be processed, taking the current plurality of sentences to be processed as the final sentence group, and processing the original sentence group according to a wildcard and the final sentence group to obtain the template of the original sentence group.
  • An embodiment of the present application provides an electronic device.
  • The electronic device includes a processor configured to: obtain an original sentence group from a text data set, the original sentence group including a plurality of sentences to be processed; establish a matching matrix of the plurality of sentences to be processed; and determine, according to the matching matrix, whether a matching part exists in the plurality of sentences to be processed.
  • When it is determined according to the matching matrix that a matching part exists, the processor removes the matching part from each sentence to be processed to update the plurality of sentences to be processed, and returns to the step of determining the matching matrix of the plurality of sentences to be processed.
  • When it is determined according to the matching matrix that no matching part exists, the processor takes the current plurality of sentences to be processed as the final sentence group, and processes the original sentence group according to a wildcard and the final sentence group to obtain the template of the original sentence group.
  • An embodiment of the present application provides a non-volatile computer-readable storage medium containing computer-executable instructions.
  • When the computer-executable instructions are executed by one or more processors, the processors perform the above-mentioned text template extraction method.
  • The text template extraction method, electronic device, and storage medium of the embodiments of the present application obtain an original sentence group from a text data set, the original sentence group including multiple sentences to be processed, and then directly perform extraction on the multiple sentences to be processed to obtain the template of the original sentence group.
  • The sentences to be processed do not need to be labeled and encoded before extraction, which avoids the errors introduced by labeling and encoding and reduces manual intervention, helping to improve the efficiency and accuracy of template extraction.
  • FIG. 1 is a schematic flowchart of a text template extraction method according to an embodiment of the present application
  • FIG. 2 is a schematic diagram of modules of an electronic device according to an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a method for extracting a text template according to another embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a method for extracting a text template according to another embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a method for extracting a text template according to still another embodiment of the present application.
  • FIG. 6 is a schematic diagram of the establishment process of the matching matrix of the text template extraction method of the embodiment of the present application.
  • FIG. 7 is a schematic flowchart of a text template extraction method according to another embodiment of the present application.
  • FIG. 8 is a schematic flowchart of a method for extracting a text template according to another embodiment of the present application.
  • FIG. 9 is a schematic flowchart of a text template extraction method according to still another embodiment of the present application.
  • FIG. 10 is a schematic flowchart of a method for extracting a text template according to another embodiment of the present application.
  • FIG. 11 is a schematic flowchart of a method for extracting a text template according to another embodiment of the present application.
  • FIG. 12 is a schematic flowchart of a method for extracting a text template according to still another embodiment of the present application.
  • An embodiment of the present application provides a method for extracting a text template.
  • The text template extraction method includes:
  • Step S12: Obtain an original sentence group from a text data set, the original sentence group including a plurality of sentences to be processed;
  • Step S13: Establish a matching matrix of the plurality of sentences to be processed;
  • Step S14: Determine, according to the matching matrix, whether a matching part exists in the plurality of sentences to be processed;
  • Step S17: When it is determined according to the matching matrix that a matching part exists in the plurality of sentences to be processed, remove the matching part from each sentence to be processed to update the plurality of sentences to be processed, and return to the step of determining the matching matrix of the plurality of sentences to be processed;
  • Step S18: When it is determined according to the matching matrix that no matching part exists in the plurality of sentences to be processed, take the current plurality of sentences to be processed as the final sentence group, and process the original sentence group according to a wildcard and the final sentence group to obtain the template of the original sentence group.
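The loop formed by steps S12 to S18 can be sketched for two sentences as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function name `extract_template` is hypothetical, and the standard-library `difflib.SequenceMatcher` is swapped in for the matching matrix to find the matching part in each round.

```python
from difflib import SequenceMatcher


def extract_template(first: str, second: str, wildcard: str = "*") -> str:
    """Hypothetical sketch of steps S13-S18 for two sentences."""
    # Indices of the characters that are still "to be processed".
    keep_a = list(range(len(first)))
    keep_b = list(range(len(second)))
    while True:
        cur_a = "".join(first[i] for i in keep_a)
        cur_b = "".join(second[i] for i in keep_b)
        # Stand-in for S13/S14: difflib finds the longest common block.
        matcher = SequenceMatcher(None, cur_a, cur_b, autojunk=False)
        m = matcher.find_longest_match(0, len(cur_a), 0, len(cur_b))
        if m.size == 0:
            break  # S18: no matching part is left
        # S17: remove the matching part from both sentences and repeat.
        del keep_a[m.a:m.a + m.size]
        del keep_b[m.b:m.b + m.size]
    # S18: characters removed from the original sentence stay literal;
    # each run of remaining characters collapses into one wildcard.
    remaining = set(keep_a)
    parts, in_gap = [], False
    for i, ch in enumerate(first):
        if i in remaining:
            if not in_gap:
                parts.append(wildcard)
            in_gap = True
        else:
            parts.append(ch)
            in_gap = False
    return "".join(parts)


template = extract_template("Your code is 1234", "Your code is 5678")
```

Because matching works on individual characters, variable parts that happen to share characters (for example two English city names with the same initial letters) will leave those shared characters in the template.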
  • An embodiment of the present application provides an electronic device 100.
  • The electronic device 100 includes a processor 101.
  • The processor 101 is configured to: obtain an original sentence group from a text data set, the original sentence group including a plurality of sentences to be processed; establish a matching matrix of the plurality of sentences to be processed; determine, according to the matching matrix, whether a matching part exists in the plurality of sentences to be processed; when a matching part exists, remove the matching part from each sentence to be processed to update the plurality of sentences to be processed, and return to the step of determining the matching matrix of the plurality of sentences to be processed; and when no matching part exists, take the current plurality of sentences to be processed as the final sentence group, and process the original sentence group according to a wildcard and the final sentence group to obtain the template of the original sentence group.
  • The text template extraction method and electronic device 100 of the embodiments of the present application obtain an original sentence group from a text data set, the original sentence group including a plurality of sentences to be processed, and then directly perform extraction on the plurality of sentences to be processed to obtain the template of the original sentence group.
  • The sentences to be processed do not need to be labeled and encoded before extraction, which avoids the errors introduced by labeling and encoding and reduces manual intervention, helping to improve the efficiency and accuracy of template extraction.
  • In the related art, the text to be processed may first be vectorized (embedding), for example with the word2vec algorithm, and then clustered; each cluster is then sampled, and templates are finally extracted manually.
  • This method reduces manpower to a certain extent.
  • However, because the text must be segmented into words before vectorization, the overhead of word segmentation is unavoidable.
  • Different word segmentation algorithms may produce different results.
  • The accuracy of the word segmentation results greatly affects the accuracy of the scheme.
  • In addition, using clustering to gather data belonging to the same template is limited by the clustering algorithm: different clustering algorithms may produce different results.
  • Clustering is an unsupervised algorithm that is randomly initialized. When it is applied to a data set with an unknown distribution, a large deviation is likely to occur, which greatly increases the cost of the subsequent manual extraction.
  • The text template extraction method and electronic device 100 of the embodiments of the present application extract the template of the original sentence group directly from the multiple sentences to be processed, working on the characters of each sentence at the sentence level, without word segmentation, labeling, or encoding of the sentences before extraction.
  • This avoids the errors and extra cost introduced by word segmentation, labeling, and encoding and reduces manual intervention, which helps improve the efficiency and accuracy of template extraction.
  • The text data set may be a data set of formatted sentences.
  • The multiple sentences to be processed may be formatted sentences.
  • A formatted sentence is a sentence with a fixed format, and this format is related to the template to be extracted by the method of the embodiments of the present application. Using formatted sentences improves the likelihood that a template can be extracted from the multiple sentences to be processed, and avoids the case where no template can be extracted because the sentences are unformatted.
  • the formatted sentence can be a short message, such as a verification code type short message, and a notification type short message.
  • the text template extraction method of the embodiment of the present application can be applied to the scenario of short message template extraction.
  • text template extraction method of the embodiment of the present application can also be applied to scenarios such as push message template extraction, email subject template extraction, and spam filtering rule generation. There is no limitation here.
  • the number of sentences to be processed included in a set of original sentence groups can be: 2, 3, 4, 5, or other values. There is no limitation here.
  • The specific form of the matching matrix may correspond to the number of sentences to be processed.
  • For example, when the number of sentences to be processed is two, the matching matrix may be a two-dimensional matrix; when the number of sentences to be processed is three, the matching matrix may be a three-dimensional matrix; and when the number of sentences to be processed is four, the matching matrix may be a four-dimensional matrix.
  • the matching part may refer to at least one of the same part, similar part, and corresponding part in each sentence to be processed.
  • the specific judgment standard can be determined by the input information input by the user, which is not limited here. In this way, specific judgment criteria can be flexibly set, and the applicability of the text template extraction method can be improved.
  • In step S17 and step S18, determining according to the matching matrix that a matching part exists in the plurality of sentences to be processed means that every sentence to be processed includes the matching part.
  • The matching part is a substring of each sentence to be processed.
  • For example, when there are three sentences to be processed, namely "Shanghai", "Shenzhen", and "Beijing" (city names that share no characters in the original Chinese), it can be determined that no matching part exists in the multiple sentences to be processed.
  • Because the text template extraction method of the embodiments of the present application improves the accuracy of template extraction, the accuracy of using the template to obtain target information from information to be processed also improves; that is, the target information can be obtained more effectively and accurately.
  • In one example, the original sentence group includes three sentences to be processed, namely: "Welcome to Shanghai", "Welcome to Shenzhen", and "Welcome to Beijing".
  • The template of the original sentence group obtained is: "Welcome to *".
  • The target information, that is, the user's travel location, can then be extracted with the "Welcome to *" template.
  • In this way, the influence of noise on the extraction of target information can be avoided. Examples of noise are "The temperature in Beijing today is 10 degrees to 20 degrees" and "Your data for this month has been used up, please recharge in time".
  • In another example, the original sentence group includes two sentences to be processed, namely: "Today's temperature in Beijing is 10 to 20 degrees" and "Today's temperature in Shenzhen is 5 to 15 degrees".
  • The template of the original sentence group obtained is: "Today's temperature is from *degree to *degree".
  • The target information, that is, the temperature range at the user's travel destination, can then be extracted with the template "Today's temperature is from *degree to *degree".
  • In this way, the influence of noise on the extraction of target information can be avoided. Examples of noise are "Welcome to Beijing" and "Your data for this month has been used up, please recharge in time".
  • In yet another example, the original sentence group includes two sentences to be processed, namely: "Your data for this month has been used up, please recharge in time" and "Your phone bill has been used up this month, please recharge in time".
  • The template of the original sentence group obtained is: "Your * has been used up this month, please recharge in time".
  • In step S18, the original sentence group is processed according to the wildcard and the final sentence group, so that the template of the original sentence group includes the wildcard.
  • A wildcard matches arbitrary content and can therefore be used to perform fuzzy searches on data.
  • The template of the original sentence group can subsequently be used to search data, so that the data found conforms to the template.
  • The template of the original sentence group can be a regular expression.
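One way to use such a template for extracting target information is to convert it into a regular expression. The sketch below is an illustration only; the helper name `template_to_regex` and the choice of a non-greedy capture group for each wildcard are assumptions, not taken from the application.

```python
import re


def template_to_regex(template: str, wildcard: str = "*"):
    """Turn a wildcard template into a compiled regular expression.

    Literal parts are escaped; each wildcard becomes a capture group.
    """
    literal_parts = [re.escape(part) for part in template.split(wildcard)]
    return re.compile("(.+?)".join(literal_parts))


pattern = template_to_regex("Welcome to *")
hit = pattern.fullmatch("Welcome to Shanghai")
noise = pattern.fullmatch(
    "Your data for this month has been used up, please recharge in time"
)
city = hit.group(1) if hit else None
```

Using `fullmatch` means that a noise message which does not follow the template yields no match and is simply skipped.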
  • step S13 includes:
  • Step S131 Determine the matching score between each character to be processed in each sentence to be processed and each character to be processed in other sentences to be processed;
  • Step S132 Establish a matching matrix according to the matching score.
  • The processor 101 is used to determine the matching score between each character to be processed in each sentence to be processed and each character to be processed in the other sentences to be processed, and to establish the matching matrix according to the matching scores.
  • Establishing the matching matrix of the multiple sentences to be processed from the matching scores of the characters to be processed quantifies the degree of matching at the character level, which makes the establishment of the matching matrix more fine-grained, efficient, and accurate.
  • In step S131, for each sentence to be processed, the matching score between each character to be processed and each character to be processed in the other sentences may be determined in sequence, following the character order of the sentence. This makes the process of determining the matching scores more regular and avoids errors caused by a disordered determination process.
  • For example, there are two sentences to be processed, namely "please go north" and "please go south", each consisting of three characters in the original language ("please", "xiang", and "north" or "south"). The matching score of "please" in the first sentence and "please" in the second sentence can be determined first; then the matching score of "please" in the first sentence and "xiang" in the second sentence; then the matching score of "please" in the first sentence and "south" in the second sentence.
  • In step S132, a matching matrix is established according to the matching scores. The matching scores can be used directly as the matrix values of the matching matrix, or the matrix values can be calculated from the matching scores.
  • the specific method of establishing the matching matrix based on the matching score is not limited here.
  • The multiple sentences to be processed include a first sentence and a second sentence, and
  • step S131 includes:
  • Step S1311: When the first current character of the first sentence matches the second current character of the second sentence, use the first preset score as the matching score of the first current character and the second current character;
  • Step S1312: When the first current character does not match the second current character, use the second preset score as the matching score of the first current character and the second current character, the second preset score being less than the first preset score.
  • The plurality of sentences to be processed includes a first sentence and a second sentence.
  • The processor 101 is configured to: when the first current character of the first sentence matches the second current character of the second sentence, use the first preset score as the matching score of the first current character and the second current character;
  • and when the first current character does not match the second current character, use the second preset score as the matching score of the first current character and the second current character,
  • the second preset score being less than the first preset score.
  • In this way, the matching score is determined from only two preset values, which avoids too many kinds of matching scores, reduces computational complexity, and helps shorten the execution time of the text template extraction method.
  • the match between the first current character and the second current character may refer to at least one of the same, similarity, and correspondence between the first current character and the second current character.
  • the specific matching standard can be determined by the input information input by the user, which is not limited here. In this way, specific matching criteria can be flexibly set, and the applicability of the text template extraction method can be improved.
  • The following description takes as an example the case where "the first current character matches the second current character" means that the first current character is the same as the second current character.
  • n and m are the lengths of the first sentence and the second sentence, respectively.
  • the matching score can be determined by the following formula:
  • The first preset score may also be +1, +2, +4, or another value; the second preset score may also be -1, -2, -5, or another value; the first preset score and the second preset score may or may not be opposite numbers.
  • the specific numerical value and specific relationship of the first preset score and the second preset score are not limited here.
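Steps S1311 and S1312 can be sketched as a two-valued scoring function. The function name `match_score` and the concrete preset scores +3 and -3 are assumptions chosen for illustration; the only constraint taken from the text is that the second preset score is less than the first, and "match" is taken to mean "same character".

```python
def match_score(a: str, b: str, first_preset: int = 3, second_preset: int = -3) -> int:
    """Matching score of two characters (hypothetical preset values)."""
    if second_preset >= first_preset:
        raise ValueError("second preset score must be less than the first")
    # S1311: characters match; S1312: characters do not match.
    return first_preset if a == b else second_preset


first_sentence, second_sentence = "abc", "abd"
# Scores of every character of the first sentence against every
# character of the second sentence, in character order.
scores = [[match_score(a, b) for b in second_sentence] for a in first_sentence]
```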
  • The multiple sentences to be processed include a first sentence and a second sentence.
  • The first sentence includes the first current character,
  • the second sentence includes the second current character,
  • the matching score includes the current matching score of the first current character and the second current character,
  • the matching matrix includes the current position, and step S132 includes:
  • Step S1320: Initialize the first row and the first column of the matching matrix with a preset initial value;
  • Step S1321: Determine the first candidate value of the current position according to the current matching score and the matrix value of the upper left position of the current position;
  • Step S1322: Subtract the first penalty value from the matrix value of each upper position of the current position to obtain each upper penalty value, and use the maximum of the upper penalty values as the second candidate value of the current position;
  • Step S1323: Subtract the second penalty value from the matrix value of each left position of the current position to obtain each left penalty value, and use the maximum of the left penalty values as the third candidate value of the current position;
  • Step S1324: Use the maximum among the first candidate value, the second candidate value, the third candidate value, and the initial value as the matrix value of the current position.
  • The multiple sentences to be processed include a first sentence and a second sentence.
  • The first sentence includes the first current character,
  • the second sentence includes the second current character,
  • the matching score includes the current matching score of the first current character and the second current character,
  • and the matching matrix includes the current position.
  • The processor 101 is used to initialize the first row and the first column of the matching matrix with a preset initial value;
  • to determine the first candidate value of the current position according to the current matching score and the matrix value of the upper left position of the current position; to subtract the first penalty value from the matrix value of each upper position of the current position to obtain each upper penalty value,
  • and use the maximum of the upper penalty values as the second candidate value of the current position; to subtract the second penalty value from the matrix value of each left position of the current position to obtain each left penalty value,
  • and use the maximum of the left penalty values as the third candidate value of the current position; and to use the maximum among the first candidate value, the second candidate value, the third candidate value, and the initial value as the matrix value of the current position.
  • In this way, the first candidate value is determined according to the current matching score and the matrix value,
  • the second candidate value and the third candidate value are determined according to the matrix values and the penalty values,
  • and the matrix value of the current position is determined from the first candidate value, the second candidate value, the third candidate value, and the initial value, so that the matching matrix can be established.
  • The matrix value of the current position reflects the degree of matching between the character string from the first character of the first sentence to the first current character and the character string from the first character of the second sentence to the second current character.
  • In this way, isolated character-to-character matching is avoided, and the matrix values of the matching matrix can measure whether substrings match, which makes determining whether a matching part exists in the first sentence and the second sentence according to the matching matrix more accurate.
  • the preset initial value may be -3, -2, -1, 0, +1, +2, +3 or other values. There is no limitation here.
  • the initial value is 0. In this way, the complexity of subsequent calculations can be reduced, thereby shortening the execution time of the method.
  • In step S1321, in this embodiment, the sum of the current matching score and the matrix value at the upper left position may be used as the first candidate value. In this way, the matrix value at the current position is correlated with the matrix value at the upper left position.
  • In other implementations, the product of the current matching score and the matrix value at the upper left position may be used as the first candidate value; in still other implementations, the current matching score and the matrix value at the upper left position may be substituted into a preset formula, and the resulting value used as the first candidate value.
  • The specific manner of step S1321 is not limited here.
  • In step S1322, the corresponding first penalty value may be subtracted from the matrix value of each upper position of the current position to obtain each upper penalty value. For example, the first penalty sub-value is subtracted from the matrix value of the first upper position of the current position; the second penalty sub-value is subtracted from the matrix value of the second upper position; and the third penalty sub-value is subtracted from the matrix value of the third upper position. In this way, the matrix value of each upper position of the current position can be penalized to a different degree, making the penalty more flexible.
  • In step S1323, the corresponding second penalty value may be subtracted from the matrix value of each left position of the current position to obtain each left penalty value. For example, the first penalty sub-value is subtracted from the matrix value of the first left position of the current position; the second penalty sub-value is subtracted from the matrix value of the second left position; and the third penalty sub-value is subtracted from the matrix value of the third left position. In this way, the matrix value of each left position of the current position can be penalized to a different degree, making the penalty more flexible.
  • the same first penalty value can also be subtracted from the matrix value of each upper position of the current position to obtain each upper penalty value. It is also possible to subtract the same second penalty value from the matrix value of each left position of the current position to obtain each left penalty value.
  • the first penalty value can be -3, -2, -1, 0, +1, +2, +3 or other values.
  • the second penalty value can be -3, -2, -1, 0, +1, +2, +3 or other values.
  • the first penalty value and the second penalty value can be the same or different.
  • the first penalty value and the second penalty value are not limited here.
  • the first penalty value and the second penalty value are the same. In this way, the complexity of subsequent calculations can be reduced, thereby shortening the execution time of the method.
  • The matrix value $H_{i,j}$ of the current position of the matching matrix H is determined by the following formula (2):
  • $$H_{i,j} = \max\left\{\, H_{i-1,j-1} + S(a_i, b_j),\ \max_{k \ge 1}\{ H_{i-k,j} - W_k \},\ \max_{l \ge 1}\{ H_{i,j-l} - W_l \},\ 0 \,\right\}$$
  • $H_{i,j}$ is the matrix value of the current position of the matching matrix H
  • $H_{i-1,j-1}$ is the matrix value of the upper left position of the current position
  • $S(a_i, b_j)$ is the current matching score
  • $H_{i-k,j}$ is the matrix value of each upper position of the current position
  • $W_k$ is the first penalty value
  • k traverses from 1 to i
  • $H_{i,j-l}$ is the matrix value of each left position of the current position
  • l traverses from 1 to j
  • $W_l$ is the second penalty value
  • 0 is the initial value.
  • In formula (2), $H_{i-1,j-1} + S(a_i, b_j)$ gives the first candidate value; $\max_{k \ge 1}\{ H_{i-k,j} - W_k \}$ gives the second candidate value; $\max_{l \ge 1}\{ H_{i,j-l} - W_l \}$ gives the third candidate value; and 0 is the initial value. The maximum among the first candidate value, the second candidate value, the third candidate value, and the initial value is then used as the matrix value of the current position.
  • Because the matrix value of the current position depends on the matrix value of the upper left position of the current position, the matrix value of each upper position of the current position, and the matrix value of each left position of the current position,
  • the matching matrix H is filled from left to right and from top to bottom.
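The filling procedure of steps S1320 to S1324 for two sentences can be sketched as below. This is a hedged illustration: the function name `build_matching_matrix`, the preset scores +3 and -3, and the linear penalty W_k = 2k are all assumptions, and the same penalty function is used for the upper and left directions.

```python
def build_matching_matrix(first, second, match=3, mismatch=-3,
                          penalty=lambda k: 2 * k):
    """Fill the matching matrix H for two sentences (illustrative values)."""
    n, m = len(first), len(second)
    # S1320: row 0 and column 0 hold the preset initial value (0 here).
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # top to bottom
        for j in range(1, m + 1):      # left to right
            score = match if first[i - 1] == second[j - 1] else mismatch
            # S1321: upper-left matrix value plus current matching score.
            diagonal = H[i - 1][j - 1] + score
            # S1322: best penalized value among all upper positions.
            upper = max(H[i - k][j] - penalty(k) for k in range(1, i + 1))
            # S1323: best penalized value among all left positions.
            left = max(H[i][j - l] - penalty(l) for l in range(1, j + 1))
            # S1324: maximum of the three candidates and the initial value.
            H[i][j] = max(diagonal, upper, left, 0)
    return H


H = build_matching_matrix("ab", "ab")
```

With these illustrative parameters, the bottom-right cell accumulates the two consecutive character matches, which is what lets the matrix measure substring matches rather than isolated characters.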
  • step S14 includes:
  • Step S141: When the matrix values of the matching matrix are all the preset initial value, determine that no matching part exists in the plurality of sentences to be processed;
  • Step S142: When the matrix values of the matching matrix are not all the preset initial value, determine the maximum matrix value of the matching matrix;
  • Step S143: Backtrack through the matching matrix from the maximum matrix value to determine the matching part of the multiple sentences to be processed.
  • The processor 101 is configured to determine that no matching part exists in the multiple sentences to be processed when the matrix values of the matching matrix are all the preset initial value; to determine the maximum matrix value of the matching matrix when the matrix values are not all the preset initial value; and to backtrack through the matching matrix from the maximum matrix value to determine the matching part of the multiple sentences to be processed.
  • When the matrix values of the matching matrix are not all the preset initial value, it can be determined that a matching part exists in the multiple sentences to be processed.
  • In one example, the preset initial value is 0 and the matrix values of the matching matrix are all 0, so it is determined that no matching part exists in the plurality of sentences to be processed.
  • In another example, the preset initial value is 0 and the matching matrix contains the value 3 in addition to 0, so it can be determined that a matching part exists in the multiple sentences to be processed.
  • In step S142 and step S143, when the matching matrix has a single maximum matrix value, the matching matrix can be backtracked from that maximum matrix value to determine one matching part of the multiple sentences to be processed.
  • When the matching matrix has multiple maximum matrix values, the matching matrix can be backtracked from each of them to determine multiple matching parts of the multiple sentences to be processed.
  • The following description takes the case of a single maximum matrix value as an example.
  • step S143 the matching matrix can be traced back according to the maximum matrix value and the aforementioned formula (2) to determine the matching parts in the multiple sentences to be processed. In this way, the matching part can be accurately and efficiently determined.
  • the matrix value Hi,j of the current position is the maximum value among the first candidate value, the second candidate value, the third candidate value, and the initial value.
  • the maximum matrix value of the matching matrix must not be the initial value.
  • the maximum matrix value of the matching matrix must be related to one of the matrix value at its upper left position, the matrix value at its upper position, or the matrix value at its left position. Therefore, it is possible to trace back from the maximum matrix value to the matrix value related to the maximum matrix value, that is, to trace back to the first correlation value.
  • the backtracking can be continued in a similar manner, so as to backtrack from the first correlation value to the matrix value related to the first correlation value, that is, the second correlation value.
  • the backtracking can be continued in a similar manner to backtrack from the second correlation value to the matrix value related to the second correlation value, that is, the third correlation value. And so on, until the backtracked value is the initial value, the backtracking cannot be continued. In this way, a series of related matrix values can be obtained by backtracking.
  • Each matrix value corresponds to one character in the first sentence and one character in the second sentence. Therefore, the first substring of the first sentence and the second substring of the second sentence can be determined from the series of related matrix values obtained by backtracking. The first substring and the second substring are the matching part.
  • In some embodiments, the original sentence group includes an original sentence,
  • the final sentence group includes the final sentence corresponding to the original sentence, and
  • step S18 includes:
  • Step S181: determine the differing characters between the original sentence and the final sentence;
  • Step S182: use a wildcard to connect differing characters that are not adjacent in the original sentence.
  • The processor 101 is configured to determine the differing characters between the original sentence and the final sentence, and to use a wildcard to connect differing characters that are not adjacent in the original sentence.
  • Wildcards include, but are not limited to, at least one of "*", "?", "-", "+", and "/". There is no limitation here.
  • the original sentence is also the unprocessed sentence in the original sentence group.
  • the final sentence is the sentence in which there is no matching part. Therefore, the different characters between the original sentence and the final sentence are the characters in the matching part of the original sentence.
  • In step S182, using a wildcard to connect differing characters that are not adjacent in the original sentence means replacing the content between those characters with a predetermined number of wildcards.
  • In this embodiment, the content between differing characters that are not adjacent in the original sentence is replaced with a single wildcard.
  • For example, the original sentence group includes two original sentences: “我爱你” ("I love you") and “我讨厌你” ("I hate you"). After the matching parts are removed, two final sentences are obtained: “爱” ("love") and “讨厌” ("hate"). The original sentence “我爱你” corresponds to the final sentence “爱”, and the differing characters are “我” and “你”, which are not adjacent in “我爱你”. If the wildcard "*" replaced the content between “我” and “你” character by character, the resulting template would be “我*你”. The original sentence “我讨厌你” corresponds to the final sentence “讨厌”, and the differing characters are again “我” and “你”, which are not adjacent in “我讨厌你”. Replacing the content between them character by character would give “我**你”, because “讨厌” is two characters long. One original sentence group would then yield two templates, which is why a single wildcard is used instead.
  • In other embodiments, the characters that the original sentence and the final sentence have in common can be determined; wildcards replace those common characters to obtain a template to be processed; and each run of consecutive wildcards in the template to be processed is reduced to one wildcard to obtain the template of the original sentence group.
  • The specific way of obtaining the template of the original sentence group is not limited here.
  • Step S18 may also include: when the first character of the template differs from the first character of the original sentence, adding a wildcard before the first character of the template to update the template; and when the last character of the template differs from the last character of the original sentence, adding a wildcard after the last character of the template to update the template. In this way, the extracted template of the original sentence group is more accurate.
  • For example, the original sentence group includes two original sentences: “他说我爱你啊” (roughly "He said I love you") and “她想我讨厌你吧” (roughly "She thinks I hate you"). After the matching parts are removed, two final sentences are obtained: “他说爱啊” and “她想讨厌吧”.
  • The first character “我” of the template “我*你” differs from the first character “他” of the original sentence “他说我爱你啊”, so a wildcard is added before the first character of the template, giving “*我*你”.
  • The last character “你” of the template differs from the last character “啊” of the original sentence, so a wildcard is added after the last character of the template, giving “*我*你*”.
  • the text template extraction method includes:
  • Step S11: group the sentences in the text data set to obtain multiple original sentence groups;
  • Step S19: after the templates of all the original sentence groups are obtained, select the templates of the text data set from the templates of all the original sentence groups.
  • The processor 101 is configured to group the sentences in the text data set to obtain multiple original sentence groups, and, after the templates of all original sentence groups are obtained, to select the templates of the text data set from them.
  • In this way, the templates of the text data set are extracted with high efficiency and accuracy. Because the templates of the text data set are drawn from the templates of all original sentence groups, no group's template is omitted, which improves the accuracy of the templates of the text data set.
  • step S11 two sentences in the text data set may be grouped into one group. In this way, the comparison process can be made simpler, which is beneficial to reduce the execution time of the method.
  • step S12, step S13, step S14, step S17, and step S18 can be performed respectively to obtain the template of each original sentence group, so that it can be filtered from all the templates of the original sentence group A template for the text data set.
  • step S19 includes:
  • Step S191: de-duplicate the templates of all original sentence groups to obtain multiple candidate templates;
  • Step S192: determine the template score of each candidate template;
  • Step S193: select the templates of the text data set from the multiple candidate templates according to the template scores.
  • The processor 101 is configured to de-duplicate the templates of all original sentence groups to obtain multiple candidate templates; to determine the template score of each candidate template; and to select the templates of the text data set from the multiple candidate templates according to the template scores.
  • In this way, the templates of the text data set can be selected quickly and accurately from the templates of all the original sentence groups.
  • The templates of different original sentence groups may be identical.
  • De-duplicating the templates of all original sentence groups makes the candidate templates distinct, which avoids processing the same template repeatedly, saves computing resources, and improves the efficiency of selection.
  • screening based on template scores can quantify the screening criteria, thereby improving the accuracy of screening.
  • step S192 includes:
  • Step S1921: determine the number of repetitions of each candidate template;
  • Step S1922: match each candidate template against each sentence in the text data set to determine the number of successful matches of each candidate template in the text data set;
  • Step S1923: determine the number of non-wildcard characters in each candidate template;
  • Step S1924: determine the template score of each candidate template according to at least one of the number of repetitions, the number of successful matches, and the number of non-wildcard characters.
  • The processor 101 is configured to determine the number of repetitions of each candidate template; to match each candidate template against each sentence in the text data set to determine its number of successful matches; to determine the number of non-wildcard characters in each candidate template; and to determine the template score of each candidate template according to at least one of these three quantities.
  • In step S1921, the number of repetitions of each candidate template can be determined while step S191 de-duplicates the templates of all original sentence groups. Counting repetitions during de-duplication shortens the overall execution time of the method.
  • In step S1922, each candidate template is matched against each sentence in the text data set, that is, each candidate template is matched against the full data set.
  • For each candidate template, when it matches a sentence in the text data set, its number of successful matches is increased by 1; when it does not match a sentence, its number of successful matches is left unchanged.
  • In this way, the number of successful matches of each candidate template can be determined.
  • In step S1924, the template score of each candidate template may be determined according to one, two, or all of the number of repetitions, the number of successful matches, and the number of non-wildcard characters.
  • In this embodiment, the template score of each candidate template is determined according to all three. Because the score is based on these three dimensions, it quantifies more accurately how the candidate template behaves in the text data set, so the templates selected according to the score are more accurate.
  • For each candidate template, the following formula can be used to determine the template score: S = I · log m · log n (formula (3)), where:
  • S is the template score of the candidate template;
  • I is the number of non-wildcard characters in the candidate template;
  • m is the number of successful matches of the candidate template in the text data set;
  • n is the number of repetitions of the candidate template.
  • In this way, the templates of the text data set cover as much of the data set as possible while losing as little information as possible.
  • For example, the candidate template "*"
  • can match all data, but it retains none of the information a template should carry.
  • As another example, the candidate template "AAAA" retains all of its information but can match only one piece of data.
  • Using formula (3) to determine the template score evaluates the effectiveness of each candidate template, so that selecting the templates of the text data set from the candidate templates gives better extraction results.
  • The sum of the number of repetitions, the number of successful matches, and the number of non-wildcard characters can also be used as the template score, as can their product.
  • The specific way of determining the template score is not limited here.
  • In some other embodiments, the template score of each candidate template can be determined from the number of repetitions alone; in still other embodiments, it can be determined from the number of repetitions and the number of successful matches. There is no limitation here.
  • step S193 includes:
  • Step S1931: sort the multiple candidate templates by template score from high to low to obtain the serial number of each candidate template;
  • Step S1932: take the candidate templates whose serial numbers are less than a preset serial number as the templates of the text data set.
  • The processor 101 is configured to sort the multiple candidate templates by template score from high to low to obtain their serial numbers, and to take the candidate templates whose serial numbers are less than the preset serial number as the templates of the text data set.
  • In this way, sorting allows the templates of the text data set to be selected efficiently from the candidate templates according to the template scores. The templates selected this way are the first preset number of candidate templates in descending order of template score.
  • the preset number may be determined based on input information. In this way, the user can adjust the number of templates of the text data set as needed.
  • the preset number can also be determined based on the number of templates to be selected. For example, a predetermined ratio of the number of templates to be selected is used as the preset number.
  • the preset number can also be determined based on the number of sentences in the text data set. For example, a predetermined ratio of the number of sentences in the text data set is used as the preset number.
  • the specific method for determining the preset number is not limited here.
  • For example, the preset number is 2 and there are 5 candidate templates: "Welcome to *", "Today's temperature in * is * to * degrees", "Your * for this month has been used up, please top up in time", "Today's weather in * is *", and "* welcomes you".
  • The template score of "Welcome to *" is 3,
  • the template score of "Today's temperature in * is * to * degrees" is 5,
  • the template score of "Your * for this month has been used up, please top up in time" is 10,
  • the template score of "Today's weather in * is *" is 7, and the template score of "* welcomes you" is 4. Sorted from high to low, the two highest-scoring candidates, "Your * for this month has been used up, please top up in time" and "Today's weather in * is *", are selected as the templates of the text data set.
  • An embodiment of the present application provides a non-volatile computer-readable storage medium containing computer-executable instructions.
  • When the computer-executable instructions are executed by one or more processors 101, the processor 101 is caused to execute the above text template extraction method.
  • Step S12: obtain a group of original sentences from the text data set, the original sentence group including multiple sentences to be processed;
  • Step S13: build a matching matrix of the multiple sentences to be processed;
  • Step S14: determine, according to the matching matrix, whether a matching part exists in the multiple sentences to be processed;
  • Step S17: when it is determined according to the matching matrix that a matching part exists in the multiple sentences to be processed, remove the matching part from each sentence to be processed to update the multiple sentences to be processed, and return to the step of determining the matching matrix of the multiple sentences to be processed;
  • Step S18: when it is determined according to the matching matrix that no matching part exists in the multiple sentences to be processed, take the current multiple sentences to be processed as the final sentence group, and process the original sentence group according to a wildcard and the final sentence group to obtain the template of the original sentence group.
  • The storage medium of the embodiment of the present application obtains a group of original sentences from a text data set.
  • The original sentence group includes multiple sentences to be processed, which are then processed directly to extract the template of the original sentence group.
  • The sentences to be processed do not need to be labeled or encoded before extraction. This avoids the errors introduced by labeling and encoding and reduces manual intervention, which helps improve the efficiency and accuracy of template extraction.
  • This application proposes a text template extraction method based on the Smith-Waterman algorithm (SW algorithm).
  • The method processes text at the character level to extract templates, which avoids the extra cost and possible errors of word segmentation.
  • Manual intervention is minimized, which greatly reduces the subjective errors and costs of manual work, so that the results are as objective and the process as efficient as possible.
  • the SW algorithm is an algorithm used in bioinformatics to find similar regions between two nucleotide sequences or protein sequences. This method applies the SW algorithm to the extraction of text templates, which makes the extraction efficiency and effect better.
  • the processes in the methods of the above embodiments can be implemented by computer programs instructing relevant hardware.
  • the programs can be stored in a non-volatile computer-readable storage medium.
  • the storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

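A text template extraction method, an electronic device (100), and a storage medium. The text template extraction method includes: obtaining a group of original sentences from a text data set, the original sentence group including multiple sentences to be processed (S12); building a matching matrix of the multiple sentences to be processed (S13); determining, according to the matching matrix, whether a matching part exists in the multiple sentences to be processed (S14); when it is determined according to the matching matrix that a matching part exists, removing the matching part from each sentence to be processed to update the multiple sentences to be processed (S17), and returning to the step of determining the matching matrix of the multiple sentences to be processed; and when it is determined according to the matching matrix that no matching part exists, taking the current multiple sentences to be processed as the final sentence group and processing the original sentence group according to a wildcard and the final sentence group to obtain the template of the original sentence group (S18).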

Description

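Text Template Extraction Method, Electronic Device, and Storage Medium
Technical Field
This application relates to the technical field of data processing, and in particular to a text template extraction method, an electronic device, and a storage medium.
Background
The big-data era has produced a large amount of formatted information. This information contains a great deal of target information, such as users' usage habits and behavior habits, and is therefore of high value. The related art usually extracts templates from the formatted information so that target information can be obtained from it more effectively and accurately. However, in the related art the extraction of templates from formatted information involves a high degree of manual intervention, and the efficiency and accuracy of template extraction are low.
Summary
This application provides a text template extraction method, an electronic device, and a storage medium.
An embodiment of this application provides a text template extraction method. The text template extraction method includes:
obtaining a group of original sentences from a text data set, the original sentence group including multiple sentences to be processed;
building a matching matrix of the multiple sentences to be processed;
determining, according to the matching matrix, whether a matching part exists in the multiple sentences to be processed;
when it is determined according to the matching matrix that a matching part exists in the multiple sentences to be processed, removing the matching part from each sentence to be processed to update the multiple sentences to be processed, and returning to the step of determining the matching matrix of the multiple sentences to be processed;
when it is determined according to the matching matrix that no matching part exists in the multiple sentences to be processed, taking the current multiple sentences to be processed as a final sentence group, and processing the original sentence group according to a wildcard and the final sentence group to obtain a template of the original sentence group.
An embodiment of this application provides an electronic device. The electronic device includes a processor configured to: obtain a group of original sentences from a text data set, the original sentence group including multiple sentences to be processed; build a matching matrix of the multiple sentences to be processed; determine, according to the matching matrix, whether a matching part exists in the multiple sentences to be processed; when a matching part exists, remove the matching part from each sentence to be processed to update the multiple sentences to be processed, and return to the step of determining the matching matrix; and when no matching part exists, take the current multiple sentences to be processed as a final sentence group and process the original sentence group according to a wildcard and the final sentence group to obtain a template of the original sentence group.
An embodiment of this application provides a non-volatile computer-readable storage medium containing computer-executable instructions which, when executed by one or more processors, cause the processors to execute the above text template extraction method.
The text template extraction method, electronic device, and storage medium of the embodiments obtain a group of original sentences from a text data set, the group including multiple sentences to be processed, and then extract directly from those sentences to obtain the template of the original sentence group. The sentences do not need to be labeled or encoded before extraction, which avoids the errors introduced by labeling and encoding and reduces manual intervention, helping to improve the efficiency and accuracy of template extraction.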
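Brief Description of the Drawings
The above and/or additional aspects and advantages of this application will become apparent and easy to understand from the following description of the embodiments with reference to the accompanying drawings, in which:
FIG. 1 is a schematic flowchart of a text template extraction method according to an embodiment of this application;
FIG. 2 is a schematic block diagram of an electronic device according to an embodiment of this application;
FIG. 3 is a schematic flowchart of a text template extraction method according to another embodiment of this application;
FIG. 4 is a schematic flowchart of a text template extraction method according to yet another embodiment of this application;
FIG. 5 is a schematic flowchart of a text template extraction method according to still another embodiment of this application;
FIG. 6 is a schematic diagram of the process of building the matching matrix in a text template extraction method according to an embodiment of this application;
FIG. 7 is a schematic flowchart of a text template extraction method according to another embodiment of this application;
FIG. 8 is a schematic flowchart of a text template extraction method according to yet another embodiment of this application;
FIG. 9 is a schematic flowchart of a text template extraction method according to still another embodiment of this application;
FIG. 10 is a schematic flowchart of a text template extraction method according to another embodiment of this application;
FIG. 11 is a schematic flowchart of a text template extraction method according to yet another embodiment of this application;
FIG. 12 is a schematic flowchart of a text template extraction method according to still another embodiment of this application.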
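Detailed Description
The embodiments of this application are described in detail below. Examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended to explain this application, and should not be construed as limiting it.
Referring to FIG. 1, an embodiment of this application provides a text template extraction method. The text template extraction method includes:
Step S12: obtain a group of original sentences from the text data set, the original sentence group including multiple sentences to be processed;
Step S13: build a matching matrix of the multiple sentences to be processed;
Step S14: determine, according to the matching matrix, whether a matching part exists in the multiple sentences to be processed;
Step S17: when it is determined according to the matching matrix that a matching part exists in the multiple sentences to be processed, remove the matching part from each sentence to be processed to update the multiple sentences to be processed, and return to the step of determining the matching matrix of the multiple sentences to be processed;
Step S18: when it is determined according to the matching matrix that no matching part exists in the multiple sentences to be processed, take the current multiple sentences to be processed as the final sentence group, and process the original sentence group according to a wildcard and the final sentence group to obtain the template of the original sentence group.
Referring to FIG. 2, an embodiment of this application provides an electronic device 100. The electronic device 100 includes a processor 101 configured to: obtain a group of original sentences from the text data set, the original sentence group including multiple sentences to be processed; build a matching matrix of the multiple sentences to be processed; determine, according to the matching matrix, whether a matching part exists in the multiple sentences to be processed; when a matching part exists, remove the matching part from each sentence to be processed to update the multiple sentences to be processed, and return to the step of determining the matching matrix; and when no matching part exists, take the current multiple sentences to be processed as the final sentence group and process the original sentence group according to a wildcard and the final sentence group to obtain the template of the original sentence group.
The text template extraction method and electronic device 100 of the embodiments obtain a group of original sentences from the text data set, the group including multiple sentences to be processed, and then extract directly from those sentences to obtain the template of the original sentence group. The sentences do not need to be labeled or encoded before extraction, which avoids the errors introduced by labeling and encoding and reduces manual intervention, helping to improve the efficiency and accuracy of template extraction.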
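It can be understood that the big-data era has produced a large amount of formatted information. This information contains a great deal of target information, such as users' usage habits and behavior habits, and is therefore of high value. The related art usually extracts templates from the formatted information so that target information can be obtained from it more effectively and accurately.
For example, the related art often extracts templates from formatted information purely by hand. However, the volume of information is now growing explosively, and extracting templates purely by hand is time-consuming and laborious. A purely manual approach works only on small data sets and has no practical value for industrial applications.
As another example, the related art may first segment the text to be processed into words and vectorize them (embedding), for example with the word2vector algorithm, then cluster the vectors, sample from each cluster, and finally have a person extract the templates. This reduces manual work to some extent. However, because the text must be segmented before vectorization, the cost of word segmentation is unavoidable. Moreover, different segmentation algorithms may give different results because their source dictionaries differ, and the accuracy of the segmentation, which is limited by the dictionary, strongly affects the accuracy of the approach. In addition, gathering data that share a template by clustering is limited by the clustering algorithm, and different clustering algorithms may give different results. Clustering algorithms are also randomly initialized unsupervised algorithms; on a data set with an unknown distribution they may well produce large deviations, which greatly increases the cost of the subsequent manual extraction.
In other words, in the related art the extraction of templates from formatted information involves a high degree of manual intervention, and the efficiency and accuracy of template extraction are low.
In contrast, the text template extraction method and electronic device 100 of the embodiments work at the sentence level on the characters of the sentences and extract directly from the multiple sentences to be processed to obtain the template of the original sentence group. No word segmentation, labeling, or encoding is needed before extraction, which avoids the errors and extra cost they introduce and reduces manual intervention, helping to improve the efficiency and accuracy of template extraction.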
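In step S12, the text data set may be a data set of formatted sentences; that is, the multiple sentences to be processed may be formatted sentences. A formatted sentence is a sentence with a fixed format, and that format is related to the template the method aims to extract. This raises the likelihood that a template can be extracted from the sentences to be processed, and avoids the failure to extract a template that would occur with unformatted sentences.
A formatted sentence may be a short message, for example a verification-code message or a notification message. In other words, the text template extraction method of the embodiments can be applied to template extraction from short messages.
It can be understood that the method can also be applied to scenarios such as template extraction from push messages, template extraction from e-mail subjects, and generation of spam filtering rules. There is no limitation here.
The number of sentences to be processed in one original sentence group may be 2, 3, 4, 5, or another value. There is no limitation here.
In step S13, the form of the matching matrix may correspond to the number of sentences to be processed. For example, with two sentences to be processed the matching matrix may be two-dimensional; with three, three-dimensional; and with four, four-dimensional.
In step S14, the matching part may refer to at least one of the identical part, the similar part, and the corresponding part of each sentence to be processed. The specific criterion can be determined from information input by the user and is not limited here. The criterion can thus be set flexibly, which improves the applicability of the method.
For ease of explanation, the following takes the matching part to be the identical part of each sentence to be processed.
In steps S17 and S18, determining according to the matching matrix that a matching part exists in the multiple sentences to be processed means that every sentence to be processed contains the matching part. In other words, the matching part is a substring of every sentence to be processed.
For example, there are three sentences to be processed: “欢迎来到上海” ("Welcome to Shanghai"), “欢迎来到深圳” ("Welcome to Shenzhen"), and “欢迎来到北京” ("Welcome to Beijing"). It can be determined that a matching part exists, namely “欢迎来到” ("Welcome to").
As another example, there are three sentences to be processed: “上海” ("Shanghai"), “深圳” ("Shenzhen"), and “北京” ("Beijing"). It can be determined that no matching part exists.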
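The loop formed by steps S14 and S17 can be sketched for two sentences as follows. This is only a simplified illustration under stated assumptions: "matching" is taken to mean identical characters, and the longest common substring found by Python's `difflib.SequenceMatcher` is used as a stand-in for the matching-matrix search described in this application. The function name `strip_common_parts` is hypothetical.

```python
from difflib import SequenceMatcher
from typing import Tuple


def strip_common_parts(first: str, second: str) -> Tuple[str, str]:
    """Repeatedly remove the longest common substring from both sentences.

    Stand-in for steps S14/S17: difflib replaces the matching matrix here.
    """
    while True:
        matcher = SequenceMatcher(None, first, second, autojunk=False)
        match = matcher.find_longest_match(0, len(first), 0, len(second))
        if match.size == 0:
            # No matching part is left: these are the final sentences (S18).
            return first, second
        # Step S17: remove the matching part from each sentence and loop again.
        first = first[:match.a] + first[match.a + match.size:]
        second = second[:match.b] + second[match.b + match.size:]


print(strip_common_parts("欢迎来到上海", "欢迎来到深圳"))
```

Because removing a matched span joins the text on either side of it, a character-level loop like this can pick up short coincidental matches (for example, a digit shared by two numbers); how such cases are handled is not specified in the text and is left open here.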
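It can be understood that, because the text template extraction method of the embodiments improves the accuracy of template extraction, the accuracy of obtaining target information from the information to be processed with the template also improves; that is, target information can be obtained from formatted information more effectively and accurately.
For example, the original sentence group includes three sentences to be processed: “欢迎来到上海”, “欢迎来到深圳”, and “欢迎来到北京”. After processing by the method of the embodiments, the template of the original sentence group is “欢迎来到*” ("Welcome to *").
The template “欢迎来到*” can then be used to extract the target information, namely the user's travel destination, while avoiding the influence of noise such as “今天北京的气温是10度到20度” ("Today's temperature in Beijing is 10 to 20 degrees") and “您本月的流量已经用完请及时充值” ("Your data for this month has been used up, please top up in time").
As another example, the original sentence group includes two sentences to be processed: “今天北京的气温是10度到20度” and “今天深圳的气温是5度到15度” ("Today's temperature in Shenzhen is 5 to 15 degrees"). After processing by the method of the embodiments, the template of the original sentence group is “今天*的气温是*度到*度”.
The template “今天*的气温是*度到*度” can then be used to extract the target information, namely the temperature range at the user's destination, while avoiding the influence of noise such as “欢迎来到北京” and “您本月的流量已经用完请及时充值”.
As yet another example, the original sentence group includes two sentences to be processed: “您本月的流量已经用完请及时充值” and “您本月的话费已经用完请及时充值” ("Your call credit for this month has been used up, please top up in time"). After processing by the method of the embodiments, the template of the original sentence group is “您本月的*已经用完请及时充值”.
The template “您本月的*已经用完请及时充值” can then be used to extract the target information, namely what the user needs to top up, while avoiding the influence of noise such as “欢迎来到北京” and “今天北京的气温是10度到20度”.
In step S18, the original sentence group is processed according to the wildcard and the final sentence group, so the template of the original sentence group includes wildcards. A wildcard matches generically and allows fuzzy searching of data. The template of the original sentence group can therefore be used later to search data, so that the data found satisfies the template. It can be understood that the template of the original sentence group may be a regular expression.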
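As a rough illustration of using a template as a regular expression, the sketch below turns each "*" into a capturing group and pulls the target information out of a sentence. The helper names `template_to_regex` and `extract` are hypothetical, and the choice that a wildcard must match at least one character is an assumption; the text does not say whether a wildcard may match nothing.

```python
import re


def template_to_regex(template, wildcard="*"):
    """Compile a wildcard template into a regex with one group per wildcard."""
    literal_parts = [re.escape(part) for part in template.split(wildcard)]
    # Assumption: each wildcard stands for one or more arbitrary characters.
    return re.compile("(.+?)".join(literal_parts))


def extract(template, sentence):
    """Return the text captured by each wildcard, or None if no match."""
    found = template_to_regex(template).fullmatch(sentence)
    return found.groups() if found else None


print(extract("欢迎来到*", "欢迎来到北京"))
print(extract("今天*的气温是*度到*度", "今天北京的气温是10度到20度"))
```

Sentences that do not fit the template, such as the noise examples above, simply return `None`.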
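Referring to FIG. 3, in some embodiments step S13 includes:
Step S131: determine the matching score between each character to be processed in each sentence to be processed and each character to be processed in the other sentences to be processed;
Step S132: build the matching matrix according to the matching scores.
In some embodiments, the processor 101 is configured to determine the matching score between each character to be processed in each sentence to be processed and each character to be processed in the other sentences to be processed, and to determine the matching matrix according to the matching scores.
Building the matching matrix from the matching scores of the characters quantifies the degree of matching at the character level, which makes the construction of the matching matrix more fine-grained, efficient, and accurate.
Specifically, in step S131, for each sentence to be processed, the matching scores between each of its characters and each character of the other sentences can be determined one after another in the order of its characters. This keeps the determination of matching scores orderly and avoids errors caused by a disorderly process.
For example, there are two sentences to be processed: “请向北” ("Please head north") and “请向南” ("Please head south"). First determine the matching score between “请” in “请向北” and “请” in “请向南”; then between “请” in “请向北” and “向” in “请向南”; then between “请” in “请向北” and “南” in “请向南”.
Next, determine the matching score between “向” in “请向北” and “请” in “请向南”; then between “向” in “请向北” and “向” in “请向南”; then between “向” in “请向北” and “南” in “请向南”.
Finally, determine the matching score between “北” in “请向北” and “请” in “请向南”; then between “北” in “请向北” and “向” in “请向南”; then between “北” in “请向北” and “南” in “请向南”.
In step S132, the matching matrix is built according to the matching scores: the matching scores can be used directly as the matrix values, or the matrix values can be calculated from the matching scores. The specific way of building the matching matrix from the matching scores is not limited here.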
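Referring to FIG. 4, in some embodiments the multiple sentences to be processed include a first sentence and a second sentence, and step S131 includes:
Step S1311: when the first current character of the first sentence matches the second current character of the second sentence, take a first preset score as the matching score of the first current character and the second current character;
Step S1312: when the first current character does not match the second current character, take a second preset score as the matching score of the first current character and the second current character, the second preset score being smaller than the first preset score.
In some embodiments, the multiple sentences to be processed include a first sentence and a second sentence, and the processor 101 is configured to take the first preset score as the matching score when the first current character of the first sentence matches the second current character of the second sentence, and to take the second preset score, which is smaller than the first preset score, as the matching score when they do not match.
Determining the matching score with just a first preset score and a second preset score keeps the number of distinct score values small, reduces computational complexity, and helps shorten the execution time of the method.
Note that the first current character matching the second current character may mean that the two are at least one of identical, similar, or corresponding. The specific matching criterion can be determined from information input by the user and is not limited here. The criterion can thus be set flexibly, which improves the applicability of the method.
For ease of explanation, the following takes matching to mean that the first current character and the second current character are identical.
In this embodiment, the first sentence is A = a_1 a_2 a_3 … a_n and the second sentence is B = b_1 b_2 b_3 … b_m, where n and m are the lengths of the first sentence and the second sentence respectively. The matching score can be determined by the following formula:
$$S(a_i, b_j) = \begin{cases} +3, & a_i = b_j \\ -3, & a_i \neq b_j \end{cases} \qquad \text{formula (1)}$$
where i = 1, 2, 3, …, n and j = 1, 2, 3, …, m; +3 is the first preset score; −3 is the second preset score; a_i is the first current character; b_j is the second current character; and S(a_i, b_j) is the matching score of the first current character and the second current character.
That is, when a_i = b_j, meaning the first current character and the second current character are identical, their matching score is +3, the first preset score; when a_i ≠ b_j, meaning they are not identical, their matching score is −3, the second preset score.
It can be understood that in other examples the first preset score may be +1, +2, +4, or another value, and the second preset score may be −1, −2, −5, or another value; the two may or may not be opposites of each other. The specific values of the first and second preset scores and the relationship between them are not limited here.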
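Referring to FIG. 5, in some embodiments the multiple sentences to be processed include a first sentence and a second sentence, the first sentence includes a first current character, the second sentence includes a second current character, the matching scores include the current matching score of the first current character and the second current character, the matching matrix includes a current position, and step S132 includes:
Step S1320: initialize the first row and the first column of the matching matrix with a preset initial value;
Step S1321: determine a first candidate value of the current position according to the current matching score and the matrix value at the upper-left position of the current position;
Step S1322: subtract a first penalty value from the matrix value at each position above the current position to obtain each upper penalized value, and take the maximum of the upper penalized values as a second candidate value of the current position;
Step S1323: subtract a second penalty value from the matrix value at each position to the left of the current position to obtain each left penalized value, and take the maximum of the left penalized values as a third candidate value of the current position;
Step S1324: take the maximum of the first candidate value, the second candidate value, the third candidate value, and the initial value as the matrix value of the current position.
In some embodiments, with the first sentence, second sentence, current characters, current matching score, and current position defined as above, the processor 101 is configured to: initialize the first row and the first column of the matching matrix with the preset initial value; determine the first candidate value of the current position according to the current matching score and the matrix value at the upper-left position; subtract the first penalty value from the matrix value at each position above the current position and take the maximum result as the second candidate value; subtract the second penalty value from the matrix value at each position to the left of the current position and take the maximum result as the third candidate value; and take the maximum of the first candidate value, the second candidate value, the third candidate value, and the initial value as the matrix value of the current position.
Determining the first candidate value from the current matching score and a matrix value, and the second and third candidate values from matrix values and penalty values, and then determining the matrix value of the current position from the three candidate values and the initial value, makes it possible to build the matching matrix.
Moreover, because the current matrix value depends on the matrix values at other positions, it reflects the degree of matching between the string running from the first character of the first sentence to the first current character and the string running from the first character of the second sentence to the second current character. This avoids isolated character-to-character matching and lets the matrix values measure whether substrings match, which makes the determination of whether the first and second sentences contain a matching part more accurate.
Specifically, in step S1320 the preset initial value may be −3, −2, −1, 0, +1, +2, +3, or another value, and is not limited here. In this embodiment the initial value is 0, which reduces the complexity of later calculations and shortens the execution time of the method.
In step S1321, in this embodiment, the sum of the current matching score and the matrix value at the upper-left position can be taken as the first candidate value, so that the matrix value of the current position is related to the matrix value at the upper-left position.
It can be understood that in some other embodiments the product of the current matching score and the matrix value at the upper-left position may be taken as the first candidate value; in still other embodiments the two may be substituted into a preset formula and the result taken as the first candidate value. The specific form of step S1321 is not limited here.
In step S1322, a corresponding first penalty value can be subtracted from the matrix value at each position above the current position to obtain each upper penalized value. For example, a first penalty sub-value is subtracted from the matrix value at the first position above the current position, a second penalty sub-value from the matrix value at the second position above, and a third penalty sub-value from the matrix value at the third position above. The matrix values above the current position can thus be penalized to different degrees, which makes the penalty more flexible.
Similarly, in step S1323, a corresponding second penalty value can be subtracted from the matrix value at each position to the left of the current position to obtain each left penalized value. For example, a first penalty sub-value is subtracted from the matrix value at the first position to the left, a second penalty sub-value from the matrix value at the second position to the left, and a third penalty sub-value from the matrix value at the third position to the left. The matrix values to the left of the current position can thus be penalized to different degrees, which makes the penalty more flexible.
Of course, the same first penalty value can also be subtracted from the matrix value at every position above the current position, and the same second penalty value from the matrix value at every position to the left.
The first penalty value may be −3, −2, −1, 0, +1, +2, +3, or another value, and so may the second penalty value. The first and second penalty values may be the same or different and are not limited here.
In this embodiment the first penalty value and the second penalty value are the same, which reduces the complexity of later calculations and shortens the execution time of the method.
Referring to FIG. 6, the first row and the first column of the matching matrix H can be initialized with the preset initial value, namely 0. That is, H_{k,0} = H_{0,l} = 0, where 0 ≤ k ≤ n and 0 ≤ l ≤ m.
The matrix value H_{i,j} of the current position of the matching matrix H is then determined by the following formula:
$$H_{i,j} = \max\left\{\, H_{i-1,j-1} + S(a_i, b_j),\ \max_{k \ge 1}\{H_{i-k,j} - W_k\},\ \max_{l \ge 1}\{H_{i,j-l} - W_l\},\ 0 \,\right\} \qquad \text{formula (2)}$$
where H_{i,j} is the matrix value of the current position of the matching matrix H; H_{i-1,j-1} is the matrix value at the upper-left position of the current position; S(a_i, b_j) is the current matching score; H_{i-k,j} is the matrix value at each position above the current position; W_k is the first penalty value; k ranges from 1 to i; H_{i,j-l} is the matrix value at each position to the left of the current position; l ranges from 1 to j; W_l is the second penalty value; and 0 is the initial value.
In other words, the term H_{i-1,j-1} + S(a_i, b_j) of formula (2) gives the first candidate value; the term max_{k≥1}{H_{i-k,j} − W_k} gives the second candidate value; the term max_{l≥1}{H_{i,j-l} − W_l} gives the third candidate value; and the 0 in formula (2) is the initial value. The maximum of the first candidate value, the second candidate value, the third candidate value, and the initial value is then taken as the matrix value of the current position.
It can be understood that, because the matrix value of the current position depends on the matrix value at its upper-left position, the matrix values at all positions above it, and the matrix values at all positions to its left, the matching matrix H is filled from left to right and from top to bottom.
It can be understood that, because the first row and first column of the matching matrix H are initialized with 0, H has n + 1 rows and m + 1 columns. H_{i,j} = 0 indicates that a_1 a_2 a_3 … a_i and b_1 b_2 b_3 … b_j have no similarity.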
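Formulas (1) and (2) can be sketched in a few lines. The sketch below assumes scores of +3 and −3 as in formula (1), an initial value of 0, and, as an assumption that is not stated in the text, a linear penalty W_k = k · gap with gap = 2 used for both the first and second penalty values. The function name `build_matrix` is hypothetical.

```python
def build_matrix(a, b, match=3, mismatch=-3, gap=2):
    """Fill the (n+1) x (m+1) matching matrix H according to formula (2)."""
    n, m = len(a), len(b)
    # Step S1320: first row and first column hold the initial value 0.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # top to bottom
        for j in range(1, m + 1):      # left to right
            score = match if a[i - 1] == b[j - 1] else mismatch  # formula (1)
            first = H[i - 1][j - 1] + score                      # S1321
            # Assumed linear penalty: W_k = k * gap.
            second = max(H[i - k][j] - k * gap for k in range(1, i + 1))  # S1322
            third = max(H[i][j - l] - l * gap for l in range(1, j + 1))   # S1323
            H[i][j] = max(first, second, third, 0)               # S1324
    return H


for row in build_matrix("请向北", "请向南"):
    print(row)
```

With these assumed parameters, the largest value for “请向北” and “请向南” sits at row 2, column 2, which corresponds to the shared prefix “请向”. Two sentences with no common character, such as “上海” and “深圳”, give a matrix of all zeros.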
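Referring to FIG. 7, in some embodiments step S14 includes:
Step S141: when the matrix values of the matching matrix are all equal to the preset initial value, determine that no matching part exists in the multiple sentences to be processed;
Step S142: when the matrix values of the matching matrix are not all equal to the preset initial value, determine the maximum matrix value of the matching matrix;
Step S143: backtrack through the matching matrix from the maximum matrix value to determine the matching part in the multiple sentences to be processed.
In some embodiments, the processor 101 is configured to determine that no matching part exists in the multiple sentences to be processed when the matrix values of the matching matrix are all equal to the preset initial value; to determine the maximum matrix value of the matching matrix when the matrix values are not all equal to the preset initial value; and to backtrack through the matching matrix from the maximum matrix value to determine the matching part in the multiple sentences to be processed.
Determining from the matching matrix whether a matching part exists is thus simple, which helps improve the efficiency and accuracy of template extraction. "The matrix values of the matching matrix are not all equal to the preset initial value" means that at least one matrix value differs from the initial value. In one example, the preset initial value is 0 and every matrix value is 0, so it is determined that no matching part exists. In another example, the preset initial value is 0 and the matrix contains the value 3 in addition to 0, so it can be determined that a matching part exists.
Specifically, in steps S142 and S143, when the matching matrix has a single maximum matrix value, the matrix can be backtracked from that value to determine one matching part in the multiple sentences to be processed. When the matching matrix has multiple maximum matrix values, the matrix can be backtracked from each of them to determine multiple matching parts. There is no limitation here. For ease of explanation, the following assumes a single maximum matrix value.
Specifically, in step S143 the matching matrix can be backtracked according to the maximum matrix value and formula (2) above to determine the matching part in the multiple sentences to be processed. In this way the matching part can be determined accurately and efficiently.
It can be understood that, according to formula (2), the matrix value H_{i,j} of the current position is the maximum of the first candidate value, the second candidate value, the third candidate value, and the initial value. When the matrix values are not all equal to the preset initial value, the maximum matrix value cannot be the initial value.
Therefore the maximum matrix value must be related to one of the matrix value at its upper-left position, a matrix value above it, or a matrix value to its left. It is thus possible to trace back from the maximum matrix value to the matrix value it is related to, that is, to the first related value.
Then the backtracking continues from the first related value in the same way, to the matrix value related to it, that is, the second related value; and from the second related value to the matrix value related to it, the third related value; and so on, until the value reached is the initial value and backtracking cannot continue. In this way a chain of related matrix values is obtained.
Each matrix value corresponds to one character of the first sentence and one character of the second sentence. The chain of related matrix values obtained by backtracking therefore determines a first substring of the first sentence and a second substring of the second sentence, and the first substring and the second substring are the matching part.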
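The backtracking of step S143 can be sketched as below. The matrix `H` is hard-coded for the example sentences “请向北” and “请向南”; its values are what formula (2) yields under assumed parameters (scores +3/−3, initial value 0, and a linear penalty of 2 per skipped position), none of which are fixed by the text. The function names are hypothetical, and the sketch follows only the first maximum it finds.

```python
def predecessor(H, a, b, i, j, match, mismatch, gap):
    """Find the cell that H[i][j] was derived from under formula (2)."""
    score = match if a[i - 1] == b[j - 1] else mismatch
    if H[i][j] == H[i - 1][j - 1] + score:   # came from the upper-left cell
        return i - 1, j - 1
    for k in range(1, i + 1):                # came from a cell above
        if H[i][j] == H[i - k][j] - k * gap:
            return i - k, j
    for l in range(1, j + 1):                # came from a cell to the left
        if H[i][j] == H[i][j - l] - l * gap:
            return i, j - l
    raise ValueError("matrix is inconsistent with the scoring parameters")


def backtrack(H, a, b, match=3, mismatch=-3, gap=2):
    """Return the matching substrings of a and b, or None if H is all zeros."""
    best, end_i, end_j = 0, 0, 0
    for i, row in enumerate(H):              # step S142: locate the maximum
        for j, value in enumerate(row):
            if value > best:
                best, end_i, end_j = value, i, j
    if best == 0:                            # step S141: no matching part
        return None
    i, j = end_i, end_j
    while H[i][j] != 0:                      # step S143: follow related values
        i, j = predecessor(H, a, b, i, j, match, mismatch, gap)
    return a[i:end_i], b[j:end_j]


H = [[0, 0, 0, 0], [0, 3, 1, 0], [0, 1, 6, 4], [0, 0, 4, 3]]
print(backtrack(H, "请向北", "请向南"))
```

Here the chain of related values is 6, then 3, then 0, which maps to the characters “向” and “请” in both sentences.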
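Referring to FIG. 8, in some embodiments the original sentence group includes an original sentence, the final sentence group includes the final sentence corresponding to the original sentence, and step S18 includes:
Step S181: determine the differing characters between the original sentence and the final sentence;
Step S182: use a wildcard to connect differing characters that are not adjacent in the original sentence.
In some embodiments, the processor 101 is configured to determine the differing characters between the original sentence and the final sentence, and to use a wildcard to connect differing characters that are not adjacent in the original sentence.
In this way the original sentence group can be processed according to the wildcard and the final sentence group to obtain the template of the original sentence group. Specifically, wildcards include, but are not limited to, at least one of “*”, “?”, “-”, “+”, and “/”. There is no limitation here.
It can be understood that the original sentence is an unprocessed sentence of the original sentence group, and the final sentence is a sentence in which no matching part remains. The differing characters between the original sentence and the final sentence are therefore the characters of the matching parts of the original sentence.
If two consecutive differing characters are adjacent in the original sentence, there are no other characters between them in the original sentence. If two consecutive differing characters are not adjacent in the original sentence, other characters lie between them, and those characters are the non-matching part of the original sentence group. A wildcard can therefore connect differing characters that are not adjacent in the original sentence, to indicate that a non-matching part lies between them.
In step S182, using a wildcard to connect differing characters that are not adjacent in the original sentence means replacing the content between those characters with a predetermined number of wildcards. In this embodiment the content between differing characters that are not adjacent in the original sentence is replaced with a single wildcard.
This guarantees that one original sentence group yields one template. The content between non-adjacent differing characters, that is, the non-matching part, may have different lengths in different original sentences; if that content were replaced with wildcards character by character, one original sentence group could easily yield several templates.
For example, the original sentence group includes two original sentences: “我爱你” ("I love you") and “我讨厌你” ("I hate you"). After the matching parts are removed, two final sentences are obtained: “爱” and “讨厌”. The original sentence “我爱你” corresponds to the final sentence “爱”, and the differing characters are “我” and “你”, which are not adjacent in “我爱你”. If the wildcard “*” replaced the content between “我” and “你” character by character, the resulting template would be “我*你”. The original sentence “我讨厌你” corresponds to the final sentence “讨厌”, and the differing characters are “我” and “你”, which are not adjacent in “我讨厌你”. Replacing the content between them character by character would give “我**你”. One original sentence group would then yield two templates.
If instead the content between non-adjacent differing characters is replaced with a single wildcard, the template obtained from the original sentence “我爱你” and its final sentence “爱” is “我*你”, and the template obtained from the original sentence “我讨厌你” and its final sentence “讨厌” is also “我*你”. One original sentence group thus yields one template.
Of course, in other embodiments the characters that the original sentence and the final sentence have in common can be determined; wildcards replace those common characters to obtain a template to be processed; and each run of consecutive wildcards in the template to be processed is reduced to one to obtain the template of the original sentence group. This also processes the original sentence group according to the wildcard and the final sentence group to obtain its template. The specific way of obtaining the template of the original sentence group is not limited here.
In addition, step S18 may also include: when the first character of the template differs from the first character of the original sentence, adding a wildcard before the first character of the template to update the template; and when the last character of the template differs from the last character of the original sentence, adding a wildcard after the last character of the template to update the template. This makes the extracted template of the original sentence group more accurate.
For example, the original sentence group includes two original sentences: “他说我爱你啊” (roughly "He said I love you") and “她想我讨厌你吧” (roughly "She thinks I hate you"). After the matching parts are removed, two final sentences are obtained: “他说爱啊” and “她想讨厌吧”.
The original sentence “他说我爱你啊” corresponds to the final sentence “他说爱啊”, and the differing characters are “我” and “你”. Replacing the content between non-adjacent differing characters with a single wildcard gives the template “我*你”.
The first character “我” of the template “我*你” differs from the first character “他” of the original sentence “他说我爱你啊”, so a wildcard is added before the first character of the template, giving “*我*你”. The last character “你” of the template differs from the last character “啊” of the original sentence, so a wildcard is added after the last character of the template, giving “*我*你*”.
Similarly, the original sentence “她想我讨厌你吧” corresponds to the final sentence “她想讨厌吧”, and the differing characters are “我” and “你”. Replacing the content between non-adjacent differing characters with a single wildcard gives the template “我*你”.
The first character “我” of the template differs from the first character “她” of the original sentence “她想我讨厌你吧”, so a wildcard is added before the first character of the template, giving “*我*你”. The last character “你” of the template differs from the last character “吧” of the original sentence, so a wildcard is added after the last character, giving “*我*你*”.
The template of the original sentence group is therefore “*我*你*”.
Note that the above are only examples and do not limit how the original sentence group is processed according to the wildcard and the final sentence group to obtain its template.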
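One possible way to turn an original sentence and its final sentence into a template is sketched below. It follows the alternative embodiment described above: characters shared with the final sentence become wildcards, and each run of wildcards collapses to one, which also produces the leading and trailing wildcards. The function name `build_template` is hypothetical, and the greedy left-to-right alignment of the final sentence against the original is an assumption; the text does not say how positions are resolved when a character occurs more than once.

```python
def build_template(original, final, wildcard="*"):
    """Build a template from an original sentence and its final sentence."""
    pieces = []
    j = 0  # index of the next unconsumed character of the final sentence
    for ch in original:
        if j < len(final) and ch == final[j]:
            # Character survives in the final sentence: non-matching part.
            j += 1
            if not pieces or pieces[-1] != wildcard:
                pieces.append(wildcard)  # one wildcard per run
        else:
            # Character was removed as part of a matching part: keep it.
            pieces.append(ch)
    if j != len(final):
        raise ValueError("final sentence is not a subsequence of the original")
    return "".join(pieces)


print(build_template("我爱你", "爱"))
print(build_template("我讨厌你", "讨厌"))
print(build_template("他说我爱你啊", "他说爱啊"))
```

On the examples in the text, both sentences of each group produce the same template, so each original sentence group ends up with a single template.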
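Referring to FIG. 9, in some embodiments the text template extraction method includes:
Step S11: group the sentences in the text data set to obtain multiple original sentence groups;
Step S19: after the templates of all original sentence groups are obtained, select the templates of the text data set from the templates of all original sentence groups.
In some embodiments, the processor 101 is configured to group the sentences in the text data set to obtain multiple original sentence groups, and, after the templates of all original sentence groups are obtained, to select the templates of the text data set from them.
In this way the templates of the text data set are extracted with high efficiency and accuracy. Because the templates of the text data set are drawn from the templates of all original sentence groups, no group's template is omitted, which improves the accuracy of the templates of the text data set.
Specifically, in step S11 the sentences in the text data set can be grouped two to a group. This simplifies the comparison and helps shorten the execution time of the method.
It can be understood that steps S12, S13, S14, S17, and S18 can be executed for each original sentence group to obtain the template of that group, so that the templates of the text data set can be selected from the templates of all original sentence groups.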
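Referring to FIG. 10, in some embodiments step S19 includes:
Step S191: de-duplicate the templates of all original sentence groups to obtain multiple candidate templates;
Step S192: determine the template score of each candidate template;
Step S193: select the templates of the text data set from the multiple candidate templates according to the template scores.
In some embodiments, the processor 101 is configured to de-duplicate the templates of all original sentence groups to obtain multiple candidate templates; to determine the template score of each candidate template; and to select the templates of the text data set from the multiple candidate templates according to the template scores.
In this way the templates of the text data set can be selected quickly and accurately from the templates of all original sentence groups. The templates of different original sentence groups may be identical; de-duplicating them makes the candidate templates distinct, which avoids processing the same template repeatedly, saves computing resources, and improves the efficiency of selection. In addition, selecting by template score quantifies the selection criterion, which improves the accuracy of selection.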
请参阅图11,在某些实施方式中,步骤S192包括:
步骤S1921:确定每个待选模板的重复次数;
步骤S1922:将每个待选模板与文本数据集中的每个语句进行匹配,以确定每个待选模板在文本数据集中进行匹配的成功次数;
步骤S1923:确定每个待选模板中非通配符的字符的数量;
步骤S1924:根据重复次数、成功次数和数量中的至少一个,确定每个待选模板的模板分值。
在某些实施方式中,处理器101用于确定每个待选模板的重复次数;及用于将每个待选模板与文本数据集中的每个语句进行匹配,以确定每个待选模板在文本数据集中进行匹配的成功次数;及用于确定每个待选模板中非通配符的字符的数量;以及用于根据重复次数、成功次数和数量中的至少一个,确定每个待选模板的模板分值。
如此,可以快速、准确地实现确定每个待选模板的模板分值。可以理解,在步骤S1921中,确定每个待选模板的重复次数,可与步骤S191对全部原始语句组的模板进行去重处理,同时进行。如此,在去重的过程中确定每个待选模板的重复次数,可以从整体上缩短方法的执行时长。
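In step S1922, matching each candidate template against every sentence in the text data set means performing a full match of each candidate template over the text data set. In other words, for each candidate template, when the candidate template matches a sentence in the text data set, the success count of that candidate template is incremented by 1; when it does not match a sentence, the success count is left unchanged. In this way, the success count of each candidate template can be determined.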
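The full match in step S1922 can be sketched by compiling each candidate template into a regular expression. The text does not define wildcard semantics; treating the wildcard as "any run of zero or more characters" is an assumption, and `template_to_regex` and `success_count` are hypothetical names.

```python
import re


def template_to_regex(template: str, wildcard: str = "*") -> re.Pattern:
    """Compile a template; each wildcard matches zero or more characters (assumed)."""
    literal_parts = (re.escape(part) for part in template.split(wildcard))
    return re.compile(".*".join(literal_parts), re.DOTALL)


def success_count(template: str, sentences: list[str]) -> int:
    """Count sentences in the data set that the template matches in full."""
    pattern = template_to_regex(template)
    return sum(1 for sentence in sentences if pattern.fullmatch(sentence))


data = ["他说我爱你啊", "她想我讨厌你吧", "今天天气好"]
print(success_count("*我*你*", data))  # 2
print(success_count("*", data))        # 3
```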
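In step S1924, the template score of each candidate template may be determined according to one, two, or all of the repetition count, the success count, and the number of non-wildcard characters. In this embodiment, all three are used. Basing the template score on these three dimensions lets it quantify more accurately how each candidate template behaves in the text data set, so that the templates of the text data set selected on the basis of the template scores are more accurate.
Specifically, in this embodiment, the following formula may be used for each candidate template to determine the template score from the repetition count, the success count, and the number of non-wildcard characters:
S = I · log m · log n    Formula (3)
where S is the template score of the candidate template, I is the number of non-wildcard characters in the candidate template, m is the number of successful matches of the candidate template in the text data set, and n is the repetition count of the candidate template.
This lets the templates of the text data set cover as much of the text data set as possible while losing as little information as possible. For example, the candidate template “*” matches all data but loses all of the information in the candidate template itself. As another example, the candidate template “AAAA” preserves all of the information but matches only one piece of data. Using formula (3) to determine the template score as a measure of each candidate template's effectiveness, and selecting the templates of the text data set from the candidate templates on that basis, yields better template extraction.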
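Formula (3) is small enough to check directly. In the sketch below the function name `template_score` is hypothetical, the natural logarithm is an assumption (the text does not give a base), and returning 0 for counts below 1 is a guard added for illustration.

```python
import math


def template_score(template: str, success_count: int, repeat_count: int,
                   wildcard: str = "*") -> float:
    """Compute S = I * log(m) * log(n) from formula (3)."""
    literal_chars = sum(1 for ch in template if ch != wildcard)  # I
    if success_count < 1 or repeat_count < 1:
        return 0.0  # log is undefined here; treating it as 0 is an assumption
    return literal_chars * math.log(success_count) * math.log(repeat_count)


print(template_score("*", 1000, 50))  # 0.0, no literal characters
print(template_score("AAAA", 1, 1))   # 0.0, matches only one sentence
```

Both degenerate cases from the text score zero: “*” has I = 0, and “AAAA” has log 1 = 0. Note that under this formula any candidate that occurs only once (n = 1) or matches only one sentence (m = 1) also scores zero.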
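It can be understood that the sum of the repetition count, the success count, and the number of non-wildcard characters may also be used as the template score, as may their product. The specific manner of determining the template score is not limited here.
It can be understood that in some other embodiments the template score of each candidate template may be determined from the repetition count alone, and in still other embodiments from the repetition count and the success count. No limitation is imposed here.
Referring to FIG. 12, in some embodiments, step S193 includes:
Step S1931: sorting the plurality of candidate templates in descending order of template score to obtain a sequence number for each candidate template;
Step S1932: taking the candidate templates whose sequence numbers are smaller than a preset sequence number as the templates of the text data set.
In some embodiments, the processor 101 is configured to sort the plurality of candidate templates in descending order of template score to obtain a sequence number for each candidate template, and to take the candidate templates whose sequence numbers are smaller than a preset sequence number as the templates of the text data set.
In this way, sorting makes it efficient to select the templates of the text data set from the candidate templates according to the template scores. The templates selected in this way are the first preset number of candidate templates in descending order of template score.
Specifically, the preset number may be determined from input information, so that a user can adjust the number of templates of the text data set as needed.
The preset number may also be determined from the number of candidate templates, for example as a predetermined proportion of the number of candidate templates. It may also be determined from the number of sentences in the text data set, for example as a predetermined proportion of that number. The specific manner of determining the preset number is not limited here.
For example, the preset number is 2 and there are 5 candidate templates: “欢迎来到*”, “今天*的气温是*度到*度”, “您本月的*已经用完请及时充值”, “今天*的天气*”, and “*欢迎您”. The template score of “欢迎来到*” is 3, that of “今天*的气温是*度到*度” is 5, that of “您本月的*已经用完请及时充值” is 10, that of “今天*的天气*” is 7, and that of “*欢迎您” is 4. Sorting the 5 candidate templates in descending order of template score gives: “您本月的*已经用完请及时充值”, “今天*的天气*”, “今天*的气温是*度到*度”, “*欢迎您”, “欢迎来到*”. The templates selected for the text data set are therefore “您本月的*已经用完请及时充值” and “今天*的天气*”.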
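The selection in this example amounts to a descending sort followed by taking the first entries. A minimal sketch, with the hypothetical function name `select_templates` and the scores copied from the example:

```python
def select_templates(scores: dict[str, float], preset_count: int) -> list[str]:
    """Sort candidates by score, highest first, and keep the top `preset_count`."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [template for template, _ in ranked[:preset_count]]


scores = {
    "欢迎来到*": 3,
    "今天*的气温是*度到*度": 5,
    "您本月的*已经用完请及时充值": 10,
    "今天*的天气*": 7,
    "*欢迎您": 4,
}
print(select_templates(scores, 2))
# ['您本月的*已经用完请及时充值', '今天*的天气*']
```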
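An embodiment of the present application provides a non-volatile computer-readable storage medium containing computer-executable instructions which, when executed by one or more processors 101, cause the processors 101 to perform the text template extraction method described above.
For example, the following is performed: step S12: obtaining an original sentence group from the text data set, the original sentence group including a plurality of sentences to be processed; step S13: establishing a matching matrix of the plurality of sentences to be processed; step S14: determining, according to the matching matrix, whether a matching part exists in the plurality of sentences to be processed; step S17: when it is determined according to the matching matrix that a matching part exists in the plurality of sentences to be processed, removing the matching part from each sentence to be processed to update the plurality of sentences to be processed, and returning to the step of determining the matching matrix of the plurality of sentences to be processed; step S18: when it is determined according to the matching matrix that no matching part exists in the plurality of sentences to be processed, taking the current plurality of sentences to be processed as a final sentence group, and processing the original sentence group according to a wildcard and the final sentence group to obtain the template of the original sentence group.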
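The loop formed by steps S13, S14, and S17 can be sketched for a two-sentence group with a textbook Smith-Waterman matrix. This is a simplified illustration under stated assumptions, not the claimed procedure: the scores (match 1, mismatch -1, gap penalty 1) are arbitrary, only the adjacent upper and left cells are considered for gaps, the first maximal cell is used for traceback, and only characters aligned on a matching diagonal step are treated as the "matching part". All function names are hypothetical.

```python
def build_matrix(a: str, b: str, match=1, mismatch=-1, gap=1) -> list[list[int]]:
    """Step S13: fill a local-alignment matrix; row 0 and column 0 stay 0."""
    h = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            h[i][j] = max(0, h[i - 1][j - 1] + s, h[i - 1][j] - gap, h[i][j - 1] - gap)
    return h


def matched_positions(a, b, h, match=1, mismatch=-1, gap=1):
    """Step S14: trace back from the maximum; None means no matching part."""
    best = max(max(row) for row in h)
    if best == 0:
        return None
    i, j = next((i, j) for i, row in enumerate(h) for j, v in enumerate(row) if v == best)
    pos_a, pos_b = set(), set()
    while i > 0 and j > 0 and h[i][j] > 0:
        same = a[i - 1] == b[j - 1]
        s = match if same else mismatch
        if h[i][j] == h[i - 1][j - 1] + s:
            if same:  # only matching characters belong to the matching part
                pos_a.add(i - 1)
                pos_b.add(j - 1)
            i, j = i - 1, j - 1
        elif h[i][j] == h[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    return pos_a, pos_b


def strip_matches(a: str, b: str) -> tuple[str, str]:
    """Steps S13 to S17 repeated until no matching part remains."""
    while True:
        found = matched_positions(a, b, build_matrix(a, b))
        if found is None:
            return a, b  # the final sentence group used by step S18
        pos_a, pos_b = found
        a = "".join(ch for k, ch in enumerate(a) if k not in pos_a)
        b = "".join(ch for k, ch in enumerate(b) if k not in pos_b)


print(strip_matches("他说我爱你啊", "她想我讨厌你吧"))  # ('他说爱啊', '她想讨厌吧')
```

On the example sentences the loop removes “我” in the first pass and “你” in the second, then stops because the third matrix is all zeros.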
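The storage medium of the embodiments of the present application obtains an original sentence group from the text data set, the original sentence group including a plurality of sentences to be processed, and then performs extraction directly on those sentences to obtain the template of the original sentence group. The sentences do not need to be annotated or encoded before extraction, which avoids the errors that annotation and encoding introduce, reduces manual intervention, and helps improve the efficiency and accuracy of template extraction.
In summary, the present invention proposes a text template extraction method based on the Smith-Waterman algorithm (SW algorithm). The method processes text at the character level to extract templates, which avoids the extra overhead of word segmentation and any errors that segmentation may introduce. It also minimizes manual intervention, greatly reducing the subjective errors and cost of manual work, so that the results are as objective and efficient as possible. The SW algorithm is an algorithm from bioinformatics for finding similar regions between two nucleotide or protein sequences. The present method applies the SW algorithm to text template extraction, improving both the efficiency and the quality of extraction.
Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware. The program may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), or the like.
The above embodiments represent only several implementations of the present application. Although they are described in specific detail, they should not be construed as limiting the scope of the patent. It should be noted that those of ordinary skill in the art may make various modifications and improvements without departing from the concept of the present application, all of which fall within its scope of protection. The scope of protection of this patent application is therefore defined by the appended claims.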

Claims (21)

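  1. A text template extraction method, characterized by comprising:
    obtaining an original sentence group from a text data set, the original sentence group including a plurality of sentences to be processed;
    establishing a matching matrix of the plurality of sentences to be processed;
    determining, according to the matching matrix, whether a matching part exists in the plurality of sentences to be processed;
    when it is determined according to the matching matrix that a matching part exists in the plurality of sentences to be processed, removing the matching part from each of the sentences to be processed to update the plurality of sentences to be processed, and returning to the step of determining the matching matrix of the plurality of sentences to be processed;
    when it is determined according to the matching matrix that no matching part exists in the plurality of sentences to be processed, taking the current plurality of sentences to be processed as a final sentence group, and processing the original sentence group according to a wildcard and the final sentence group to obtain a template of the original sentence group.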
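  2. The text template extraction method according to claim 1, characterized in that establishing the matching matrix of the plurality of sentences to be processed comprises:
    determining a matching score between each character to be processed in each sentence to be processed and each character to be processed in the other sentences to be processed;
    establishing the matching matrix according to the matching scores.
  3. The text template extraction method according to claim 2, characterized in that the plurality of sentences to be processed include a first sentence and a second sentence, and determining the matching score between each character to be processed in each sentence to be processed and each character to be processed in the other sentences to be processed comprises:
    when a first current character of the first sentence matches a second current character of the second sentence, taking a first preset score as the matching score of the first current character and the second current character;
    when the first current character does not match the second current character, taking a second preset score as the matching score of the first current character and the second current character, the second preset score being smaller than the first preset score.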
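  4. The text template extraction method according to claim 2, characterized in that the plurality of sentences to be processed include a first sentence and a second sentence, the first sentence includes a first current character, the second sentence includes a second current character, the matching score includes a current matching score of the first current character and the second current character, the matching matrix includes a current position, and establishing the matching matrix according to the matching scores comprises:
    initializing the first row and the first column of the matching matrix with a preset initial value;
    determining a first candidate value of the current position according to the current matching score and the matrix value at the upper-left position of the current position;
    subtracting a first penalty value from the matrix value at each position above the current position to obtain respective upper penalty values, and taking the maximum of the upper penalty values as a second candidate value of the current position;
    subtracting a second penalty value from the matrix value at each position to the left of the current position to obtain respective left penalty values, and taking the maximum of the left penalty values as a third candidate value of the current position;
    taking the maximum of the first candidate value, the second candidate value, the third candidate value, and the initial value as the matrix value of the current position.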
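  5. The text template extraction method according to claim 1, characterized in that determining, according to the matching matrix, whether a matching part exists in the plurality of sentences to be processed comprises:
    when the matrix values of the matching matrix are all equal to a preset initial value, determining that no matching part exists in the plurality of sentences to be processed;
    when the matrix values of the matching matrix are not all equal to the preset initial value, determining the maximum matrix value of the matching matrix;
    tracing back through the matching matrix from the maximum matrix value to determine the matching part in the plurality of sentences to be processed.
  6. The text template extraction method according to claim 1, characterized in that the original sentence group includes an original sentence, the final sentence group includes a final sentence corresponding to the original sentence, and processing the original sentence group according to the wildcard and the final sentence group comprises:
    determining the characters that differ between the original sentence and the final sentence;
    using the wildcard to connect those differing characters that are not adjacent in the original sentence.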
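  7. The text template extraction method according to claim 1, characterized in that the text template extraction method comprises:
    grouping the sentences in the text data set to obtain a plurality of original sentence groups;
    after the templates of all the original sentence groups are obtained, selecting the templates of the text data set from the templates of all the original sentence groups.
  8. The text template extraction method according to claim 7, characterized in that selecting the templates of the text data set from the templates of all the original sentence groups comprises:
    deduplicating the templates of all the original sentence groups to obtain a plurality of candidate templates;
    determining a template score for each of the candidate templates;
    selecting the templates of the text data set from the plurality of candidate templates according to the template scores.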
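  9. The text template extraction method according to claim 8, characterized in that determining the template score of each of the candidate templates comprises:
    determining the repetition count of each of the candidate templates;
    matching each of the candidate templates against every sentence in the text data set to determine the number of successful matches of each of the candidate templates in the text data set;
    determining the number of non-wildcard characters in each of the candidate templates;
    determining the template score of each of the candidate templates according to at least one of the repetition count, the success count, and the number.
  10. The text template extraction method according to claim 8, characterized in that selecting the templates of the text data set from the plurality of candidate templates according to the template scores comprises:
    sorting the plurality of candidate templates in descending order of template score to obtain a sequence number for each of the candidate templates;
    taking the candidate templates whose sequence numbers are smaller than a preset sequence number as the templates of the text data set.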
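  11. An electronic device, characterized by comprising a processor, the processor being configured to obtain an original sentence group from a text data set, the original sentence group including a plurality of sentences to be processed; to establish a matching matrix of the plurality of sentences to be processed; to determine, according to the matching matrix, whether a matching part exists in the plurality of sentences to be processed; when it is determined according to the matching matrix that a matching part exists in the plurality of sentences to be processed, to remove the matching part from each of the sentences to be processed to update the plurality of sentences to be processed, and to return to the step of determining the matching matrix of the plurality of sentences to be processed; and, when it is determined according to the matching matrix that no matching part exists in the plurality of sentences to be processed, to take the current plurality of sentences to be processed as a final sentence group and to process the original sentence group according to a wildcard and the final sentence group to obtain a template of the original sentence group.
  12. The electronic device according to claim 11, characterized in that the processor is configured to determine a matching score between each character to be processed in each sentence to be processed and each character to be processed in the other sentences to be processed, and to establish the matching matrix according to the matching scores.
  13. The electronic device according to claim 12, characterized in that the plurality of sentences to be processed include a first sentence and a second sentence, and the processor is configured, when a first current character of the first sentence matches a second current character of the second sentence, to take a first preset score as the matching score of the first current character and the second current character; and, when the first current character does not match the second current character, to take a second preset score as the matching score of the first current character and the second current character, the second preset score being smaller than the first preset score.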
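  14. The electronic device according to claim 12, characterized in that the plurality of sentences to be processed include a first sentence and a second sentence, the first sentence includes a first current character, the second sentence includes a second current character, the matching score includes a current matching score of the first current character and the second current character, the matching matrix includes a current position, and the processor is configured to initialize the first row and the first column of the matching matrix with a preset initial value; to determine a first candidate value of the current position according to the current matching score and the matrix value at the upper-left position of the current position; to subtract a first penalty value from the matrix value at each position above the current position to obtain respective upper penalty values, and to take the maximum of the upper penalty values as a second candidate value of the current position; to subtract a second penalty value from the matrix value at each position to the left of the current position to obtain respective left penalty values, and to take the maximum of the left penalty values as a third candidate value of the current position; and to take the maximum of the first candidate value, the second candidate value, the third candidate value, and the initial value as the matrix value of the current position.
  15. The electronic device according to claim 11, characterized in that the processor is configured, when the matrix values of the matching matrix are all equal to a preset initial value, to determine that no matching part exists in the plurality of sentences to be processed; when the matrix values of the matching matrix are not all equal to the preset initial value, to determine the maximum matrix value of the matching matrix; and to trace back through the matching matrix from the maximum matrix value to determine the matching part in the plurality of sentences to be processed.
  16. The electronic device according to claim 11, characterized in that the original sentence group includes an original sentence, the final sentence group includes a final sentence corresponding to the original sentence, and the processor is configured to determine the characters that differ between the original sentence and the final sentence, and to use the wildcard to connect those differing characters that are not adjacent in the original sentence.
  17. The electronic device according to claim 11, characterized in that the processor is configured to group the sentences in the text data set to obtain a plurality of original sentence groups, and, after the templates of all the original sentence groups are obtained, to select the templates of the text data set from the templates of all the original sentence groups.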
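  18. The electronic device according to claim 17, characterized in that the processor is configured to deduplicate the templates of all the original sentence groups to obtain a plurality of candidate templates; to determine a template score for each of the candidate templates; and to select the templates of the text data set from the plurality of candidate templates according to the template scores.
  19. The electronic device according to claim 18, characterized in that the processor is configured to determine the repetition count of each of the candidate templates; to match each of the candidate templates against every sentence in the text data set to determine the number of successful matches of each of the candidate templates in the text data set; to determine the number of non-wildcard characters in each of the candidate templates; and to determine the template score of each of the candidate templates according to at least one of the repetition count, the success count, and the number.
  20. The electronic device according to claim 18, characterized in that the processor is configured to sort the plurality of candidate templates in descending order of template score to obtain a sequence number for each of the candidate templates, and to take the candidate templates whose sequence numbers are smaller than a preset sequence number as the templates of the text data set.
  21. A non-volatile computer-readable storage medium containing computer-executable instructions, characterized in that, when the computer-executable instructions are executed by one or more processors, the processors are caused to perform the text template extraction method according to any one of claims 1 to 10.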
PCT/CN2020/092871 2020-05-28 2020-05-28 Text template extraction method, electronic device, and storage medium WO2021237562A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
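CN202080099874.0A CN115803748A (zh) 2020-05-28 2020-05-28 Text template extraction method, electronic device, and storage medium
PCT/CN2020/092871 WO2021237562A1 (zh) 2020-05-28 2020-05-28 Text template extraction method, electronic device, and storage medium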

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/092871 WO2021237562A1 (zh) 2020-05-28 2020-05-28 Text template extraction method, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2021237562A1 (zh) 2021-12-02

Family

ID=78745296

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/092871 WO2021237562A1 (zh) 2020-05-28 2020-05-28 Text template extraction method, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN115803748A (zh)
WO (1) WO2021237562A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115563964A (zh) * 2022-11-10 2023-01-03 北京泰迪熊移动科技有限公司 Method and apparatus for generating regular expressions for SMS text, electronic device, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101221558A (zh) * 2008-01-22 2008-07-16 安徽科大讯飞信息科技股份有限公司 Method for automatically extracting sentence templates
CN105589843A (zh) * 2014-10-24 2016-05-18 科大讯飞股份有限公司 Text string matching method and system
US20160306885A1 (en) * 2013-12-11 2016-10-20 Beijing Qihoo Technology Company Limited Method and apparatus for determining core word of image cluster description text
CN108427722A (zh) * 2018-02-09 2018-08-21 卫盈联信息技术(深圳)有限公司 Intelligent interaction method, electronic device, and storage medium
CN109670163A (zh) * 2017-10-17 2019-04-23 阿里巴巴集团控股有限公司 Information identification method, information recommendation method, template construction method, and computing device

Also Published As

Publication number Publication date
CN115803748A (zh) 2023-03-14

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20937926

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 170423)

122 Ep: pct application non-entry in european phase

Ref document number: 20937926

Country of ref document: EP

Kind code of ref document: A1