WO2021207939A1 - Sentence pattern mining method, device, electronic device, and storage medium - Google Patents

Sentence pattern mining method, device, electronic device, and storage medium

Info

Publication number
WO2021207939A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
general
patterns
sentence pattern
standard
Prior art date
Application number
PCT/CN2020/084769
Other languages
English (en)
French (fr)
Inventor
李森林
Original Assignee
深圳市欢太数字科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市欢太数字科技有限公司
Priority to PCT/CN2020/084769 (published as WO2021207939A1, zh)
Priority to CN202080094177.6A (published as CN115039105A, zh)
Publication of WO2021207939A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/253: Grammatical analysis; Style critique

Definitions

  • This application relates to the technical field of electronic devices, and more specifically to a sentence pattern mining method, device, electronic device, and storage medium.
  • This application proposes a sentence pattern mining method, device, electronic device, and storage medium to solve the above problems.
  • An embodiment of the present application provides a sentence pattern mining method. The method includes: obtaining a plurality of corpora to be mined; performing pairwise (double) sequence alignment on the plurality of corpora to be mined to obtain a plurality of general sentence patterns corresponding to the plurality of corpora to be mined; and filtering the plurality of general sentence patterns to select, from among them, the general sentence patterns that meet a specified standard as standard sentence patterns.
  • An embodiment of the present application provides a sentence pattern mining device. The device includes: a to-be-mined corpus acquisition module, used to obtain a plurality of corpora to be mined; a general sentence pattern acquisition module, used to perform pairwise sequence alignment on the plurality of corpora to be mined to obtain a plurality of general sentence patterns corresponding to them; and a standard sentence pattern acquisition module, used to filter the plurality of general sentence patterns and select, from among them, the general sentence patterns that meet a specified standard as standard sentence patterns.
  • An embodiment of the present application provides an electronic device including a memory and a processor. The memory is coupled to the processor and stores instructions; when the instructions are executed by the processor, the processor executes the above method.
  • An embodiment of the present application provides a computer-readable storage medium that stores program code, and the program code can be invoked by a processor to execute the above method.
  • The sentence pattern mining method, device, electronic device, and storage medium provided by the embodiments of the present application obtain multiple corpora to be mined, perform pairwise sequence alignment on them to obtain multiple general sentence patterns corresponding to the corpora, and filter the multiple general sentence patterns to select those that meet a specified standard as standard sentence patterns. General sentence patterns are thus obtained by pairwise sequence alignment of the corpora to be mined and are then filtered into standard sentence patterns, so that standard sentence patterns can be obtained quickly and conveniently from the corpora to be mined for processing.
  • FIG. 1 shows a schematic flowchart of a sentence pattern mining method provided by an embodiment of the present application.
  • FIG. 2 shows a schematic flowchart of a sentence pattern mining method provided by another embodiment of the present application.
  • FIG. 3 shows a schematic diagram of sentence pattern inclusion relationships among multiple general sentence patterns provided by an embodiment of the present application.
  • FIG. 4 shows a schematic flowchart of step S240 of the sentence pattern mining method shown in FIG. 2 of the present application.
  • FIG. 5 shows a schematic flowchart of a sentence pattern mining method provided by another embodiment of the present application.
  • FIG. 6 shows a schematic flowchart of step S330 of the sentence pattern mining method shown in FIG. 5 of the present application.
  • FIG. 7 shows a schematic flowchart of step S332 of the sentence pattern mining method shown in FIG. 6 of the present application.
  • FIG. 8 shows a schematic flowchart of a sentence pattern mining method provided by another embodiment of the present application.
  • FIG. 9 shows a schematic flowchart of step S440 of the sentence pattern mining method shown in FIG. 8 of the present application.
  • FIG. 10 shows a schematic flowchart of a sentence pattern mining method provided by another embodiment of the present application.
  • FIG. 11 shows a block diagram of a sentence pattern mining device provided by an embodiment of the present application.
  • FIG. 12 shows a block diagram of an electronic device used to execute the sentence pattern mining method according to an embodiment of the present application.
  • FIG. 13 shows a storage unit used to store or carry program code for implementing the sentence pattern mining method according to an embodiment of the present application.
  • One existing approach is based on large-scale language models: a large-scale language model (such as BERT) is trained on a large corpus to obtain embedded representations of the relevant fixed sentence patterns.
  • However, in short-text classification scenarios the domain category of some sentence patterns depends only on the entity part, for example "what is [entity]" or "who is [entity]". Because [entity] is highly diverse, this type of question cannot be classified.
  • In the sentence pattern mining method of this application, general sentence patterns are obtained by pairwise sequence alignment of the corpora to be mined and are then filtered to obtain standard sentence patterns, so that standard sentence patterns can be obtained quickly and conveniently from the corpora to be mined for processing. The specific sentence pattern mining method is described in detail in the subsequent embodiments.
  • FIG. 1 shows a schematic flowchart of a sentence pattern mining method provided by an embodiment of the present application.
  • The sentence pattern mining method obtains general sentence patterns by pairwise sequence alignment of the corpora to be mined and then filters the general sentence patterns to obtain standard sentence patterns, so that standard sentence patterns can be obtained quickly and conveniently from the corpora to be mined for processing.
  • The sentence pattern mining method is applied to the sentence pattern mining device 200 shown in FIG. 11 and to the electronic device 100 (FIG. 12) equipped with the sentence pattern mining device 200.
  • The electronic device used in this embodiment may be a mobile terminal, a tablet computer, a desktop computer, a wearable electronic device, and so on, which is not limited here.
  • The flow shown in FIG. 1 is described in detail below.
  • The sentence pattern mining method may specifically include the following steps:
  • Step S110: Obtain multiple corpora to be mined.
  • In this embodiment, multiple corpora to be mined can be obtained. They can be obtained from community question-and-answer (Q&A) data, from short texts, or partly from community Q&A data and partly from short texts, which is not limited here.
  • The multiple corpora to be mined can be obtained from a server, for example from community Q&A data or short texts recorded on the server, or from other electronic devices, for example from community Q&A data or short texts recorded on those devices. When the corpora are obtained from a server or another electronic device, they can be obtained over a wireless network or a data network.
  • For example, "country bird of the chestnut-breasted white-faced warbler" can be obtained from community Q&A data as a corpus to be mined, and "Which country is the city of Alvin" can likewise be obtained from community Q&A data as a corpus to be mined, which is not limited here.
  • Step S120: Perform pairwise sequence alignment on the multiple corpora to be mined to obtain multiple general sentence patterns corresponding to the multiple corpora to be mined.
  • In this embodiment, after the multiple corpora to be mined are obtained, pairwise alignment can be performed on them to obtain the multiple general sentence patterns corresponding to them.
  • Pairwise (double) sequence alignment is a research area in bioinformatics: a targeted, efficient algorithm is designed to compare two DNA or protein sequences, find the maximum-similarity match between them, and thereby judge whether they are homologous.
  • In this embodiment, a pairwise sequence alignment method is used to process the multiple corpora to be mined and obtain the maximum-similarity matching sentence patterns among them, that is, the multiple general sentence patterns corresponding to the multiple corpora to be mined. Introducing the pairwise sequence alignment algorithm from bioinformatics thus transfers it to sentence pattern learning. Sentence patterns can be matched in byte units, which avoids the errors that traditional word segmentation methods suffer from because of semantic segmentation mistakes and human spelling mistakes.
  • The multiple corpora to be mined may be compared in pairs by pairwise sequence alignment to obtain the multiple general sentence patterns corresponding to the multiple corpora to be mined.
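  • How a general sentence pattern can be read off a pair of corpora may be illustrated with a minimal Python sketch. It is not the method of this application: it uses the standard-library `difflib.SequenceMatcher` as a stand-in for pairwise sequence alignment, and the names `general_pattern`, `WILDCARD`, and `min_run` are assumptions introduced only for illustration. Text shared by both sentences is kept, and every stretch where they differ becomes the wildcard `(.+?)`.

```python
from difflib import SequenceMatcher

WILDCARD = "(.+?)"  # hypothetical placeholder for the variable (entity) part


def general_pattern(a: str, b: str, min_run: int = 2) -> str:
    """Keep the character runs shared by a and b; replace the rest by WILDCARD.

    Shared runs shorter than min_run are treated as coincidental and
    absorbed into the neighbouring wildcard (an illustrative assumption).
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    parts = []
    pos_a = pos_b = 0  # end of the last kept run in a and in b
    for block in matcher.get_matching_blocks():
        # The final block is a sentinel (len(a), len(b), 0); it is used
        # only to detect a trailing gap.
        is_sentinel = block.size == 0
        if not is_sentinel and block.size < min_run:
            continue  # too short to count as shared sentence structure
        if block.a > pos_a or block.b > pos_b:
            parts.append(WILDCARD)  # the sentences differ before this run
        if not is_sentinel:
            parts.append(a[block.a:block.a + block.size])
            pos_a = block.a + block.size
            pos_b = block.b + block.size
    return "".join(parts)


print(general_pattern("who is Alice", "who is Bob"))
print(general_pattern("Paris is a city of which country",
                      "Tokyo is a city of which country"))
```

  • The comparison works on characters rather than on segmented words, in the spirit of the byte-level matching described above. The global and local alignment algorithms named later in the description would replace `SequenceMatcher` in a closer implementation.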
  • Step S130: Filter the multiple general sentence patterns, and select from them the general sentence patterns that meet a specified standard as standard sentence patterns.
  • Pairwise sequence alignment of multiple corpora to be mined generally extracts a large number of general sentence patterns. A quantitative mechanism can therefore be used to mine sentence patterns that have both a certain concrete meaning and a certain generalization ability.
  • In this embodiment, after the multiple general sentence patterns are obtained, they can be filtered to select those that meet the specified standard as standard sentence patterns. A general sentence pattern that meets the specified standard is one that has a certain concrete meaning and a certain generalization ability. Quantified indicators are thus used to measure the degree of generalization and the concrete meaning of the standard sentence patterns, which makes the standard sentence patterns mined from the multiple corpora to be mined more accurate.
  • In some embodiments, general sentence pattern filtering rules may be preset and stored. After the multiple general sentence patterns corresponding to the multiple corpora to be mined are obtained, they may be filtered according to these rules to select the general sentence patterns that meet the specified standard as standard sentence patterns. As one approach, each general sentence pattern can be checked in turn against the filtering rules, and the general sentence patterns that meet the specified standard can be selected according to the results.
  • A general sentence pattern whose result indicates that it satisfies the filtering rules is determined to meet the specified standard, that is, it is determined to be a standard sentence pattern. A general sentence pattern whose result indicates that it does not satisfy the filtering rules is determined not to meet the specified standard, that is, it is determined to be a non-standard sentence pattern.
  • The sentence pattern mining method provided by this embodiment obtains multiple corpora to be mined, performs pairwise sequence alignment on them to obtain the corresponding multiple general sentence patterns, and filters the multiple general sentence patterns to select those that meet the specified standard as standard sentence patterns. General sentence patterns are thus obtained by pairwise sequence alignment of the corpora to be mined and are then filtered into standard sentence patterns, so that standard sentence patterns can be obtained quickly and conveniently from the corpora to be mined for processing.
  • FIG. 2 shows a schematic flowchart of a sentence pattern mining method provided by another embodiment of the present application.
  • The process shown in FIG. 2 is described in detail below.
  • The sentence pattern mining method may specifically include the following steps:
  • Step S210: Obtain multiple corpora to be mined.
  • Step S220: Perform pairwise sequence alignment on the multiple corpora to be mined to obtain multiple general sentence patterns corresponding to the multiple corpora to be mined.
  • For a detailed description of steps S210 to S220, refer to steps S110 to S120; it is not repeated here.
  • Step S230: Obtain the sentence pattern inclusion relationships among the multiple general sentence patterns, and obtain the sentence complexity of each of the multiple general sentence patterns.
  • In this embodiment, after the multiple general sentence patterns are obtained, the sentence pattern inclusion relationships among them can be obtained.
  • In some embodiments, parent and child nodes can be assigned according to the sample coverage of the multiple general sentence patterns. The general sentence pattern with the largest coverage is set as the parent node, and the remaining general sentence patterns are assigned, in descending order of sample coverage, to child nodes at successive levels from top to bottom. The parent node has the greatest generalization ability but lacks concrete meaning; from the top level of child nodes to the bottom, generalization ability decreases and concrete meaning increases.
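  • The coverage-based parent and child relationship can be sketched in Python as follows. This is only an illustration under stated assumptions: general sentence patterns are treated directly as regular expressions, "sample coverage" is taken to be the set of corpora a pattern matches in full, and one pattern is taken to include another when its coverage strictly contains the other's. The function names `sample_coverage` and `inclusion_edges` and the sample data are hypothetical.

```python
import re


def sample_coverage(patterns, corpora):
    """Map each pattern to the set of corpus indices it matches in full.

    Patterns are used as regular expressions as-is; literal text containing
    regex metacharacters would need re.escape in a real pipeline.
    """
    return {
        p: {i for i, s in enumerate(corpora) if re.fullmatch(p, s)}
        for p in patterns
    }


def inclusion_edges(coverage):
    """Return (child, parent) pairs: the parent's coverage strictly
    contains the child's coverage."""
    return [
        (child, parent)
        for child in coverage
        for parent in coverage
        if child != parent and coverage[child] < coverage[parent]
    ]


corpora = ["who is Alice", "who is Bob", "what is tea"]
patterns = ["(.+?) is (.+?)", "who is (.+?)"]

coverage = sample_coverage(patterns, corpora)
edges = inclusion_edges(coverage)
# Largest coverage first: the candidate parent node comes out on top.
ranked = sorted(patterns, key=lambda p: len(coverage[p]), reverse=True)
print(ranked)
print(edges)
```

  • Here the broad pattern `(.+?) is (.+?)` covers all three samples and becomes the parent, while `who is (.+?)` covers two and becomes its child: more concrete, less general.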
  • FIG. 3 shows a schematic diagram of the sentence pattern inclusion relationships among multiple general sentence patterns provided by an embodiment of the present application.
  • As shown in FIG. 3, the multiple general sentence patterns include general sentence pattern S0, general sentence pattern S1, and further general sentence patterns at lower levels. General sentence pattern S0 and general sentence pattern S1 each cover several of the lower-level general sentence patterns, and some of those in turn cover further general sentence patterns. Therefore, general sentence pattern S0 and general sentence pattern S1 can be determined as parent nodes, and the remaining general sentence patterns can be determined as child nodes.
  • In this embodiment, after the multiple general sentence patterns are obtained, the sentence complexity of each of them can be obtained.
  • The greater the sentence complexity of a general sentence pattern, the more complex the pattern is and the more concrete its meaning.
  • The sentence complexity of each general sentence pattern can be computed from n and t, where n is the number of segments into which the general sentence pattern is divided and t is the number of words in each segment. One example of a general sentence pattern whose sentence complexity is computed in this way is "(.+?) is (.+?) which country's (.+?)".
  • Step S240: Filter the multiple general sentence patterns based on the sentence pattern inclusion relationships among them and the sentence complexity of each general sentence pattern, and select from them the general sentence patterns that meet the specified standard as standard sentence patterns.
  • The sentence pattern inclusion relationships among the multiple general sentence patterns reflect the generalization ability of each general sentence pattern, and the sentence complexity of each general sentence pattern reflects how concrete its meaning is. Therefore, in this embodiment, after the inclusion relationships and the sentence complexity of each general sentence pattern are obtained, the multiple general sentence patterns can be filtered on both to select those that meet the specified standard as standard sentence patterns. The selected general sentence patterns can have whatever degree of generalization ability and concrete meaning the requirements call for.
  • In some embodiments, when the requirement is to select general sentence patterns with strong generalization ability and weak concrete meaning, the multiple general sentence patterns can be filtered on their inclusion relationships and sentence complexity to select those with larger sample coverage and lower sentence complexity as standard sentence patterns.
  • In some embodiments, when the requirement is to select general sentence patterns with weak generalization ability and strong concrete meaning, the multiple general sentence patterns can be filtered on their inclusion relationships and sentence complexity to select those with smaller sample coverage and higher sentence complexity as standard sentence patterns.
  • In some embodiments, when the requirement is to select general sentence patterns with a certain generalization ability and a certain concrete meaning, the multiple general sentence patterns can be filtered on their inclusion relationships and sentence complexity to select, as standard sentence patterns, those whose inclusion relationship with the other general sentence patterns meets a first specified standard and whose sentence complexity meets a second specified standard.
  • The first specified standard can be preset and stored as the basis for judging the sentence pattern inclusion relationship between a given general sentence pattern and the other general sentence patterns. After that inclusion relationship is obtained, it can be compared with the first specified standard to determine whether it meets the first specified standard.
  • The second specified standard can be preset and stored as the basis for judging the sentence complexity of each general sentence pattern. After the sentence complexity of each general sentence pattern is obtained, it can be compared with the second specified standard to determine whether it meets the second specified standard.
  • FIG. 4 shows a schematic flowchart of step S240 of the sentence pattern mining method shown in FIG. 2 of the present application.
  • The process shown in FIG. 4 is described in detail below, and the method may specifically include the following steps:
  • Step S241: Based on the sentence pattern inclusion relationships among the multiple general sentence patterns, obtain the graph in-degree of each of the multiple general sentence patterns.
  • In this embodiment, after the sentence pattern inclusion relationships among the multiple general sentence patterns are obtained, the graph in-degree of each general sentence pattern can be obtained from those relationships. The graph in-degree reflects, to a certain extent, the generalization ability of a general sentence pattern. For example, among the general sentence patterns shown in FIG. 3, a general sentence pattern whose graph in-degree is greater than that of another has the stronger generalization ability of the two.
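  • Counting the graph in-degree from the inclusion relationships can be sketched as below. The edge direction is an assumption made for illustration: each edge points from a covered (child) pattern to the pattern that covers it (parent), so that a larger in-degree corresponds to stronger generalization ability, as the description states. The function name `graph_in_degree` and the node labels are hypothetical.

```python
from collections import Counter


def graph_in_degree(patterns, edges):
    """Count, for every pattern, the (child, parent) edges arriving at it."""
    arriving = Counter(parent for _child, parent in edges)
    # Counter returns 0 for patterns that no edge points to.
    return {p: arriving[p] for p in patterns}


patterns = ["S0", "A", "B", "C"]
edges = [("A", "S0"), ("B", "S0"), ("C", "A")]  # (child, parent)

in_degree = graph_in_degree(patterns, edges)
print(in_degree)
```

  • In this toy graph the top-level pattern receives two edges and the leaf patterns receive none, which matches the intuition that leaves are the most concrete and least general.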
  • Step S242: Filter the multiple general sentence patterns based on the graph in-degree and the sentence complexity of each general sentence pattern, and select from them the general sentence patterns that meet the specified standard as standard sentence patterns.
  • The graph in-degree of each general sentence pattern reflects its generalization ability, and the sentence complexity of each general sentence pattern reflects how concrete its meaning is. Therefore, in this embodiment, after the graph in-degree and the sentence complexity of each general sentence pattern are obtained, the multiple general sentence patterns can be filtered on both to select those that meet the specified standard as standard sentence patterns.
  • In some embodiments, when the requirement is to select general sentence patterns with strong generalization ability and weak concrete meaning, the multiple general sentence patterns can be filtered on graph in-degree and sentence complexity to select those with a larger graph in-degree and lower sentence complexity as standard sentence patterns.
  • In some embodiments, when the requirement is to select general sentence patterns with weak generalization ability and strong concrete meaning, the multiple general sentence patterns can be filtered on graph in-degree and sentence complexity to select those with a smaller graph in-degree and higher sentence complexity as standard sentence patterns.
  • In some embodiments, when the requirement is to select general sentence patterns with a certain generalization ability and a certain concrete meaning, the multiple general sentence patterns can be filtered on graph in-degree and sentence complexity to select, as standard sentence patterns, those whose graph in-degree meets a third specified standard and whose sentence complexity meets the second specified standard.
  • The third specified standard can be preset and stored as the basis for judging the graph in-degree of a general sentence pattern. After the graph in-degree of a general sentence pattern is obtained, it can be compared with the third specified standard to determine whether it meets the third specified standard.
  • In some embodiments, a specified graph in-degree may be preset and stored as the basis for judging the graph in-degree of each general sentence pattern. When the graph in-degree of a general sentence pattern is greater than the specified graph in-degree, it can be determined that the graph in-degree meets the third specified standard; when it is not greater than the specified graph in-degree, it can be determined that the graph in-degree does not meet the third specified standard.
  • Similarly, a specified complexity can be preset and stored as the basis for judging the sentence complexity of each general sentence pattern. When the sentence complexity of a general sentence pattern is greater than the specified complexity, it can be determined that the sentence complexity meets the second specified standard; when it is not greater than the specified complexity, it can be determined that the sentence complexity does not meet the second specified standard. Therefore, in this embodiment, based on the specified graph in-degree and the specified complexity, the general sentence patterns whose graph in-degree is greater than the specified graph in-degree and whose sentence complexity is greater than the specified complexity can be selected from the multiple general sentence patterns as standard sentence patterns. The standard sentence patterns obtained in this way have both a certain generalization ability and a certain concrete meaning.
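  • The two-threshold filter described above can be sketched in Python. The class `GeneralPattern`, the function `select_standard_patterns`, and all numeric values are hypothetical; in particular the complexity values are placeholders, since they would come from the complexity measure of step S230.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GeneralPattern:
    text: str
    in_degree: int      # reflects generalization ability
    complexity: float   # reflects how concrete the pattern is


def select_standard_patterns(patterns, specified_in_degree, specified_complexity):
    """Keep patterns whose in-degree and complexity both strictly exceed
    the preset thresholds (third and second specified standards)."""
    return [
        p for p in patterns
        if p.in_degree > specified_in_degree
        and p.complexity > specified_complexity
    ]


candidates = [
    GeneralPattern("(.+?)", in_degree=9, complexity=0.1),             # too vague
    GeneralPattern("(.+?) is (.+?)", in_degree=5, complexity=1.5),    # balanced
    GeneralPattern("who is (.+?) of X", in_degree=0, complexity=4.0), # too narrow
]

standard = select_standard_patterns(
    candidates, specified_in_degree=2, specified_complexity=1.0
)
print([p.text for p in standard])
```

  • Both comparisons are strict, mirroring the "greater than" and "not greater than" wording: a pattern sitting exactly on a threshold is rejected.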
  • The sentence pattern mining method provided by this embodiment of the present application obtains multiple corpora to be mined, performs pairwise sequence alignment on them to obtain the corresponding multiple general sentence patterns, obtains the sentence pattern inclusion relationships among the multiple general sentence patterns and the sentence complexity of each, filters the multiple general sentence patterns on the inclusion relationships and the sentence complexity, and selects the general sentence patterns that meet the specified standard as standard sentence patterns.
  • Compared with the sentence pattern mining method shown in FIG. 1, this embodiment additionally filters the multiple general sentence patterns using the sentence pattern inclusion relationships among them and the sentence complexity of each, which improves the accuracy of the standard sentence patterns obtained.
  • FIG. 5 shows a schematic flowchart of a sentence pattern mining method provided by another embodiment of the present application.
  • The process shown in FIG. 5 is described in detail below, and the sentence pattern mining method may specifically include the following steps:
  • Step S310: Obtain multiple corpora to be mined.
  • For a detailed description of step S310, refer to step S110; it is not repeated here.
  • Step S320: Obtain the sequence type of each of the multiple corpora to be mined.
  • In some embodiments, pairwise sequence alignment may include global alignment and local alignment. Global alignment aligns every residue of each sequence and is usually applied when the sequences are of similar type or of approximately equal length. Global alignment can be performed with the Needleman-Wunsch algorithm, which is based on dynamic programming.
  • Local alignment is better suited to sequences whose types are not very similar. Local alignment can be performed with the Smith-Waterman algorithm.
  • Therefore, in this embodiment, the sequence type of each corpus to be mined can be obtained.
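  • The difference between the two alignment families can be seen in a compact Python sketch of the Needleman-Wunsch and Smith-Waterman scoring recurrences. The scoring parameters (`match=1`, `mismatch=-1`, `gap=-1`) and the function names are assumptions chosen for illustration; the source does not specify a scoring scheme. Only the optimal score is computed, not the traceback that would yield the aligned sentence pattern.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score: every character of both sequences takes part."""
    prev = [j * gap for j in range(len(b) + 1)]  # row 0: b aligned to gaps
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: a aligned to gaps
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment score: best-scoring pair of substrings.

    Cells are floored at 0, so a poor region never penalizes a good one.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Two strings that share only a short core.
print(needleman_wunsch_score("xxabcxx", "yyabczz"))
print(smith_waterman_score("xxabcxx", "yyabczz"))
```

  • For strings that share only a short core, the global score is dragged down by the unrelated flanks while the local score reflects just the shared core, which is why local alignment suits dissimilar sequences.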
  • Step S330 Based on the sequence type of each corpus to be mined, determine a processing method for double-sequence alignment of the plurality of corpora to be mined.
  • a processing method for double-sequence alignment of multiple corpora to be mined can be determined based on the sequence type of each corpus to be mined. In some embodiments, after obtaining the sequence type of each corpus to be mined, it can be determined from the global and local alignments to perform dual sequence alignment on multiple corpora to be mined based on the sequence type of each corpus to be mined Processing method.
  • FIG. 6 shows a schematic flowchart of step S330 of the sentence pattern mining method shown in FIG. 5 of the present application.
  • the process shown in FIG. 6 will be described in detail below, and the method may specifically include the following steps:
  • Step S331 Based on the sequence type of each corpus to be mined, obtain the sequence similarity between the plurality of corpus to be mined.
  • the sequence similarity between multiple corpora to be mined may be obtained based on the sequence type of each corpus to be mined. As a way, after obtaining the sequence type of each corpus to be mined, the sequence types of multiple corpora to be mined can be matched to obtain the sequence similarity between the multiple corpora to be mined.
  • Step S332 Based on the sequence similarity between the plurality of corpora to be mined, determine a processing mode for the dual sequence alignment of the plurality of corpora to be mined from the global comparison and the local comparison.
  • the processing method for double-sequence alignment of the multiple corpora to be mined may be determined from the global alignment and the local alignment based on the sequence similarity between the multiple corpora to be mined.
  • that is, based on the sequence similarity between the multiple corpora to be mined, either the global alignment or the local alignment is determined as the processing method for double-sequence alignment of the multiple corpora to be mined.
  • FIG. 7 shows a schematic flowchart of step S332 of the sentence pattern mining method shown in FIG. 6 of the present application.
  • the process shown in FIG. 7 will be described in detail below, and the method may specifically include the following steps:
  • Step S3321 When the sequence similarity between the plurality of corpora to be mined is greater than the specified similarity, the global alignment is determined as a processing method of performing a double sequence alignment on the plurality of corpora to be mined.
  • because the global alignment aligns every element of each sequence, it is usually applied when the sequence types are similar or the sequence lengths are approximately equal. Therefore, in this embodiment, when the sequence similarity between the multiple corpora to be mined is greater than the specified similarity, the global alignment can be determined as the processing method for double-sequence alignment of the multiple corpora to be mined.
  • Step S3322 When the sequence similarity between the plurality of corpora to be mined is not greater than the specified similarity, the local alignment is determined as a processing method of performing a double sequence alignment on the plurality of corpora to be mined.
  • because the local alignment is better suited to situations where the sequence types are not very similar, in this embodiment, when the sequence similarity between the multiple corpora to be mined is not greater than the specified similarity, the local alignment can be determined as the processing method for double-sequence alignment of the multiple corpora to be mined.
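  • The selection in steps S3321 and S3322 can be sketched as a single threshold comparison. The source does not say how the sequence similarity is computed, so the sketch below assumes, purely for illustration, the mean of Python's `difflib.SequenceMatcher.ratio()` over all pairs of corpora; the function names are likewise hypothetical, and the specified similarity is left to the caller.

```python
from difflib import SequenceMatcher
from itertools import combinations


def mean_pairwise_similarity(corpora):
    """Assumed similarity measure: mean difflib ratio over all pairs of corpora."""
    pairs = list(combinations(corpora, 2))
    if not pairs:
        raise ValueError("at least two corpora are needed")
    total = sum(SequenceMatcher(None, a, b, autojunk=False).ratio() for a, b in pairs)
    return total / len(pairs)


def choose_alignment_mode(similarity, specified_similarity):
    """Global alignment if similarity exceeds the threshold, otherwise local."""
    # "Not greater than" includes equality, so the boundary case is local.
    return "global" if similarity > specified_similarity else "local"
```

  • Note that a similarity exactly equal to the specified similarity falls on the local-alignment side, matching the "not greater than" wording of step S3322.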
  • Step S340 Perform a dual-sequence comparison on the plurality of corpora to be mined based on the processing method to obtain a plurality of general sentence patterns corresponding to the plurality of corpora to be mined.
  • Step S350 Filter the multiple general sentence patterns, and filter the general sentence patterns that meet the specified standard from the multiple general sentence patterns as the standard sentence pattern.
  • step S340-step S350 please refer to step S120-step S130, which will not be repeated here.
  • the sentence mining method provided in another embodiment of the present application obtains multiple corpora to be mined, obtains the sequence type of each corpus to be mined, and determines, based on the sequence type of each corpus to be mined, the processing method for double-sequence alignment of the multiple corpora to be mined.
  • based on that processing method, double-sequence alignment is performed on the multiple corpora to be mined to obtain multiple general sentence patterns corresponding to them, and the multiple general sentence patterns are filtered so that the general sentence patterns meeting the specified standard are selected as standard sentence patterns.
  • this embodiment determines the adopted double-sequence alignment method based on the sequence type of each corpus to be mined, so as to improve the accuracy of the obtained general sentence patterns.
  • FIG. 8 shows a schematic flowchart of a sentence pattern mining method provided by another embodiment of the present application.
  • the process shown in Fig. 8 will be described in detail below.
  • the sentence mining method may specifically include the following steps:
  • Step S410 Obtain multiple corpora to be mined.
  • Step S420 Perform a dual sequence comparison on the multiple corpora to be mined to obtain multiple general sentence patterns corresponding to the multiple corpora to be mined.
  • Step S430 Filter the multiple general sentence patterns, and filter the general sentence patterns that meet the specified standard from the multiple general sentence patterns as the standard sentence pattern.
  • step S410-step S430 please refer to step S110-step S130, which will not be repeated here.
  • Step S440 Output the standard sentence pattern.
  • the standard sentence pattern can be output to serve the subsequent NLP downstream tasks.
  • this embodiment can be used to assist intention recognition: high-frequency questions and ways of asking are mined automatically from users' historical question-and-answer data, helping analysts and product managers understand user intentions quickly and reducing labor costs.
  • this embodiment can also be used to improve text classification models: in short-text classification tasks, some sentence patterns can be combined with entity information to handle texts whose class depends on the entity, and can be embedded into the model as prior or external knowledge.
  • this embodiment can also be used to build answer templates for community question-answering tasks: in NLP question-answering tasks, users' high-frequency ways of asking are discovered and answer template sentence patterns are then prepared for them (in some sensitive vertical domains, such as financial customer service, the answers to certain questions must be restricted to a fixed sentence pattern); alternatively, the sentence patterns of Q and A are mined from large-scale community question-answer (Q, A) pairs, and the patterns of A are organized into answer templates for Q.
  • FIG. 9 shows a schematic flowchart of step S440 of the sentence pattern mining method shown in FIG. 8 of the present application.
  • the process shown in FIG. 9 will be described in detail below, and the method may specifically include the following steps:
  • Step S441 When the standard sentence pattern is an inquiry sentence pattern, a standard reply sentence pattern is obtained based on the standard sentence pattern.
  • the type of the determined standard sentence pattern can be identified, where the type can include a declarative sentence pattern, an inquiry sentence pattern, and so on. In this embodiment, when the standard sentence pattern is identified as an inquiry sentence pattern, the standard reply sentence pattern corresponding to the standard sentence pattern can be obtained based on the standard sentence pattern.
  • one standard sentence pattern can correspond to one standard reply sentence pattern or to multiple standard reply sentence patterns, which is not limited here.
  • Step S442 Output the standard sentence pattern and the standard reply sentence pattern.
  • the standard sentence pattern and the standard reply sentence pattern can be output.
  • the sentence pattern mining method provided by another embodiment of the present application obtains multiple corpora to be mined and performs double-sequence alignment on them to obtain multiple general sentence patterns corresponding to the multiple corpora to be mined.
  • the multiple general sentence patterns are filtered, the general sentence patterns that meet the specified standard are selected from them as standard sentence patterns, and the standard sentence patterns are output.
  • this embodiment also outputs the standard sentence patterns for use by the corresponding downstream tasks, so as to improve the response accuracy of the downstream tasks.
  • FIG. 10 shows a schematic flowchart of a sentence pattern mining method provided by yet another embodiment of the present application. The following will elaborate on the process shown in FIG. 10, and the sentence mining method may specifically include the following steps:
  • Step S510 Obtain a training data set, where the training data set includes multiple corpora and multiple standard sentence patterns.
  • the embodiment of the present application also includes a method for training a sentence mining model. The sentence mining model can be trained in advance on the acquired training data set, and each subsequent round of sentence mining can then use the trained model, so the model does not need to be retrained every time sentence mining is performed.
  • a training data set may be collected, where the training data set includes multiple corpora and multiple standard sentence patterns.
  • Step S520 Based on the training data set, each corpus is used as input data, and each standard sentence pattern is used as output data, and a machine learning algorithm is used for training to obtain a trained sentence pattern mining model.
  • a machine learning algorithm may be used for training, so as to obtain a sentence mining model.
  • the machine learning algorithms used can include: neural network, Long Short-Term Memory (LSTM) network, gated recurrent unit, simple recurrent unit, autoencoder, decision tree, random forest, feature mean classification, classification and regression tree, hidden Markov model, K-Nearest Neighbor (KNN) algorithm, logistic regression model, Bayesian model, Gaussian model, KL divergence (Kullback-Leibler divergence), etc.
  • the specific machine learning algorithm is not limited here.
  • the following takes a neural network as an example to illustrate the training of the initial model based on the training data set.
  • the corpus in a set of data in the training data set is used as the input sample (input data) of the neural network, and the standard sentence pattern in the set of data is used as the output sample (output data) of the neural network.
  • the neurons in the input layer are fully connected with the neurons in the hidden layer, and the neurons in the hidden layer are fully connected with the neurons in the output layer, which can effectively extract potential features of different granularities.
  • the number of hidden layers can be multiple, which can better fit the non-linear relationship and make the sentence mining model obtained by training more accurate.
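  • The fully connected structure described above can be sketched as a plain forward pass. This is only an illustration under stated assumptions: the source does not say how a corpus or a standard sentence pattern is encoded numerically, so the inputs here are generic feature vectors, and the names `dense` and `forward`, the activation functions, and all weights are hypothetical.

```python
import math


def dense(inputs, weights, biases, activation):
    """One fully connected layer: every output neuron sees every input value."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]


def forward(inputs, layers):
    """Run the input through the hidden layer(s) and then the output layer."""
    values = inputs
    for weights, biases, activation in layers:
        values = dense(values, weights, biases, activation)
    return values


def identity(value):
    """Linear activation, used here for the output layer."""
    return value


# Hypothetical network: 2 inputs -> 2 hidden neurons (tanh) -> 1 output.
example_layers = [
    ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], math.tanh),
    ([[1.0, 1.0]], [0.0], identity),
]
example_output = forward([1.0, 2.0], example_layers)
```

  • Adding further `(weights, biases, activation)` entries to the list corresponds to the multiple hidden layers mentioned above; training (adjusting the weights from the training data set) is outside the scope of this sketch.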
  • the training process of the sentence mining model may or may not be completed by electronic equipment.
  • the electronic device can be used only as a direct user or an indirect user.
  • the sentence mining model may periodically or irregularly obtain new training data, and the sentence mining model can be trained and updated.
  • Step S530 Obtain multiple corpora to be mined.
  • Step S540 Perform a two-sequence comparison on the plurality of corpora to be mined to obtain a plurality of general sentence patterns corresponding to the plurality of corpora to be mined.
  • Step S550 Filter the multiple general sentence patterns, and filter the general sentence patterns that meet the specified standard from the multiple general sentence patterns as the standard sentence pattern.
  • for details of step S530 to step S550, please refer to step S110 to step S130, which will not be repeated here.
  • the sentence pattern mining method provided in another embodiment of this application obtains a training data set that includes multiple corpora and multiple standard sentence patterns.
  • based on the training data set, each corpus is used as input data and each standard sentence pattern is used as output data, and training is performed with a machine learning algorithm to obtain a trained sentence pattern mining model. Multiple corpora to be mined are then obtained, and double-sequence alignment is performed on them to obtain multiple general sentence patterns corresponding to the multiple corpora to be mined.
  • the multiple general sentence patterns are filtered, and the general sentence patterns that meet the specified standard are selected from them as standard sentence patterns.
  • this embodiment also collects a training data set and trains a sentence pattern mining model for mining standard sentence patterns from corpora, so as to improve the accuracy of the obtained standard sentence patterns.
  • FIG. 11 shows a block diagram of a sentence pattern mining device 200 provided by an embodiment of the present application. The following will describe the block diagram shown in FIG. 11.
  • the sentence pattern mining device 200 includes: a corpus to be mined acquisition module 210, a general sentence pattern obtaining module 220, and a standard sentence pattern obtaining module 230, in which:
  • the corpus to be mined acquisition module 210 is used to obtain multiple corpora to be mined.
  • the general sentence pattern obtaining module 220 is configured to perform a double sequence comparison on the plurality of corpus to be mined to obtain a plurality of general sentence patterns corresponding to the plurality of corpus to be mined.
  • the general sentence pattern obtaining module 220 includes: a sequence type obtaining submodule, a processing mode determining submodule, and a general sentence pattern obtaining submodule, wherein:
  • the sequence type acquisition sub-module is used to acquire the sequence type of each of the plurality of corpus to be mined.
  • the processing mode determination sub-module is used to determine the processing mode for the dual sequence comparison of the multiple corpora to be mined based on the sequence type of each corpus to be mined.
  • processing mode determining sub-module includes: a processing mode determining unit, wherein:
  • the processing mode determining unit is configured to determine the processing mode of the dual sequence alignment of the multiple corpus to be mined from the global comparison and the local comparison based on the sequence type of each corpus to be mined.
  • processing mode determining unit includes: a sequence similarity obtaining subunit and a processing mode determining subunit, wherein:
  • the sequence similarity obtaining subunit is configured to obtain the sequence similarity between the multiple corpora to be mined based on the sequence type of each corpus to be mined.
  • the processing mode determining subunit is used to determine, based on the sequence similarity between the multiple corpora to be mined, from the global comparison and the local comparison to perform a dual sequence alignment on the multiple corpora to be mined Processing method.
  • processing mode determining sub-unit includes: a first processing mode determining sub-unit and a second processing mode determining sub-subunit, wherein:
  • the first processing mode determination sub-unit is used to determine the global alignment as the processing method for double-sequence alignment of the plurality of corpora to be mined when the sequence similarity between the plurality of corpora to be mined is greater than the specified similarity.
  • the second processing mode determination sub-unit is used to determine the local alignment as the processing method for double-sequence alignment of the plurality of corpora to be mined when the sequence similarity between the plurality of corpora to be mined is not greater than the specified similarity.
  • the general sentence pattern obtaining submodule is configured to perform a double sequence comparison on the plurality of corpus to be mined based on the processing method, and obtain a plurality of general sentence patterns corresponding to the plurality of corpus to be mined.
  • the standard sentence pattern obtaining module 230 is configured to filter the multiple general sentence patterns, and filter the general sentence patterns that meet the specified standard from the multiple general sentence patterns as the standard sentence pattern.
  • the standard sentence pattern obtaining module 230 includes: an information obtaining submodule and a standard sentence pattern obtaining submodule, wherein:
  • the information acquisition sub-module is used to acquire the sentence pattern inclusion relationship between the multiple general sentence patterns, and acquire the sentence complexity of each general sentence pattern in the multiple general sentence patterns.
  • the information acquisition sub-module includes: a sentence complexity acquisition unit, wherein:
  • the sentence complexity acquisition unit is used to acquire the sentence complexity of each general sentence pattern in the plurality of general sentence patterns based on a formula (given as an image in the original filing and not reproduced in this text), where n represents the number of times the general sentence pattern is divided, and t represents the number of words in each segment of the general sentence pattern.
  • the standard sentence pattern obtaining submodule is used to filter the multiple general sentence patterns based on the sentence pattern inclusion relationship between the multiple general sentence patterns and the sentence complexity of each general sentence pattern, Among the multiple general sentence patterns, the general sentence pattern that meets the specified standard is selected as the standard sentence pattern.
  • the standard sentence pattern obtaining submodule includes: a first standard sentence pattern obtaining unit, wherein:
  • the first standard sentence pattern obtaining unit is used to select, from the plurality of general sentence patterns, the general sentence patterns whose sentence pattern inclusion relationship with other general sentence patterns meets the first specified standard and whose sentence complexity meets the second specified standard as standard sentence patterns.
  • the standard sentence pattern obtaining submodule includes: a graph in-degree obtaining unit and a second standard sentence pattern obtaining unit, wherein:
  • the graph in-degree obtaining unit is configured to obtain the graph in-degree of each general sentence pattern in the plurality of general sentence patterns based on the sentence pattern inclusion relationship between the plurality of general sentence patterns.
  • the second standard sentence pattern obtaining unit is used to filter the plurality of general sentence patterns based on the graph in-degree and the sentence complexity of each general sentence pattern, and to select from them the general sentence patterns that meet the specified standard as standard sentence patterns.
  • the second standard sentence pattern obtaining unit includes: a standard sentence pattern obtaining subunit, wherein:
  • the standard sentence pattern obtaining subunit is used to select, from the plurality of general sentence patterns, the general sentence patterns whose graph in-degree meets the third specified standard and whose sentence complexity meets the second specified standard as standard sentence patterns.
  • the standard sentence pattern obtaining subunit includes: a standard sentence pattern obtaining sub-subunit, wherein:
  • the standard sentence pattern obtaining sub-subunit is used to select, from the plurality of general sentence patterns, the general sentence patterns whose graph in-degree is greater than the specified graph in-degree and whose sentence complexity is greater than the specified complexity as standard sentence patterns.
  • the sentence pattern mining device 200 further includes: a standard sentence pattern output module, wherein:
  • the standard sentence pattern output module is used to output the standard sentence pattern.
  • the standard sentence pattern output module includes: a standard reply sentence pattern acquisition submodule and a standard sentence pattern output submodule, wherein:
  • the standard reply sentence pattern acquisition submodule is used to acquire the standard reply sentence pattern based on the standard sentence pattern when the standard sentence pattern is an inquiry sentence pattern.
  • the standard sentence pattern output sub-module is used to output the standard sentence pattern and the standard reply sentence pattern.
  • the sentence pattern mining device 200 further includes: a training data set acquisition module and a sentence pattern mining model training module, wherein:
  • the training data set acquisition module is used to acquire a training data set, and the training data set includes a plurality of corpus and a plurality of standard sentence patterns.
  • the sentence pattern mining model training module uses each corpus as input data and each standard sentence pattern as output data, and trains through machine learning algorithms to obtain a trained sentence pattern mining model.
  • the coupling between the modules may be electrical, mechanical or other forms of coupling.
  • each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or software functional modules.
  • FIG. 12 shows a structural block diagram of an electronic device 100 provided by an embodiment of the present application.
  • the electronic device 100 may be an electronic device capable of running application programs, such as a smart phone, a tablet computer, or an e-book.
  • the electronic device 100 in this application may include one or more of the following components: a processor 110, a memory 120, and one or more application programs.
  • the one or more application programs may be stored in the memory 120 and configured to be executed by the one or more processors 110, and the one or more programs are configured to execute the method described in the foregoing method embodiment.
  • the processor 110 may include one or more processing cores.
  • the processor 110 uses various interfaces and lines to connect the various parts of the entire electronic device 100, and performs the various functions of the electronic device 100 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 120 and by calling data stored in the memory 120.
  • the processor 110 may be implemented in hardware using at least one of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA).
  • the processor 110 may integrate one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), a modem, and the like.
  • the CPU mainly processes the operating system, user interface, and application programs; the GPU is used for rendering and drawing the content to be displayed; the modem is used for processing wireless communication. It is understandable that the above-mentioned modem may not be integrated into the processor 110, but may be implemented by a communication chip alone.
  • the memory 120 may include random access memory (RAM) or read-only memory (ROM).
  • the memory 120 may be used to store instructions, programs, codes, code sets or instruction sets.
  • the memory 120 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playback function, an image playback function, etc.), instructions for implementing the various method embodiments, and so on.
  • the data storage area can also store data (such as phone book, audio and video data, chat record data) created by the electronic device 100 during use.
  • FIG. 13 shows a structural block diagram of a computer-readable storage medium provided by an embodiment of the present application.
  • the computer-readable medium 300 stores program code, and the program code can be invoked by a processor to execute the method described in the foregoing method embodiment.
  • the computer-readable storage medium 300 may be an electronic memory such as flash memory, EEPROM (Electrically Erasable Programmable Read Only Memory), EPROM, hard disk, or ROM.
  • the computer-readable storage medium 300 includes a non-transitory computer-readable storage medium.
  • the computer-readable storage medium 300 has storage space for the program code 310 for executing any method steps in the above-mentioned methods. These program codes can be read from or written into one or more computer program products.
  • the program code 310 may be compressed in a suitable form, for example.
  • the sentence mining method, device, electronic device, and storage medium provided by the embodiments of the present application obtain multiple corpora to be mined and perform double-sequence alignment on them to obtain multiple general sentence patterns corresponding to the multiple corpora to be mined.
  • the multiple general sentence patterns are filtered, and the general sentence patterns that meet the specified standard are selected from them as standard sentence patterns. General sentence patterns are thus obtained by double-sequence alignment of the corpora to be mined and then filtered to obtain standard sentence patterns, so that standard sentence patterns can be obtained from the corpora to be mined quickly and conveniently for further processing.

Abstract

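A sentence pattern mining method, device, electronic device, and storage medium, relating to the technical field of electronic devices. The method includes: obtaining multiple corpora to be mined (S110); performing double-sequence alignment on the multiple corpora to be mined to obtain multiple general sentence patterns corresponding to them (S120); and filtering the multiple general sentence patterns and selecting from them the general sentence patterns that meet a specified standard as standard sentence patterns (S130). By performing double-sequence alignment on the corpora to be mined to obtain general sentence patterns and then filtering the general sentence patterns to obtain standard sentence patterns, the method obtains standard sentence patterns from the corpora to be mined quickly and conveniently for further processing.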

Description

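Sentence pattern mining method, device, electronic device, and storage medium
Technical Field
This application relates to the technical field of electronic devices, and more specifically, to a sentence pattern mining method, device, electronic device, and storage medium.
Background
In real-world Internet services, large amounts of formatted information are frequently encountered. How to process this structured information effectively through general sentence pattern mining has become one of the directions of interest for many natural language processing researchers.
Summary
In view of the above problems, this application proposes a sentence pattern mining method, device, electronic device, and storage medium to solve them.
In a first aspect, an embodiment of this application provides a sentence pattern mining method. The method includes: obtaining multiple corpora to be mined; performing double-sequence alignment on the multiple corpora to be mined to obtain multiple general sentence patterns corresponding to the multiple corpora to be mined; and filtering the multiple general sentence patterns and selecting from them the general sentence patterns that meet a specified standard as standard sentence patterns.
In a second aspect, an embodiment of this application provides a sentence pattern mining device. The device includes: a corpus to be mined acquisition module, used to obtain multiple corpora to be mined; a general sentence pattern obtaining module, used to perform double-sequence alignment on the multiple corpora to be mined to obtain multiple general sentence patterns corresponding to the multiple corpora to be mined; and a standard sentence pattern obtaining module, used to filter the multiple general sentence patterns and select from them the general sentence patterns that meet a specified standard as standard sentence patterns.
In a third aspect, an embodiment of this application provides an electronic device including a memory and a processor. The memory is coupled to the processor and stores instructions; when the instructions are executed by the processor, the processor performs the above method.
In a fourth aspect, an embodiment of this application provides a computer-readable storage medium storing program code that can be invoked by a processor to perform the above method.
The sentence pattern mining method, device, electronic device, and storage medium provided by the embodiments of this application obtain multiple corpora to be mined, perform double-sequence alignment on them to obtain multiple general sentence patterns corresponding to the multiple corpora to be mined, filter the multiple general sentence patterns, and select from them the general sentence patterns that meet a specified standard as standard sentence patterns. General sentence patterns are thus obtained by double-sequence alignment of the corpora to be mined and then filtered to obtain standard sentence patterns, so that standard sentence patterns can be obtained from the corpora to be mined quickly and conveniently for further processing.
Brief Description of the Drawings
In order to explain the technical solutions in the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; those skilled in the art can obtain other drawings from them without creative effort.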
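FIG. 1 shows a schematic flowchart of a sentence pattern mining method provided by one embodiment of this application;
FIG. 2 shows a schematic flowchart of a sentence pattern mining method provided by yet another embodiment of this application;
FIG. 3 shows a schematic diagram of the sentence pattern inclusion relationship between multiple general sentence patterns provided by an embodiment of this application;
FIG. 4 shows a schematic flowchart of step S240 of the sentence pattern mining method shown in FIG. 2 of this application;
FIG. 5 shows a schematic flowchart of a sentence pattern mining method provided by still another embodiment of this application;
FIG. 6 shows a schematic flowchart of step S330 of the sentence pattern mining method shown in FIG. 5 of this application;
FIG. 7 shows a schematic flowchart of step S332 of the sentence pattern mining method shown in FIG. 6 of this application;
FIG. 8 shows a schematic flowchart of a sentence pattern mining method provided by another embodiment of this application;
FIG. 9 shows a schematic flowchart of step S440 of the sentence pattern mining method shown in FIG. 8 of this application;
FIG. 10 shows a schematic flowchart of a sentence pattern mining method provided by yet a further embodiment of this application;
FIG. 11 shows a block diagram of the sentence pattern mining device provided by an embodiment of this application;
FIG. 12 shows a block diagram of an electronic device used to execute the sentence pattern mining method according to an embodiment of this application;
FIG. 13 shows a storage unit, according to an embodiment of this application, used to store or carry program code that implements the sentence pattern mining method according to an embodiment of this application.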
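Detailed Description
In order to enable those skilled in the art to better understand the solution of this application, the technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments of this application.
In recent years, with the rapid development of artificial intelligence (AI) technologies, more and more application scenarios have been put into practice, for example computer vision (CV) and natural language processing (NLP), greatly improving people's daily lives. In particular, researchers' enthusiasm for NLP in recent years has driven rapid progress in language models; Google's transformer model, based purely on the attention mechanism, and the BERT (bidirectional encoder representations from transformers) model, based on the transformer model, are both recent research results. In real-world Internet services, large amounts of formatted user information are frequently encountered. How to process this structured information effectively through general sentence pattern mining, so as to facilitate the corresponding downstream NLP tasks (such as intelligent customer service, community question answering, and short-text classification), has become one of the directions of interest for many NLP researchers.
Generally, current sentence pattern mining methods fall into the following two categories:
(1) Manually mined regular expressions: formatted data is analyzed manually to find the general format of the relevant sentence patterns, and regular expressions are generated for downstream NLP tasks.
(2) Approaches based on large-scale language models: a large-scale language model (such as BERT) is trained on a large amount of corpus data to obtain embedded representations of the relevant fixed sentence patterns.
Through research, the inventor found the following. For manually mined regular expressions, summarizing the regular expressions of the relevant sentence patterns by manual discovery and collation can guarantee accuracy, but data in intelligent customer service and community question-answering scenarios follows a long-tail distribution, so many special sentence patterns may not be mined effectively, and the data volume is so large that the work is time-consuming and laborious. For approaches based on large-scale language models, in short-text classification scenarios the domain category of some sentence patterns depends only on the entity part, such as "[entity]是什么" ("what is [entity]") or "[entity]是谁" ("who is [entity]"), and [entity] is diverse and variable. The classification of such questions cannot be handled well by neural-network-based language models, so it is desirable to mine the relevant sentence patterns and handle such questions by combining sentence patterns with [entity] verification. Moreover, experiments with neural-network-based language models are extremely costly and have long computation cycles, which makes them unsuitable for small and medium-sized enterprises that hold large amounts of corpus data and want to iterate and deploy quickly.
In response to the above problems, the inventor, after long-term research, proposes the sentence pattern mining method, device, electronic device, and storage medium provided by the embodiments of this application, which obtain general sentence patterns by performing double-sequence alignment on the corpora to be mined and then filter the general sentence patterns to obtain standard sentence patterns, so that standard sentence patterns can be obtained from the corpora to be mined quickly and conveniently for further processing. The specific sentence pattern mining method is described in detail in the subsequent embodiments.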
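Referring to FIG. 1, FIG. 1 shows a schematic flowchart of a sentence pattern mining method provided by one embodiment of this application. The method obtains general sentence patterns by performing double-sequence alignment on the corpora to be mined and then filters the general sentence patterns to obtain standard sentence patterns, so that standard sentence patterns can be obtained from the corpora to be mined quickly and conveniently for further processing. In a specific embodiment, the method is applied to the sentence pattern mining device 200 shown in FIG. 11 and to the electronic device 100 (FIG. 12) equipped with the sentence pattern mining device 200. The specific flow of this embodiment is described below taking an electronic device as an example; the electronic device to which this embodiment is applied may include a mobile terminal, a tablet computer, a desktop computer, a wearable electronic device, and the like, which is not limited here. The flow shown in FIG. 1 is described in detail below; the sentence pattern mining method may specifically include the following steps:
Step S110: Obtain multiple corpora to be mined.
In this embodiment, multiple corpora to be mined can be obtained. In some embodiments, the multiple corpora to be mined can be obtained from community question-and-answer data, from short texts, or partly from community question-and-answer data and partly from short texts, which is not limited here.
In some embodiments, the multiple corpora to be mined can be obtained from a server, for example from community question-and-answer data or short texts recorded on the server, or from other electronic devices, for example from community question-and-answer data or short texts recorded on those devices. When the multiple corpora to be mined are obtained from a server or other electronic devices, they can be obtained through a wireless network or a data network.
In some embodiments, taking corpora obtained from community question-and-answer data as an example, "栗胸白脸刺莺是居住在哪个国家的鸟" ("The chestnut-breasted whiteface is a bird living in which country") can be obtained as a corpus to be mined, "阿尔文是哪个国家的城市" ("Alvin is a city in which country") can be obtained as a corpus to be mined, and so on, which is not limited here.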
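Step S120: Perform double-sequence alignment on the multiple corpora to be mined to obtain multiple general sentence patterns corresponding to the multiple corpora to be mined.
In this embodiment, after the multiple corpora to be mined are obtained, double-sequence alignment (pairwise alignment) can be performed on them to obtain multiple general sentence patterns corresponding to the multiple corpora to be mined. Double-sequence alignment is one of the research areas of bioinformatics; its approach is to design targeted, efficient algorithms that compare two DNA or protein sequences and find the maximum similarity match between them, in order to judge whether they are homologous. In this embodiment, double-sequence alignment is used to process the multiple corpora to be mined and obtain the maximum-similarity matching sentence patterns between them, that is, the multiple general sentence patterns corresponding to the multiple corpora to be mined. By transferring the double-sequence alignment algorithm from bioinformatics to sentence pattern learning, sentence patterns can be matched at byte granularity, which avoids the errors that traditional segmentation methods incur from semantic segmentation mistakes and human spelling mistakes. In some embodiments, after the multiple corpora to be mined are obtained, double-sequence alignment can be performed on them pair by pair to obtain the multiple general sentence patterns corresponding to the multiple corpora to be mined.
For example, suppose the corpora to be mined include "栗胸白脸刺莺是居住在哪个国家的鸟" and "阿尔文是哪个国家的城市". Performing double-sequence alignment on these two corpora yields the general sentence pattern: (.+?)是(.+?)哪个国家的(.+?). As another example, suppose the corpora to be mined include "成都坐火车去北京要多久" ("How long does it take to go from Chengdu to Beijing by train") and "成都坐飞机去北京要多久" ("How long does it take to go from Chengdu to Beijing by plane"). Performing double-sequence alignment on these two corpora yields the general sentence pattern: 成都(.+?)去北京要多久.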
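Step S130: Filter the multiple general sentence patterns, and select from them the general sentence patterns that meet a specified standard as standard sentence patterns.
Performing double-sequence alignment on multiple corpora to be mined generally extracts a large number of general sentence patterns, so a quantitative mechanism can be used to mine the sentence patterns that carry a certain concrete meaning and have a certain generalization ability. In this embodiment, after double-sequence alignment of the multiple corpora to be mined yields multiple general sentence patterns, the general sentence patterns can be filtered to select those that meet the specified standard as standard sentence patterns. A general sentence pattern that meets the specified standard can be one that carries a certain concrete meaning and has a certain generalization ability, so that quantitative indicators measure the degree of generalization and the concrete meaning of a standard sentence pattern and the standard sentence patterns mined from the multiple corpora are more accurate.
In some embodiments, a general sentence pattern filtering rule can be set and stored in advance. After the multiple general sentence patterns corresponding to the multiple corpora to be mined are obtained, they can be filtered based on this rule to select the general sentence patterns that meet the specified standard as standard sentence patterns. As one approach, after the multiple general sentence patterns are obtained, each of them can be checked in turn against the filtering rule, and the general sentence patterns that meet the specified standard can be selected according to the results. Specifically, a general sentence pattern whose result indicates that it satisfies the filtering rule is determined to meet the specified standard, that is, to be a standard sentence pattern; a general sentence pattern whose result indicates that it does not satisfy the filtering rule is determined not to meet the specified standard, that is, to be a non-standard sentence pattern.
The sentence pattern mining method provided by this embodiment of the application obtains multiple corpora to be mined, performs double-sequence alignment on them to obtain multiple general sentence patterns corresponding to the multiple corpora to be mined, filters the multiple general sentence patterns, and selects from them the general sentence patterns that meet the specified standard as standard sentence patterns. General sentence patterns are thus obtained by double-sequence alignment of the corpora to be mined and then filtered to obtain standard sentence patterns, so that standard sentence patterns can be obtained from the corpora to be mined quickly and conveniently for further processing.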
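Referring to FIG. 2, FIG. 2 shows a schematic flowchart of a sentence pattern mining method provided by yet another embodiment of this application. The flow shown in FIG. 2 is described in detail below; the sentence pattern mining method may specifically include the following steps:
Step S210: Obtain multiple corpora to be mined.
Step S220: Perform double-sequence alignment on the multiple corpora to be mined to obtain multiple general sentence patterns corresponding to the multiple corpora to be mined.
For details of steps S210 to S220, please refer to steps S110 to S120, which will not be repeated here.
Step S230: Obtain the sentence pattern inclusion relationship between the multiple general sentence patterns, and obtain the sentence complexity of each of the multiple general sentence patterns.
In this embodiment, after the multiple general sentence patterns are obtained, the sentence pattern inclusion relationship between them can be obtained. In some embodiments, the inclusion relationship can be obtained based on the sample coverage of the multiple general sentence patterns. Specifically, parent-child node relationships can be derived from the sample coverage: the sentence pattern with the largest coverage is set as the parent node, and the remaining general sentence patterns are assigned, in descending order of sample coverage, to child nodes at successive levels from top to bottom. In other words, the parent node has the greatest generalization ability but carries little concrete meaning, while the child nodes at successive levels from top to bottom have decreasing generalization ability and increasing concrete meaning.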
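One way to picture the coverage-based hierarchy is the sketch below. It rests on assumptions the source does not state: a general sentence pattern is treated as a regular expression, its sample coverage is the set of corpora it fully matches, and a pattern is taken to be a parent of another when its coverage strictly contains the other's. The function names are hypothetical.

```python
import re


def coverage(pattern, corpora):
    """Set of corpora that the general sentence pattern matches in full."""
    compiled = re.compile(pattern)
    return frozenset(text for text in corpora if compiled.fullmatch(text))


def inclusion_edges(patterns, corpora):
    """Return (coverage map, list of (parent, child) pairs)."""
    covered = {p: coverage(p, corpora) for p in patterns}
    # Patterns with larger coverage come first, mirroring the top-down levels.
    ordered = sorted(patterns, key=lambda p: len(covered[p]), reverse=True)
    edges = [
        (parent, child)
        for parent in ordered
        for child in ordered
        if covered[child] < covered[parent]  # strict subset of samples
    ]
    return covered, edges
```

In this sketch the number of edges pointing at or away from a pattern gives a simple numeric handle on how general it is, which a later filtering step could compare against a threshold.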
Referring to FIG. 3, FIG. 3 shows a schematic diagram of the sentence pattern inclusion relationship between multiple general sentence patterns provided by an embodiment of this application. As shown in FIG. 3, the multiple general sentence patterns include general sentence pattern S0, general sentence pattern S1, and further general sentence patterns whose symbols appear in the source only as formula images (PCTCN2020084769-appb-000001 to PCTCN2020084769-appb-000024). General sentence pattern S0 covers three of these general sentence patterns, and general sentence pattern S1 also covers three. Of the remaining general sentence patterns, two each cover two further general sentence patterns, and one covers a single further general sentence pattern. Therefore, general sentence pattern S0 and general sentence pattern S1 can be determined as parent nodes, and the remaining general sentence patterns can be determined as child nodes.
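In this embodiment, after the multiple general sentence patterns are obtained, the pattern complexity of each of them may be obtained. The greater the pattern complexity, the more complex the general sentence pattern and the more concrete meaning it carries; the smaller the pattern complexity, the simpler the general sentence pattern and the less concrete meaning it carries. In some implementations, the pattern complexity of each general sentence pattern may be obtained based on a formula that survives in the source only as image PCTCN2020084769-appb-000025, where n denotes the number of times the general sentence pattern is split and t denotes the number of characters in each separated segment of the pattern. For example, the pattern complexity of the general sentence pattern “(.+?)是(.+?)哪个国家的(.+?)” is given in image PCTCN2020084769-appb-000026.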
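The complexity formula itself is present only as an image and is not reconstructed here. The sketch below merely extracts the quantities that n and t appear to refer to, under the assumption that a pattern is "split" at each wildcard and that each "separated segment" is a literal stretch between wildcards; `pattern_segments` is a hypothetical helper, and whether n counts wildcards or segments is not stated in the text.

```python
WILDCARD = "(.+?)"

def pattern_segments(pattern):
    """Return (number of wildcards, character counts of the literal segments).
    How these values combine into a complexity score is NOT shown here,
    because the source formula is unavailable."""
    literal_segments = [piece for piece in pattern.split(WILDCARD) if piece]
    return pattern.count(WILDCARD), [len(piece) for piece in literal_segments]

print(pattern_segments("(.+?)是(.+?)哪个国家的(.+?)"))
```

For the example pattern this gives three wildcards and two literal segments of one and five characters (是 and 哪个国家的).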
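Step S240: filter the multiple general sentence patterns based on the pattern containment relationships among them and the pattern complexity of each of them, and select from them the general sentence patterns that meet a specified criterion as standard sentence patterns.
The pattern containment relationships among the general sentence patterns reflect the generalization ability of each pattern, and the pattern complexity of each pattern reflects its concrete meaning. Therefore, in this embodiment, after the containment relationships and the complexity of each pattern are obtained, the general sentence patterns may be filtered based on both, so as to select those that meet the specified criterion as standard sentence patterns. It can be understood that the selected patterns may, as required, have some generalization ability and some concrete meaning.
In some implementations, if the requirement is to select general sentence patterns with strong generalization ability and weak concrete meaning, the patterns may be filtered based on the containment relationships and the complexity of each pattern so as to select patterns with large sample coverage and low pattern complexity as standard sentence patterns.
In some implementations, if the requirement is to select general sentence patterns with weak generalization ability and strong concrete meaning, the patterns may be filtered based on the containment relationships and the complexity of each pattern so as to select patterns with small sample coverage and high pattern complexity as standard sentence patterns.
In some implementations, if the requirement is to select general sentence patterns that have some generalization ability and some concrete meaning, the patterns may be filtered based on the containment relationships and the complexity of each pattern so as to select, as standard sentence patterns, those whose containment relationships with the other general sentence patterns meet a first specified criterion and whose pattern complexity meets a second specified criterion. The first specified criterion may be set and stored in advance as the basis for judging the containment relationships between one general sentence pattern and the others. After the containment relationships between a general sentence pattern and the other patterns are obtained, they are compared with the first specified criterion to determine whether they meet it. The second specified criterion may be set and stored in advance as the basis for judging the pattern complexity of each general sentence pattern. After the pattern complexity of each general sentence pattern is obtained, it is compared with the second specified criterion to determine whether it meets that criterion.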
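Please refer to FIG. 4, which shows a schematic flowchart of step S240 of the sentence pattern mining method shown in FIG. 2. The flow shown in FIG. 4 is described in detail below. The method may specifically include the following steps:
Step S241: obtain the graph in-degree of each of the multiple general sentence patterns based on the pattern containment relationships among them.
In this embodiment, after the pattern containment relationships among the general sentence patterns are obtained, the graph in-degree of each general sentence pattern may be obtained based on those relationships. In some implementations, the graph in-degree is obtained by a formula that survives in the source only as image PCTCN2020084769-appb-000027. The graph in-degree (symbol in image appb-000028) reflects, to some extent, the generalization ability of the general sentence pattern. As shown in FIG. 3, the graph in-degree of one general sentence pattern (image appb-000029) is given in image appb-000030, and the graph in-degree of another general sentence pattern (image appb-000031) is given in image appb-000032, which indicates that the generalization ability of the pattern in image appb-000033 is stronger than that of the pattern in image appb-000034.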
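Step S242: filter the multiple general sentence patterns based on the graph in-degree of each general sentence pattern and the complexity of each general sentence pattern, and select from them the general sentence patterns that meet a specified criterion as standard sentence patterns.
The graph in-degree of each general sentence pattern reflects its generalization ability, and the pattern complexity of each general sentence pattern reflects its concrete meaning. Therefore, in this embodiment, after the graph in-degree and the pattern complexity of each general sentence pattern are obtained, the patterns may be filtered based on both, so as to select those that meet the specified criterion as standard sentence patterns. It can be understood that the selected patterns may, as required, have some generalization ability and some concrete meaning.
In some implementations, if the requirement is to select general sentence patterns with strong generalization ability and weak concrete meaning, the patterns may be filtered based on the graph in-degree and the pattern complexity of each pattern so as to select patterns with a large graph in-degree and low pattern complexity as standard sentence patterns.
In some implementations, if the requirement is to select general sentence patterns with weak generalization ability and strong concrete meaning, the patterns may be filtered based on the graph in-degree and the pattern complexity of each pattern so as to select patterns with a small graph in-degree and high pattern complexity as standard sentence patterns.
In some implementations, if the requirement is to select general sentence patterns that have some generalization ability and some concrete meaning, the patterns may be filtered based on the graph in-degree and the pattern complexity of each pattern so as to select, as standard sentence patterns, those whose graph in-degree meets a third specified criterion and whose pattern complexity meets the second specified criterion. The third specified criterion may be set and stored in advance as the basis for judging the graph in-degree of a general sentence pattern. After the graph in-degree of a general sentence pattern is obtained, it is compared with the third specified criterion to determine whether it meets that criterion.
In some implementations, a specified graph in-degree may be set and stored in advance as the basis for judging the graph in-degree of each general sentence pattern: when the graph in-degree of a general sentence pattern is greater than the specified graph in-degree, it is determined to meet the third specified criterion; when it is not greater than the specified graph in-degree, it is determined not to meet the third specified criterion. A specified complexity may likewise be set and stored in advance as the basis for judging the complexity of each general sentence pattern: when the complexity of a general sentence pattern is greater than the specified complexity, it is determined to meet the second specified criterion; when it is not greater than the specified complexity, it is determined not to meet the second specified criterion. Therefore, in this embodiment, based on the specified graph in-degree and the specified complexity, the general sentence patterns whose graph in-degree is greater than the specified graph in-degree and whose pattern complexity is greater than the specified complexity may be selected as standard sentence patterns, so that the resulting standard sentence patterns have some generalization ability and some concrete meaning.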
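The two-threshold filter can be sketched as follows. Several things here are assumptions made only for illustration: edges are taken to point from a covered pattern to the pattern that covers it (so a pattern's in-degree is the number of patterns it covers), the complexity values are made-up inputs because the source formula is unavailable, and `in_degrees` and `select_standard` are hypothetical names.

```python
from collections import Counter

def in_degrees(edges):
    """edges: (covered, covering) pairs; count the edges arriving at each covering pattern."""
    return Counter(covering for _covered, covering in edges)

def select_standard(patterns, edges, complexity, min_in_degree, min_complexity):
    """Keep patterns whose in-degree AND complexity both exceed their thresholds."""
    degree = in_degrees(edges)  # a Counter returns 0 for patterns with no incoming edge
    return [
        p for p in patterns
        if degree[p] > min_in_degree and complexity[p] > min_complexity
    ]

patterns = ["A", "B", "C"]
edges = [("B", "A"), ("C", "A"), ("C", "B")]   # A covers B and C; B covers C
complexity = {"A": 1.0, "B": 2.5, "C": 4.0}   # illustrative values only
print(select_standard(patterns, edges, complexity, min_in_degree=0, min_complexity=2.0))
```

Pattern A generalizes well but is too simple, and pattern C is concrete but covers nothing else, so only the middle pattern B passes both thresholds.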
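In the sentence pattern mining method provided by another embodiment of this application, multiple corpora to be mined are obtained; pairwise sequence alignment is performed on them to obtain multiple general sentence patterns; the pattern containment relationships among the general sentence patterns and the pattern complexity of each general sentence pattern are obtained; and the general sentence patterns are filtered based on the containment relationships and the complexity of each pattern to select those that meet the specified criterion as standard sentence patterns. Compared with the sentence pattern mining method shown in FIG. 1, this embodiment filters the general sentence patterns using the containment relationships and the pattern complexity, which improves the accuracy of the resulting standard sentence patterns.
Please refer to FIG. 5, which shows a schematic flowchart of a sentence pattern mining method provided by yet another embodiment of this application. The flow shown in FIG. 5 is described in detail below. The sentence pattern mining method may specifically include the following steps:
Step S310: obtain multiple corpora to be mined.
For details of step S310, refer to step S110; they are not repeated here.
Step S320: obtain the sequence type of each of the multiple corpora to be mined.
In some implementations, pairwise sequence alignment may include global alignment and local alignment. Global alignment aligns every residue of every sequence and is typically used when the sequence types are similar or the sequence lengths are roughly equal; in this embodiment, global alignment may be the dynamic-programming-based Needleman–Wunsch algorithm. Local alignment is better suited to cases where the sequence types are not very similar; in this embodiment, local alignment may be the Smith–Waterman algorithm.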
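The difference between the two alignment styles shows up directly in their dynamic-programming recurrences. The following is a minimal score-only sketch of both algorithms with a linear gap penalty; the scoring values (match 1, mismatch -1, gap -1) are illustrative assumptions, not parameters stated in the application.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score: both sequences must be aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]   # row 0: b aligned against gaps only
    for i in range(1, len(a) + 1):
        cur = [i * gap]                           # column 0: a aligned against gaps only
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Local alignment score: the best-scoring pair of substrings; never below 0."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(0, diag, prev[j] + gap, cur[j - 1] + gap))
            best = max(best, cur[j])
        prev = cur
    return best

print(needleman_wunsch_score("成都坐火车去北京要多久", "成都坐飞机去北京要多久"))
print(smith_waterman_score("栗胸白脸刺莺是居住在哪个国家的鸟", "阿尔文是哪个国家的城市"))
```

The only structural differences are the boundary rows (cumulative gap penalties versus zeros) and the extra `0` in the local recurrence, which lets an alignment restart anywhere and so isolates a shared fragment such as 哪个国家的 from otherwise dissimilar sentences.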
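In this embodiment, in order to choose the more suitable of global alignment and local alignment for the pairwise sequence alignment of the corpora to be mined, the sequence type of each corpus to be mined may be obtained.
Step S330: determine, based on the sequence type of each corpus to be mined, the processing mode for performing pairwise sequence alignment on the multiple corpora to be mined.
In this embodiment, after the sequence type of each corpus to be mined is obtained, the processing mode for pairwise sequence alignment of the corpora may be determined based on those sequence types. In some implementations, the processing mode is selected from global alignment and local alignment based on the sequence type of each corpus to be mined.
Please refer to FIG. 6, which shows a schematic flowchart of step S330 of the sentence pattern mining method shown in FIG. 5. The flow shown in FIG. 6 is described in detail below. The method may specifically include the following steps:
Step S331: obtain the sequence similarity among the multiple corpora to be mined based on the sequence type of each corpus to be mined.
In some implementations, after the sequence type of each corpus to be mined is obtained, the sequence similarity among the corpora may be obtained based on those sequence types. As one approach, the sequence types of the corpora may be matched against one another to obtain the sequence similarity among the corpora.
Step S332: determine, based on the sequence similarity among the multiple corpora to be mined, the processing mode for performing pairwise sequence alignment on the corpora from the global alignment and the local alignment.
In some implementations, after the sequence similarity among the corpora to be mined is obtained, the processing mode may be selected from global alignment and local alignment based on that similarity; that is, the similarity determines whether global alignment or local alignment is used as the processing mode for pairwise sequence alignment of the corpora.
Please refer to FIG. 7, which shows a schematic flowchart of step S332 of the sentence pattern mining method shown in FIG. 6. The flow shown in FIG. 7 is described in detail below. The method may specifically include the following steps:
Step S3321: when the sequence similarity among the multiple corpora to be mined is greater than a specified similarity, determine the global alignment as the processing mode for performing pairwise sequence alignment on the corpora.
Because global alignment aligns every residue of every sequence and is typically used when the sequence types are similar or the sequence lengths are roughly equal, in this embodiment, when the sequence similarity among the corpora to be mined is greater than the specified similarity, global alignment may be determined as the processing mode for pairwise sequence alignment of the corpora.
Step S3322: when the sequence similarity among the multiple corpora to be mined is not greater than the specified similarity, determine the local alignment as the processing mode for performing pairwise sequence alignment on the corpora.
Because local alignment is better suited to cases where the sequence types are not very similar, in this embodiment, when the sequence similarity among the corpora to be mined is not greater than the specified similarity, local alignment may be determined as the processing mode for pairwise sequence alignment of the corpora.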
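The threshold decision in steps S3321 and S3322 can be sketched as below. The application does not say how the sequence similarity is computed from the sequence types, so this example substitutes a stand-in: the mean pairwise `difflib.SequenceMatcher.ratio()` over the raw sentences. The threshold value 0.5 and the function names are likewise hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations

def mean_pairwise_similarity(corpus):
    """Average SequenceMatcher ratio over all sentence pairs (a stand-in measure)."""
    pairs = list(combinations(corpus, 2))
    if not pairs:
        return 1.0  # a single sentence is trivially similar to itself
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)

def choose_alignment_mode(corpus, specified_similarity=0.5):
    """Global alignment when similarity exceeds the threshold, otherwise local."""
    if mean_pairwise_similarity(corpus) > specified_similarity:
        return "global"
    return "local"

print(choose_alignment_mode(["成都坐火车去北京要多久", "成都坐飞机去北京要多久"]))
print(choose_alignment_mode(["栗胸白脸刺莺是居住在哪个国家的鸟", "阿尔文是哪个国家的城市"]))
```

The two travel questions differ in only two characters and are routed to global alignment, while the two "which country" questions share only a short fragment and are routed to local alignment.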
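Step S340: perform pairwise sequence alignment on the multiple corpora to be mined based on the processing mode to obtain multiple general sentence patterns corresponding to the corpora.
Step S350: filter the multiple general sentence patterns, and select from them the general sentence patterns that meet a specified criterion as standard sentence patterns.
For details of steps S340 and S350, refer to steps S120 and S130; they are not repeated here.
In the sentence pattern mining method provided by yet another embodiment of this application, multiple corpora to be mined are obtained; the sequence type of each corpus to be mined is obtained; the processing mode for pairwise sequence alignment of the corpora is determined based on those sequence types; pairwise sequence alignment is performed on the corpora based on that processing mode to obtain multiple corresponding general sentence patterns; and the general sentence patterns are filtered to select those that meet the specified criterion as standard sentence patterns. Compared with the sentence pattern mining method shown in FIG. 1, this embodiment determines which pairwise alignment mode to use based on the sequence type of each corpus to be mined, which improves the accuracy of the resulting general sentence patterns.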
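Please refer to FIG. 8, which shows a schematic flowchart of a sentence pattern mining method provided by a further embodiment of this application. The flow shown in FIG. 8 is described in detail below. The sentence pattern mining method may specifically include the following steps:
Step S410: obtain multiple corpora to be mined.
Step S420: perform pairwise sequence alignment on the multiple corpora to be mined to obtain multiple general sentence patterns corresponding to the corpora.
Step S430: filter the multiple general sentence patterns, and select from them the general sentence patterns that meet a specified criterion as standard sentence patterns.
For details of steps S410 to S430, refer to steps S110 to S130; they are not repeated here.
Step S440: output the standard sentence patterns.
In some implementations, after the standard sentence patterns are obtained, they may be output to serve downstream NLP tasks. On this basis, this embodiment may be used to assist intent recognition: high-frequency questions and phrasings are mined automatically from users' historical question-and-answer data, helping analysts and product managers quickly understand user intent and reducing labor costs. This embodiment may also be used to improve text classification models: in short-text classification tasks, some sentence patterns, combined with entity information, effectively handle entity-dependent texts and can be embedded in the model as prior or external knowledge. This embodiment may further be used to build answer templates for community question answering: in NLP question-answering tasks, users' high-frequency phrasings are discovered and answer template patterns are then prepared for them (in some sensitive vertical domains, such as financial customer service, answers to certain questions must follow a particular pattern); alternatively, the patterns of Q and A are mined from large-scale community (Q, A) pairs, and A is organized into an answer template for Q.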
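Please refer to FIG. 9, which shows a schematic flowchart of step S440 of the sentence pattern mining method shown in FIG. 8. The flow shown in FIG. 9 is described in detail below. The method may specifically include the following steps:
Step S441: when the standard sentence pattern is an interrogative pattern, obtain a standard reply pattern based on the standard sentence pattern.
In some implementations, the pattern format of the determined standard sentence pattern may be identified, where pattern formats may include declarative patterns, interrogative patterns, and so on. In this embodiment, when the standard sentence pattern is identified as an interrogative pattern, a standard reply pattern corresponding to it may be obtained based on the standard sentence pattern. One standard sentence pattern may correspond to one standard reply pattern or to multiple standard reply patterns; this is not limited here.
Step S442: output the standard sentence pattern and the standard reply pattern.
In some implementations, after the standard sentence pattern and the standard reply pattern are obtained, both may be output.
In the sentence pattern mining method provided by a further embodiment of this application, multiple corpora to be mined are obtained; pairwise sequence alignment is performed on them to obtain multiple corresponding general sentence patterns; the general sentence patterns are filtered to select those that meet the specified criterion as standard sentence patterns; and the standard sentence patterns are output. Compared with the sentence pattern mining method shown in FIG. 1, this embodiment also outputs the standard sentence patterns for use by the corresponding downstream tasks, which improves the accuracy of the downstream tasks' responses.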
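Please refer to FIG. 10, which shows a schematic flowchart of a sentence pattern mining method provided by still another embodiment of this application. The flow shown in FIG. 10 is described in detail below. The sentence pattern mining method may specifically include the following steps:
Step S510: obtain a training data set, the training data set including multiple corpora and multiple standard sentence patterns.
The embodiments of this application also include a method for training a sentence pattern mining model. The model may be trained in advance on the obtained training data set; thereafter, each time sentence pattern mining is performed, the mining may be carried out with the trained model, without retraining the model each time.
In some implementations, a training data set may be collected, where the training data set includes multiple corpora and multiple standard sentence patterns.
Step S520: based on the training data set, train with a machine learning algorithm, using each corpus as input data and each standard sentence pattern as output data, to obtain a trained sentence pattern mining model.
In the embodiments of this application, a machine learning algorithm may be used to train on the training data set and thereby obtain the sentence pattern mining model. The machine learning algorithms that may be used include: neural networks, long short-term memory (LSTM) networks, gated recurrent units, simple recurrent units, autoencoders, decision trees, random forests, feature mean classification, classification and regression trees, hidden Markov models, the k-nearest neighbor (KNN) algorithm, logistic regression models, Bayesian models, Gaussian models, KL divergence (Kullback–Leibler divergence), and so on. The specific machine learning algorithm is not limited.
Taking a neural network as an example, training an initial model on the training data set is described below.
The corpus in each group of data in the training data set serves as the input sample (input data) of the neural network, and the standard sentence pattern in that group serves as the output sample (output data). The neurons of the input layer are fully connected to the neurons of the hidden layer, and the neurons of the hidden layer are fully connected to the neurons of the output layer, so that latent features of different granularities can be extracted effectively. There may be multiple hidden layers, which allows nonlinear relationships to be fitted better and makes the trained sentence pattern mining model more accurate.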
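The fully connected structure described above can be illustrated with a forward pass in plain Python. This sketch covers only the layer wiring: how a corpus is encoded as numbers, how a standard sentence pattern is decoded from the output, and how the weights are trained are not specified in the application and are outside the sketch. The ReLU activation, the tiny weights, and the function names are assumptions.

```python
def dense(inputs, weights, biases):
    """Fully connected layer: every input value feeds every output neuron."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]

def relu(values):
    """Nonlinearity applied between layers (an assumed choice)."""
    return [max(0.0, v) for v in values]

def forward(x, layers):
    """layers: list of (weights, biases); ReLU follows every layer except the last."""
    for index, (weights, biases) in enumerate(layers):
        x = dense(x, weights, biases)
        if index < len(layers) - 1:
            x = relu(x)
    return x

# Two inputs -> two hidden neurons -> one output, with hand-picked weights.
layers = [
    ([[1.0, -1.0], [0.5, 0.5]], [0.0, 0.0]),
    ([[1.0, 1.0]], [0.1]),
]
print(forward([2.0, 1.0], layers))
```

Adding more `(weights, biases)` entries to `layers` corresponds to the multiple hidden layers mentioned in the text.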
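It can be understood that the training of the sentence pattern mining model may or may not be performed by the electronic device. When the training is not performed by the electronic device, the electronic device may act only as a direct user of the model, or as an indirect user.
In some implementations, new training data may be obtained periodically or irregularly to train and update the sentence pattern mining model.
Step S530: obtain multiple corpora to be mined.
Step S540: perform pairwise sequence alignment on the multiple corpora to be mined to obtain multiple general sentence patterns corresponding to the corpora.
Step S550: filter the multiple general sentence patterns, and select from them the general sentence patterns that meet a specified criterion as standard sentence patterns.
For details of steps S530 to S550, refer to steps S110 to S130; they are not repeated here.
In the sentence pattern mining method provided by still another embodiment of this application, a training data set including multiple corpora and multiple standard sentence patterns is obtained; based on the training data set, training is performed with a machine learning algorithm, using each corpus as input data and each standard sentence pattern as output data, to obtain a trained sentence pattern mining model; multiple corpora to be mined are obtained; pairwise sequence alignment is performed on them to obtain multiple corresponding general sentence patterns; and the general sentence patterns are filtered to select those that meet the specified criterion as standard sentence patterns. Compared with the sentence pattern mining method shown in FIG. 1, this embodiment also collects a training data set and trains a sentence pattern mining model for mining standard sentence patterns from corpora, which improves the accuracy of the resulting standard sentence patterns.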
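Please refer to FIG. 11, which shows a block diagram of a sentence pattern mining apparatus 200 provided by an embodiment of this application. The block diagram shown in FIG. 11 is described below. The sentence pattern mining apparatus 200 includes: a corpus obtaining module 210, a general sentence pattern obtaining module 220, and a standard sentence pattern obtaining module 230, wherein:
The corpus obtaining module 210 is configured to obtain multiple corpora to be mined.
The general sentence pattern obtaining module 220 is configured to perform pairwise sequence alignment on the multiple corpora to be mined to obtain multiple general sentence patterns corresponding to the corpora.
Further, the general sentence pattern obtaining module 220 includes: a sequence type obtaining submodule, a processing mode determining submodule, and a general sentence pattern obtaining submodule, wherein:
The sequence type obtaining submodule is configured to obtain the sequence type of each of the multiple corpora to be mined.
The processing mode determining submodule is configured to determine, based on the sequence type of each corpus to be mined, the processing mode for performing pairwise sequence alignment on the multiple corpora to be mined.
Further, the processing mode determining submodule includes: a processing mode determining unit, wherein:
The processing mode determining unit is configured to determine, based on the sequence type of each corpus to be mined, the processing mode for performing pairwise sequence alignment on the multiple corpora to be mined from global alignment and local alignment.
Further, the processing mode determining unit includes: a sequence similarity obtaining subunit and a processing mode determining subunit, wherein:
The sequence similarity obtaining subunit is configured to obtain the sequence similarity among the multiple corpora to be mined based on the sequence type of each corpus to be mined.
The processing mode determining subunit is configured to determine, based on the sequence similarity among the multiple corpora to be mined, the processing mode for performing pairwise sequence alignment on the corpora from the global alignment and the local alignment.
Further, the processing mode determining subunit includes: a first processing mode determining sub-subunit and a second processing mode determining sub-subunit, wherein:
The first processing mode determining sub-subunit is configured to determine the global alignment as the processing mode for performing pairwise sequence alignment on the multiple corpora to be mined when the sequence similarity among the corpora is greater than a specified similarity.
The second processing mode determining sub-subunit is configured to determine the local alignment as the processing mode for performing pairwise sequence alignment on the multiple corpora to be mined when the sequence similarity among the corpora is not greater than the specified similarity.
The general sentence pattern obtaining submodule is configured to perform pairwise sequence alignment on the multiple corpora to be mined based on the processing mode to obtain multiple general sentence patterns corresponding to the corpora.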
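The standard sentence pattern obtaining module 230 is configured to filter the multiple general sentence patterns and select from them the general sentence patterns that meet a specified criterion as standard sentence patterns.
Further, the standard sentence pattern obtaining module 230 includes: an information obtaining submodule and a standard sentence pattern obtaining submodule, wherein:
The information obtaining submodule is configured to obtain the pattern containment relationships among the multiple general sentence patterns and to obtain the pattern complexity of each of the general sentence patterns.
Further, the information obtaining submodule includes: a pattern complexity obtaining unit, wherein:
The pattern complexity obtaining unit is configured to obtain, based on
Figure PCTCN2020084769-appb-000035
the pattern complexity of each of the multiple general sentence patterns, where n denotes the number of times the general sentence pattern is split and t denotes the number of characters in each separated segment of the general sentence pattern.
The standard sentence pattern obtaining submodule is configured to filter the multiple general sentence patterns based on the pattern containment relationships among them and the pattern complexity of each of them, and to select from them the general sentence patterns that meet a specified criterion as standard sentence patterns.
Further, the standard sentence pattern obtaining submodule includes: a first standard sentence pattern obtaining unit, wherein:
The first standard sentence pattern obtaining unit is configured to select from the multiple general sentence patterns, as standard sentence patterns, those whose containment relationships with the other general sentence patterns meet a first specified criterion and whose pattern complexity meets a second specified criterion.
Further, the standard sentence pattern obtaining submodule includes: a graph in-degree obtaining unit and a second standard sentence pattern obtaining unit, wherein:
The graph in-degree obtaining unit is configured to obtain the graph in-degree of each of the multiple general sentence patterns based on the pattern containment relationships among them.
The second standard sentence pattern obtaining unit is configured to filter the multiple general sentence patterns based on the graph in-degree of each general sentence pattern and the complexity of each general sentence pattern, and to select from them the general sentence patterns that meet a specified criterion as standard sentence patterns.
Further, the second standard sentence pattern obtaining unit includes: a standard sentence pattern obtaining subunit, wherein:
The standard sentence pattern obtaining subunit is configured to select from the multiple general sentence patterns, as standard sentence patterns, those whose graph in-degree meets a third specified criterion and whose pattern complexity meets the second specified criterion.
Further, the standard sentence pattern obtaining subunit includes: a standard sentence pattern obtaining sub-subunit, wherein:
The standard sentence pattern obtaining sub-subunit is configured to select from the multiple general sentence patterns, as standard sentence patterns, those whose graph in-degree is greater than a specified graph in-degree and whose pattern complexity is greater than a specified complexity.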
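Further, the sentence pattern mining apparatus 200 also includes: a standard sentence pattern output module, wherein:
The standard sentence pattern output module is configured to output the standard sentence pattern.
Further, the standard sentence pattern output module includes: a standard reply pattern obtaining submodule and a standard sentence pattern output submodule, wherein:
The standard reply pattern obtaining submodule is configured to obtain a standard reply pattern based on the standard sentence pattern when the standard sentence pattern is an interrogative pattern.
The standard sentence pattern output submodule is configured to output the standard sentence pattern and the standard reply pattern.
Further, the sentence pattern mining apparatus 200 also includes: a training data set obtaining module and a sentence pattern mining model training module, wherein:
The training data set obtaining module is configured to obtain a training data set, the training data set including multiple corpora and multiple standard sentence patterns.
The sentence pattern mining model training module is configured to train, based on the training data set, with a machine learning algorithm, using each corpus as input data and each standard sentence pattern as output data, to obtain a trained sentence pattern mining model.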
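Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the apparatus and modules described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
In the several embodiments provided in this application, the coupling between modules may be electrical, mechanical, or of another form.
In addition, the functional modules in the embodiments of this application may be integrated into one processing module, may each exist physically on their own, or two or more of them may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module.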
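Please refer to FIG. 12, which shows a structural block diagram of an electronic device 100 provided by an embodiment of this application. The electronic device 100 may be a smartphone, a tablet computer, an e-book reader, or another electronic device capable of running applications. The electronic device 100 in this application may include one or more of the following components: a processor 110, a memory 120, and one or more applications, where the one or more applications may be stored in the memory 120 and configured to be executed by the one or more processors 110, the one or more applications being configured to perform the methods described in the foregoing method embodiments.
The processor 110 may include one or more processing cores. The processor 110 connects the various parts of the electronic device 100 through various interfaces and lines, and performs the functions of the electronic device 100 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 120 and by invoking data stored in the memory 120. Optionally, the processor 110 may be implemented in at least one hardware form among digital signal processing (DSP), field-programmable gate array (FPGA), and programmable logic array (PLA). The processor 110 may integrate one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, and applications; the GPU renders and draws the content to be displayed; the modem handles wireless communication. It can be understood that the modem may also not be integrated into the processor 110 and may instead be implemented by a separate communication chip.
The memory 120 may include random access memory (RAM) and may also include read-only memory (ROM). The memory 120 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 120 may include a program storage area and a data storage area. The program storage area may store instructions for implementing the operating system, instructions for implementing at least one function (such as a touch function, a sound playback function, or an image playback function), instructions for implementing the method embodiments of this application, and so on. The data storage area may store data created by the electronic device 100 during use (such as a phone book, audio and video data, and chat records).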
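Please refer to FIG. 13, which shows a structural block diagram of a computer-readable storage medium provided by an embodiment of this application. The computer-readable storage medium 300 stores program code that can be invoked by a processor to perform the methods described in the foregoing method embodiments.
The computer-readable storage medium 300 may be an electronic memory such as flash memory, EEPROM (electrically erasable programmable read-only memory), EPROM, a hard disk, or ROM. Optionally, the computer-readable storage medium 300 includes a non-transitory computer-readable storage medium. The computer-readable storage medium 300 has storage space for program code 310 that performs any of the method steps described above. The program code can be read from or written to one or more computer program products. The program code 310 may, for example, be compressed in a suitable form.
In summary, in the sentence pattern mining method, apparatus, electronic device, and storage medium provided by the embodiments of this application, multiple corpora to be mined are obtained; pairwise sequence alignment is performed on them to obtain multiple corresponding general sentence patterns; and the general sentence patterns are filtered to select those that meet the specified criterion as standard sentence patterns. General sentence patterns are thus obtained by pairwise sequence alignment and then filtered into standard sentence patterns, so that standard sentence patterns can be obtained from the corpora quickly and conveniently for further processing.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.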

Claims (20)

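  1. A sentence pattern mining method, wherein the method comprises:
    obtaining multiple corpora to be mined;
    performing pairwise sequence alignment on the multiple corpora to be mined to obtain multiple general sentence patterns corresponding to the multiple corpora to be mined;
    filtering the multiple general sentence patterns, and selecting from the multiple general sentence patterns a general sentence pattern that meets a specified criterion as a standard sentence pattern.
  2. The method according to claim 1, wherein the filtering the multiple general sentence patterns, and selecting from the multiple general sentence patterns a general sentence pattern that meets a specified criterion as a standard sentence pattern, comprises:
    obtaining pattern containment relationships among the multiple general sentence patterns, and obtaining a pattern complexity of each of the multiple general sentence patterns;
    filtering the multiple general sentence patterns based on the pattern containment relationships among the multiple general sentence patterns and the pattern complexity of each general sentence pattern, and selecting from the multiple general sentence patterns a general sentence pattern that meets the specified criterion as the standard sentence pattern.
  3. The method according to claim 2, wherein the selecting from the multiple general sentence patterns a general sentence pattern that meets the specified criterion as the standard sentence pattern comprises:
    selecting from the multiple general sentence patterns, as the standard sentence pattern, a general sentence pattern whose pattern containment relationships with the other general sentence patterns meet a first specified criterion and whose pattern complexity meets a second specified criterion.
  4. The method according to claim 2, wherein the filtering the multiple general sentence patterns based on the pattern containment relationships among the multiple general sentence patterns and the pattern complexity of each general sentence pattern, and selecting from the multiple general sentence patterns a general sentence pattern that meets the specified criterion as the standard sentence pattern, comprises:
    obtaining a graph in-degree of each of the multiple general sentence patterns based on the pattern containment relationships among the multiple general sentence patterns;
    filtering the multiple general sentence patterns based on the graph in-degree of each general sentence pattern and the complexity of each general sentence pattern, and selecting from the multiple general sentence patterns a general sentence pattern that meets the specified criterion as the standard sentence pattern.
  5. The method according to claim 4, wherein the selecting from the multiple general sentence patterns a general sentence pattern that meets the specified criterion as the standard sentence pattern comprises:
    selecting from the multiple general sentence patterns, as the standard sentence pattern, a general sentence pattern whose graph in-degree meets a third specified criterion and whose pattern complexity meets a second specified criterion.
  6. The method according to claim 5, wherein the selecting from the multiple general sentence patterns, as the standard sentence pattern, a general sentence pattern whose graph in-degree meets the third specified criterion and whose pattern complexity meets the second specified criterion comprises:
    selecting from the multiple general sentence patterns, as the standard sentence pattern, a general sentence pattern whose graph in-degree is greater than a specified graph in-degree and whose pattern complexity is greater than a specified complexity.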
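  7. The method according to any one of claims 2-6, wherein the obtaining a pattern complexity of each of the multiple general sentence patterns comprises:
    obtaining, based on
    Figure PCTCN2020084769-appb-100001
    the pattern complexity of each of the multiple general sentence patterns, where n denotes the number of times the general sentence pattern is split and t denotes the number of characters in each separated segment of the general sentence pattern.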
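  8. The method according to any one of claims 1-7, wherein the performing pairwise sequence alignment on the multiple corpora to be mined to obtain multiple general sentence patterns corresponding to the multiple corpora to be mined comprises:
    obtaining a sequence type of each of the multiple corpora to be mined;
    determining, based on the sequence type of each corpus to be mined, a processing mode for performing pairwise sequence alignment on the multiple corpora to be mined;
    performing pairwise sequence alignment on the multiple corpora to be mined based on the processing mode to obtain the multiple general sentence patterns corresponding to the multiple corpora to be mined.
  9. The method according to claim 8, wherein the determining, based on the sequence type of each corpus to be mined, a processing mode for performing pairwise sequence alignment on the multiple corpora to be mined comprises:
    determining, based on the sequence type of each corpus to be mined, the processing mode for performing pairwise sequence alignment on the multiple corpora to be mined from global alignment and local alignment.
  10. The method according to claim 9, wherein the determining, based on the sequence type of each corpus to be mined, the processing mode for performing pairwise sequence alignment on the multiple corpora to be mined from global alignment and local alignment comprises:
    obtaining a sequence similarity among the multiple corpora to be mined based on the sequence type of each corpus to be mined;
    determining, based on the sequence similarity among the multiple corpora to be mined, the processing mode for performing pairwise sequence alignment on the multiple corpora to be mined from the global alignment and the local alignment.
  11. The method according to claim 10, wherein the determining, based on the sequence similarity among the multiple corpora to be mined, the processing mode for performing pairwise sequence alignment on the multiple corpora to be mined from the global alignment and the local alignment comprises:
    when the sequence similarity among the multiple corpora to be mined is greater than a specified similarity, determining the global alignment as the processing mode for performing pairwise sequence alignment on the multiple corpora to be mined;
    when the sequence similarity among the multiple corpora to be mined is not greater than the specified similarity, determining the local alignment as the processing mode for performing pairwise sequence alignment on the multiple corpora to be mined.
  12. The method according to any one of claims 9-11, wherein the global alignment comprises the Needleman–Wunsch algorithm.
  13. The method according to any one of claims 9-12, wherein the local alignment comprises the Smith–Waterman algorithm.
  14. The method according to any one of claims 1-13, wherein the filtering the multiple general sentence patterns, and selecting from the multiple general sentence patterns a general sentence pattern that meets a specified criterion as a standard sentence pattern, comprises:
    outputting the standard sentence pattern.
  15. The method according to claim 14, wherein the outputting the standard sentence pattern comprises:
    when the standard sentence pattern is an interrogative pattern, obtaining a standard reply pattern based on the standard sentence pattern;
    outputting the standard sentence pattern and the standard reply pattern.
  16. The method according to any one of claims 1-15, wherein, before the obtaining multiple corpora to be mined, the method further comprises:
    obtaining a training data set, the training data set comprising multiple corpora and multiple standard sentence patterns;
    training, based on the training data set, with a machine learning algorithm, using each corpus as input data and each standard sentence pattern as output data, to obtain a trained sentence pattern mining model.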
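  17. A sentence pattern mining apparatus, wherein the apparatus comprises:
    a corpus obtaining module, configured to obtain multiple corpora to be mined;
    a general sentence pattern obtaining module, configured to perform pairwise sequence alignment on the multiple corpora to be mined to obtain multiple general sentence patterns corresponding to the multiple corpora to be mined;
    a standard sentence pattern obtaining module, configured to filter the multiple general sentence patterns and select from the multiple general sentence patterns a general sentence pattern that meets a specified criterion as a standard sentence pattern.
  18. The apparatus according to claim 17, wherein the standard sentence pattern obtaining module comprises:
    an information obtaining submodule, configured to obtain pattern containment relationships among the multiple general sentence patterns and to obtain a pattern complexity of each of the multiple general sentence patterns;
    a standard sentence pattern obtaining submodule, configured to filter the multiple general sentence patterns based on the pattern containment relationships among the multiple general sentence patterns and the pattern complexity of each general sentence pattern, and to select from the multiple general sentence patterns a general sentence pattern that meets the specified criterion as the standard sentence pattern.
  19. An electronic device, comprising a memory and a processor, wherein the memory is coupled to the processor and stores instructions, and when the instructions are executed by the processor, the processor performs the method according to any one of claims 1-16.
  20. A computer-readable storage medium, wherein the computer-readable storage medium stores program code, and the program code can be invoked by a processor to perform the method according to any one of claims 1-16.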
PCT/CN2020/084769 2020-04-14 2020-04-14 Sentence pattern mining method, apparatus, electronic device, and storage medium WO2021207939A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2020/084769 WO2021207939A1 (zh) 2020-04-14 2020-04-14 Sentence pattern mining method, apparatus, electronic device, and storage medium
CN202080094177.6A CN115039105A (zh) 2020-04-14 2020-04-14 Sentence pattern mining method, apparatus, electronic device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/084769 WO2021207939A1 (zh) 2020-04-14 2020-04-14 Sentence pattern mining method, apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2021207939A1 true WO2021207939A1 (zh) 2021-10-21

Family

ID=78083707

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/084769 WO2021207939A1 (zh) 2020-04-14 2020-04-14 Sentence pattern mining method, apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN115039105A (zh)
WO (1) WO2021207939A1 (zh)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
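CN101221558A (zh) * 2008-01-22 2008-07-16 安徽科大讯飞信息科技股份有限公司 Method for automatically extracting sentence templates
CN106649294A (zh) * 2016-12-29 2017-05-10 北京奇虎科技有限公司 Training of a classification model, and method and apparatus for clause recognition using it
CN107038163A (zh) * 2016-02-03 2017-08-11 常州普适信息科技有限公司 Text semantic modeling method for massive Internet information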

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
REGINA BARZILAY, LILLIAN LEE: "Learning to paraphrase", PROCEEDINGS OF THE 2003 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS ON HUMAN LANGUAGE TECHNOLOGY , NAACL '03, ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, MORRISTOWN, NJ, USA, vol. 1, 1 January 2003 (2003-01-01) - 1 June 2003 (2003-06-01), Morristown, NJ, USA , pages 16 - 23, XP055158852, DOI: 10.3115/1073445.1073448 *

Also Published As

Publication number Publication date
CN115039105A (zh) 2022-09-09

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20930781

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205 DATED 13/03/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 20930781

Country of ref document: EP

Kind code of ref document: A1