CN115039105A - Sentence pattern mining method, sentence pattern mining device, electronic equipment and storage medium - Google Patents


Info

Publication number: CN115039105A
Application number: CN202080094177.6A
Authority: CN (China)
Prior art keywords: sentence, general, standard, mined, sentence pattern
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 李森林
Current assignee: Shenzhen Hefei Technology Co., Ltd.
Original assignee: Shenzhen Huantai Digital Technology Co., Ltd.
Application filed by Shenzhen Huantai Digital Technology Co., Ltd.

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/253: Grammatical analysis; Style critique


Abstract

A sentence pattern mining method, a sentence pattern mining device, an electronic device, and a storage medium relate to the technical field of electronic devices. The method comprises the following steps: obtaining a plurality of corpora to be mined (S110); performing pairwise sequence alignment on the plurality of corpora to be mined to obtain a plurality of general sentence patterns corresponding to the plurality of corpora to be mined (S120); and filtering the plurality of general sentence patterns and screening out general sentence patterns meeting a specified criterion from the plurality of general sentence patterns as standard sentence patterns (S130). The method obtains general sentence patterns by performing pairwise sequence alignment on the corpora to be mined and then filters the general sentence patterns to obtain standard sentence patterns, so that standard sentence patterns can be quickly and conveniently obtained from the corpora to be mined for processing.

Description

Sentence pattern mining method, sentence pattern mining device, electronic equipment and storage medium
Technical Field
The present application relates to the field of electronic device technologies, and in particular, to a sentence pattern mining method and apparatus, an electronic device, and a storage medium.
Background
In actual Internet business, a large amount of formatted information is often encountered, and how to process such structured information effectively through general sentence pattern mining has become one of the directions of attention for many natural language processing researchers.
Disclosure of Invention
In view of the above problems, the present application provides a sentence pattern mining method and apparatus, an electronic device, and a storage medium to address them.
In a first aspect, an embodiment of the present application provides a sentence pattern mining method, where the method includes: acquiring a plurality of corpora to be mined; performing pairwise sequence alignment on the plurality of corpora to be mined to obtain a plurality of general sentence patterns corresponding to the plurality of corpora to be mined; and filtering the plurality of general sentence patterns, and screening out general sentence patterns meeting a specified criterion from the plurality of general sentence patterns as standard sentence patterns.
In a second aspect, an embodiment of the present application provides a sentence pattern mining apparatus, where the apparatus includes: a corpus acquiring module, configured to acquire a plurality of corpora to be mined; a general sentence pattern obtaining module, configured to perform pairwise sequence alignment on the plurality of corpora to be mined to obtain a plurality of general sentence patterns corresponding to the plurality of corpora to be mined; and a standard sentence pattern obtaining module, configured to filter the plurality of general sentence patterns and screen out general sentence patterns meeting a specified criterion from the plurality of general sentence patterns as standard sentence patterns.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory and a processor, where the memory is coupled to the processor, and the memory stores instructions, and when the instructions are executed by the processor, the processor executes the method described above.
In a fourth aspect, the present application provides a computer-readable storage medium, in which a program code is stored, and the program code can be called by a processor to execute the above method.
According to the sentence pattern mining method and apparatus, the electronic device, and the storage medium provided by the embodiments of the present application, a plurality of corpora to be mined are obtained; pairwise sequence alignment is performed on the plurality of corpora to be mined to obtain a plurality of corresponding general sentence patterns; and the plurality of general sentence patterns are filtered so that general sentence patterns meeting a specified criterion are screened out as standard sentence patterns. By obtaining general sentence patterns through pairwise sequence alignment of the corpora to be mined and then filtering them into standard sentence patterns, standard sentence patterns can be quickly and conveniently obtained from the corpora to be mined for processing.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and that those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic flow chart diagram illustrating a sentence mining method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart diagram illustrating a sentence mining method according to yet another embodiment of the present application;
FIG. 3 is a diagram illustrating a sentence containment relationship between a plurality of general sentences provided in an embodiment of the present application;
FIG. 4 is a flow chart diagram illustrating step S240 of the sentence mining method illustrated in FIG. 2 of the present application;
FIG. 5 is a schematic flow chart diagram illustrating a sentence mining method according to yet another embodiment of the present application;
FIG. 6 is a flowchart illustrating step S330 of the sentence mining method illustrated in FIG. 5 of the present application;
FIG. 7 is a flowchart illustrating step S332 of the sentence mining method illustrated in FIG. 6 of the present application;
FIG. 8 is a schematic flow chart diagram illustrating a sentence mining method according to another embodiment of the present application;
FIG. 9 is a flowchart illustrating step S440 of the sentence mining method illustrated in FIG. 8 of the present application;
FIG. 10 is a schematic flow chart diagram illustrating a sentence mining method according to yet another embodiment of the present application;
FIG. 11 is a block diagram illustrating a sentence mining apparatus according to an embodiment of the present application;
FIG. 12 is a block diagram illustrating an electronic device for performing a sentence mining method according to an embodiment of the present application;
FIG. 13 illustrates a storage unit for storing or carrying program code implementing the sentence pattern mining method according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
In recent years, with the rapid development of Artificial Intelligence (AI) technologies, more and more application scenarios, such as Computer Vision (CV) and Natural Language Processing (NLP), have been implemented, greatly improving people's daily lives. In particular, the enthusiasm of researchers has driven the development of related language models, such as the Transformer model based on a pure attention mechanism and the BERT (Bidirectional Encoder Representations from Transformers) model built on the Transformer. In actual Internet services, a large amount of user-formatted information is frequently encountered, and how to process such structured information effectively through general sentence pattern mining, so as to facilitate the analysis of corresponding NLP downstream tasks (such as intelligent customer service, community question answering, and short text classification), is one of the directions to which many NLP researchers pay attention.
Generally, current sentence pattern mining methods fall into the following two categories:
(1) Manually mined regular expressions: formatted data is analyzed manually to find the general format of the related sentence patterns, and regular expressions are generated for downstream NLP tasks.
(2) Large-scale language model based approaches: a large corpus is used for training, and embedded representations of the relevant fixed sentence patterns are obtained by training a large-scale language model (such as BERT).
The inventor found that, for manually mined regular expressions, accuracy can be guaranteed by manually discovering and organizing the regular expressions of related sentence patterns, but the data in intelligent customer service and community question answering scenarios follow a long-tailed distribution, so many special sentence patterns cannot be mined effectively, and given the huge data volume the process is time-consuming and labor-intensive. For large-scale language models, the field types of some sentence patterns in short text classification scenarios depend only on the entity part, such as "what is [entity]", "who is [entity]", and "which kinds of [entity]"; neural-network-based language models cannot handle the classification of such questions well, so it is desirable to mine the related sentence patterns and handle such questions by combining the sentence patterns with [entity] verification. Moreover, experiments with neural-network-based language models are extremely costly and have long computation cycles, making them unsuitable for small and medium-sized enterprises that have large amounts of corpus data and hope to iterate and deploy quickly.
In view of the above problems, the inventor, through long-term research, proposes the sentence pattern mining method and apparatus, electronic device, and storage medium provided in the embodiments of the present application, which obtain general sentence patterns by performing pairwise sequence alignment on the corpora to be mined and then filter the general sentence patterns to obtain standard sentence patterns, so that standard sentence patterns can be quickly and conveniently obtained from the corpora to be mined for processing. The specific sentence pattern mining method is explained in detail in the following embodiments.
Referring to FIG. 1, FIG. 1 is a schematic flowchart illustrating a sentence pattern mining method according to an embodiment of the present application. The sentence pattern mining method obtains general sentence patterns by performing pairwise sequence alignment on the corpora to be mined and then filters the general sentence patterns to obtain standard sentence patterns, so that standard sentence patterns can be quickly and conveniently obtained from the corpora to be mined for processing. In this specific embodiment, the sentence pattern mining method is applied to the sentence pattern mining apparatus 200 shown in FIG. 11 and to the electronic device 100 (FIG. 12) equipped with the sentence pattern mining apparatus 200. The following describes the specific process of this embodiment taking an electronic device as an example; the electronic device applied in this embodiment may include a mobile terminal, a tablet computer, a desktop computer, a wearable electronic device, and the like, which is not limited herein. As explained in detail with respect to the flow shown in FIG. 1, the sentence pattern mining method may specifically include the following steps:
step S110: and acquiring a plurality of linguistic data to be excavated.
In this embodiment, a plurality of corpora to be mined may be obtained. In some embodiments, the plurality of corpora to be mined may all be obtained from community question answering, may all be obtained from short texts, or may be obtained partly from community question answering and partly from short texts, and the like, which is not limited herein.
In some embodiments, the plurality of corpora to be mined may be obtained from a server, for example, from community question answering or short texts recorded on the server, or from other electronic devices, for example, from community question answering or short texts recorded on those devices. When the plurality of corpora to be mined are obtained from a server or from other electronic devices, they may be obtained through a wireless network or a data network.
In some embodiments, taking obtaining the plurality of corpora to be mined from community question answering as an example, "the chestnut white face stinger is a bird living in which country" may be obtained from community question answering as a corpus to be mined, "Alvine is a city in which country" may be obtained from community question answering as a corpus to be mined, and the like, which is not limited herein.
Step S120: perform pairwise sequence alignment on the plurality of corpora to be mined to obtain a plurality of general sentence patterns corresponding to the plurality of corpora to be mined.
In this embodiment, after the plurality of corpora to be mined are obtained, pairwise sequence alignment (pairwise alignment) may be performed on them to obtain a plurality of general sentence patterns corresponding to the plurality of corpora to be mined. Pairwise sequence alignment is a research method in bioinformatics in which a targeted, efficient algorithm is designed to compare two DNA or protein sequences, find the maximum similarity match between them, and thereby judge whether the two sequences are homologous. In this embodiment, the pairwise sequence alignment method is used to process the plurality of corpora to be mined to obtain the maximum similarity matching sentence patterns among them, that is, the plurality of general sentence patterns corresponding to the plurality of corpora to be mined. By introducing the pairwise sequence alignment algorithm from bioinformatics and transferring it to sentence pattern learning, matching can be performed in byte units, avoiding the errors caused by semantic segmentation mistakes and spelling errors in conventional word segmentation methods. In some embodiments, after the plurality of corpora to be mined are obtained, pairwise sequence alignment may be performed on pairs of the corpora to obtain the plurality of general sentence patterns corresponding to them.
For example, taking the case where the plurality of corpora to be mined includes "the chestnut white face stinger is a bird living in which country" and "Alvine is a city in which country", performing pairwise sequence alignment on the two corpora yields the general sentence pattern "(.+?) is a (.+?) in which country". For another example, taking the case where the plurality of corpora to be mined includes "how long does the train from Chengdu to Beijing take" and "how long does the plane from Chengdu to Beijing take", performing pairwise sequence alignment yields the general sentence pattern "how long does the (.+?) from Chengdu to Beijing take".
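As a rough sketch of how pairwise alignment can yield a general sentence pattern, the following collapses every differing run between two corpora into a "(.+?)" wildcard. This is a simplified word-level version (the embodiment itself matches in byte units to avoid segmentation errors), and the function name, wildcard token, and example sentences are illustrative assumptions, not the patent's implementation.

```python
import difflib

def mine_general_pattern(a: str, b: str, wildcard: str = "(.+?)") -> str:
    """Align two corpora to be mined and keep their shared runs;
    every differing run is collapsed into a single wildcard."""
    tokens_a, tokens_b = a.split(), b.split()
    matcher = difflib.SequenceMatcher(None, tokens_a, tokens_b)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            parts.extend(tokens_a[i1:i2])
        elif not parts or parts[-1] != wildcard:
            parts.append(wildcard)  # merge consecutive differing runs into one wildcard
    return " ".join(parts)
```

For instance, aligning "how long does the train from Chengdu to Beijing take" with "how long does the plane from Chengdu to Beijing take" yields "how long does the (.+?) from Chengdu to Beijing take".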
Step S130: filter the plurality of general sentence patterns, and screen out general sentence patterns meeting a specified criterion from the plurality of general sentence patterns as standard sentence patterns.
Performing pairwise sequence alignment on a plurality of corpora to be mined generally extracts a large number of general sentence patterns, so a quantitative mechanism can be adopted to mine sentence patterns that carry a certain concrete meaning and a certain generalization ability. In this embodiment, after the plurality of general sentence patterns are obtained, they may be filtered to screen out, as standard sentence patterns, the general sentence patterns meeting a specified criterion, where a general sentence pattern meeting the specified criterion may refer to a sentence pattern having a certain concrete meaning and a certain generalization ability. Measuring the degree of generalization and the concrete meaning of the standard sentence patterns with quantified indices makes the standard sentence patterns mined from the plurality of corpora to be mined more accurate.
In some embodiments, a general sentence pattern filtering rule may be preset and stored. After the plurality of general sentence patterns corresponding to the plurality of corpora to be mined are obtained, they may be filtered based on the general sentence pattern filtering rule to screen out, as standard sentence patterns, the general sentence patterns meeting the specified criterion. As one approach, after the plurality of general sentence patterns are obtained, whether each general sentence pattern satisfies the filtering rule may be determined in turn, and the general sentence patterns meeting the specified criterion are screened out as standard sentence patterns according to the determination results.
According to the sentence pattern mining method provided by this embodiment of the present application, a plurality of corpora to be mined are obtained; pairwise sequence alignment is performed on them to obtain a plurality of corresponding general sentence patterns; and the plurality of general sentence patterns are filtered so that those meeting a specified criterion are screened out as standard sentence patterns. In this way, general sentence patterns are obtained by pairwise sequence alignment of the corpora to be mined and then filtered into standard sentence patterns, so that standard sentence patterns are quickly and conveniently obtained from the corpora to be mined for processing.
Referring to fig. 2, fig. 2 is a flow chart illustrating a sentence mining method according to another embodiment of the present application. As will be explained in detail with respect to the flow shown in fig. 2, the sentence mining method may specifically include the following steps:
step S210: and acquiring a plurality of linguistic data to be excavated.
Step S220: and carrying out double-sequence comparison on the plurality of linguistic data to be excavated to obtain a plurality of general sentence patterns corresponding to the plurality of linguistic data to be excavated.
For the detailed description of steps S210 to S220, refer to steps S110 to S120, which are not described herein again.
Step S230: acquire the sentence pattern inclusion relationship among the plurality of general sentence patterns, and acquire the sentence pattern complexity of each of the plurality of general sentence patterns.
In this embodiment, after the plurality of general sentence patterns are obtained, the sentence pattern inclusion relationship among them may be obtained. In some embodiments, this inclusion relationship may be obtained based on the sample coverage of the plurality of general sentence patterns. Specifically, after the plurality of general sentence patterns are obtained, parent-child node relationships may be divided based on sample coverage: the sentence pattern with the largest coverage is set as the parent node, and child nodes at successive levels from top to bottom are divided according to the sample coverage of the remaining general sentence patterns from large to small. That is, the parent node has the greatest generalization ability but the least concrete meaning, while the generalization ability of the child nodes decreases level by level from top to bottom and their concrete meaning correspondingly increases.
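The coverage-based division described above can be sketched by defining a pattern's sample coverage as the set of corpora it matches and treating pattern P as an ancestor of pattern Q when P's sample set strictly contains Q's. The sketch assumes the general sentence patterns are usable as regular expressions; the function names and the strict-superset test are illustrative assumptions.

```python
import re

def coverage(patterns, corpus):
    """Sample coverage: the set of corpus sentences each pattern fully matches."""
    return {p: frozenset(s for s in corpus if re.fullmatch(p, s)) for p in patterns}

def containment_edges(cover):
    """Directed parent -> child edges wherever the parent's sample
    coverage strictly contains the child's."""
    return [(p, q) for p in cover for q in cover
            if p != q and cover[p] > cover[q]]
```

A pattern whose coverage set contains every other pattern's coverage set would then be placed at the root of the tree in FIG. 3, with lower-coverage patterns nested beneath it.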
Referring to FIG. 3, FIG. 3 is a diagram illustrating the sentence pattern inclusion relationship among a plurality of general sentence patterns provided in an embodiment of the present application. As shown in FIG. 3, the plurality of general sentence patterns includes general sentence pattern S_0, general sentence pattern S_1, and the general sentence patterns S_0^1, S_0^2, S_0^3, S_1^1, S_1^2, S_1^3, and so on, where general sentence pattern S_0 covers general sentence patterns S_0^1, S_0^2, and S_0^3, general sentence pattern S_1 covers general sentence patterns S_1^1, S_1^2, and S_1^3, and some of these child sentence patterns in turn cover further general sentence patterns at the level below them. Therefore, general sentence patterns S_0 and S_1 can be determined as parent nodes, and general sentence patterns S_0^1, S_0^2, S_0^3, S_1^1, S_1^2, and so on can be determined as child nodes.
In the present embodiment, after the plurality of general sentence patterns are obtained, the sentence pattern complexity of each of the plurality of general sentence patterns may be obtained. The greater the sentence pattern complexity, the more complex the general sentence pattern is and the more concrete meaning it carries; the smaller the sentence pattern complexity, the simpler the general sentence pattern is and the less concrete meaning it carries. In some embodiments, the sentence pattern complexity of each general sentence pattern may be obtained as the sum t_1 + t_2 + ... + t_n, where n denotes the number of segments into which the general sentence pattern is divided by its wildcards and t_i denotes the number of words of the i-th divided segment; for example, the sentence pattern complexity of the general sentence pattern "(.+?) is a (.+?) in which country" mined above is the total number of words in its fixed segments.
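A minimal sketch of this complexity measure, assuming the complexity is the sum of the word counts t_i over the n fixed segments left after splitting out the wildcards; treat the scoring function, like the function name, as an assumption rather than the patent's exact formula.

```python
def pattern_complexity(pattern: str, wildcard: str = "(.+?)") -> int:
    """Sentence pattern complexity: the total number of words across the
    n fixed segments that remain after removing the wildcards."""
    segments = [seg for seg in pattern.split(wildcard) if seg.strip()]
    return sum(len(seg.split()) for seg in segments)
```

Under this scoring, a pattern with longer or more numerous fixed segments scores higher, matching the intuition that such patterns carry more concrete meaning.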
Step S240: filter the plurality of general sentence patterns based on the sentence pattern inclusion relationship among them and the sentence pattern complexity of each general sentence pattern, and screen out general sentence patterns meeting the specified criterion from the plurality of general sentence patterns as standard sentence patterns.
In this embodiment, after the sentence pattern inclusion relationship among the plurality of general sentence patterns and the sentence pattern complexity of each general sentence pattern are obtained, the plurality of general sentence patterns may be filtered based on both, so that general sentence patterns meeting the specified criterion are screened out as standard sentence patterns. It can be understood that, depending on the requirement, the general sentence patterns screened out from the plurality of general sentence patterns can have a certain generalization ability and a certain concrete meaning.
In some embodiments, if the requirement is to screen out general sentence patterns with strong generalization ability and weak concrete meaning, the plurality of general sentence patterns may be filtered based on the sentence pattern inclusion relationship among them and the sentence pattern complexity of each general sentence pattern, so as to screen out general sentence patterns with large sample coverage and small sentence pattern complexity as standard sentence patterns.
In some embodiments, if the requirement is to screen out general sentence patterns with weak generalization ability and strong concrete meaning, the plurality of general sentence patterns may be filtered based on the sentence pattern inclusion relationship among them and the sentence pattern complexity of each general sentence pattern, so as to screen out general sentence patterns with small sample coverage and large sentence pattern complexity as standard sentence patterns.
In some embodiments, if the requirement is to screen out general sentence patterns with both a certain generalization ability and a certain concrete meaning, the plurality of general sentence patterns may be filtered based on the sentence pattern inclusion relationship among them and the sentence pattern complexity of each general sentence pattern, so as to screen out, as standard sentence patterns, the general sentence patterns whose inclusion relationship with the other general sentence patterns satisfies a first specified criterion and whose sentence pattern complexity satisfies a second specified criterion. The first specified criterion may be preset and stored as a basis for judging the sentence pattern inclusion relationship between a general sentence pattern and the other general sentence patterns, so that after this inclusion relationship is obtained, it may be compared with the first specified criterion to determine whether it is satisfied. Similarly, the second specified criterion may be preset and stored as a basis for judging the sentence pattern complexity of each general sentence pattern, so that after the sentence pattern complexity of a general sentence pattern is obtained, it may be compared with the second specified criterion to determine whether it is satisfied.
Referring to fig. 4, fig. 4 is a flowchart illustrating step S240 of the sentence mining method illustrated in fig. 2 of the present application. As will be described in detail with respect to the flow shown in fig. 4, the method may specifically include the following steps:
step S241: and acquiring the in-degree of each general sentence pattern in the general sentence patterns based on the sentence pattern inclusion relationship among the general sentence patterns.
In this embodiment, after the sentence pattern inclusion relationships among the plurality of general sentence patterns are obtained, the in-degree of each general sentence pattern in the plurality of general sentence patterns may be obtained based on those inclusion relationships (the in-degree notation appears as formula images in the original filing and is not reproduced here). The in-degree of a general sentence pattern reflects, to a certain extent, its generalization capability: as shown in FIG. 3, when the in-degree of one general sentence pattern among the plurality of general sentence patterns is greater than that of another, the former has the stronger generalization capability.
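The in-degree computation of step S241 can be sketched as counting, for each general sentence pattern, the number of inclusion edges pointing into it; the edge direction used here (from the included pattern to the including, more general one) is an assumption about how the inclusion graph of FIG. 3 is drawn:

```python
from collections import Counter

def in_degrees(patterns, inclusion_edges):
    # inclusion_edges: iterable of (specific, general) pairs, each meaning
    # that `general` includes `specific`.  The in-degree of a pattern is the
    # number of other patterns it includes, so a larger in-degree indicates
    # stronger generalization capability.
    deg = Counter({p: 0 for p in patterns})
    for _specific, general in inclusion_edges:
        deg[general] += 1
    return dict(deg)
```
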
Step S242: and filtering the plurality of general sentence patterns based on the in-degree of each general sentence pattern and the sentence pattern complexity of each general sentence pattern, and screening out the general sentence patterns meeting the specified standard from the plurality of general sentence patterns as standard sentence patterns.
In this embodiment, after the in-degree of each general sentence pattern and the sentence pattern complexity of each general sentence pattern are obtained, the plurality of general sentence patterns can be filtered based on the in-degree and the sentence pattern complexity of each general sentence pattern, so as to screen out the general sentence patterns meeting the specified standard from the plurality of general sentence patterns as the standard sentence patterns. It can be understood that, according to the requirement, the general sentence patterns meeting the specified standard that are selected from the plurality of general sentence patterns can have a certain generalization capability and a certain meaning.
In some embodiments, if the requirement is to screen out general sentence patterns with strong generalization capability but weak meaning, the plurality of general sentence patterns may be filtered based on the in-degree of each general sentence pattern and the sentence pattern complexity of each general sentence pattern, so as to screen out the general sentence patterns with a larger in-degree and a smaller sentence pattern complexity as the standard sentence patterns.
In some embodiments, if the requirement is to select general sentence patterns with weak generalization capability but strong meaning, the plurality of general sentence patterns may be filtered based on the in-degree of each general sentence pattern and the sentence pattern complexity of each general sentence pattern, so as to select the general sentence patterns with a smaller in-degree and a larger sentence pattern complexity as the standard sentence patterns.
In some embodiments, if the requirement is to select general sentence patterns having both a certain generalization capability and a certain meaning, the plurality of general sentence patterns may be filtered based on the in-degree of each general sentence pattern and the sentence pattern complexity of each general sentence pattern, so as to select, as the standard sentence pattern, a general sentence pattern whose in-degree satisfies a third specified criterion and whose sentence pattern complexity satisfies the second specified criterion. The third specified criterion may be preset and stored as the basis for judging the in-degree of a general sentence pattern: once the in-degree of the general sentence pattern is obtained, it can be compared against the third specified criterion to determine whether it is satisfied.
In some embodiments, a specified in-degree may be preset and stored as the basis for judging the in-degree of each general sentence pattern: when the in-degree of a general sentence pattern is greater than the specified in-degree, it may be determined that the in-degree satisfies the third specified criterion; otherwise, it may be determined that it does not. Similarly, a specified complexity may be preset and stored as the basis for judging the sentence pattern complexity of each general sentence pattern: when the sentence pattern complexity of a general sentence pattern is greater than the specified complexity, it may be determined that the complexity satisfies the second specified criterion; otherwise, it may be determined that it does not. Therefore, in this embodiment, based on the specified in-degree and the specified complexity, general sentence patterns whose in-degree is greater than the specified in-degree and whose sentence pattern complexity is greater than the specified complexity can be selected from the plurality of general sentence patterns as the standard sentence patterns, so that the obtained standard sentence patterns have both a certain generalization capability and a certain meaning.
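The threshold filtering just described can be sketched as a single pass over the candidate patterns; the strict '>' comparisons mirror the specified in-degree and specified complexity criteria above:

```python
def select_standard(patterns, in_degree, complexity, min_in_degree, min_complexity):
    # Keep the general sentence patterns whose in-degree exceeds the specified
    # in-degree AND whose sentence pattern complexity exceeds the specified
    # complexity; both comparisons are strict, matching the text above.
    return [p for p in patterns
            if in_degree[p] > min_in_degree and complexity[p] > min_complexity]
```
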
In another embodiment of the present application, a sentence pattern mining method is provided: a plurality of corpora to be mined are obtained, double-sequence comparison is performed on the plurality of corpora to be mined to obtain a plurality of general sentence patterns of the plurality of corpora to be mined, the sentence pattern inclusion relationships among the plurality of general sentence patterns are obtained, the sentence pattern complexity of each general sentence pattern in the plurality of general sentence patterns is obtained, the plurality of general sentence patterns are filtered based on the sentence pattern inclusion relationships among them and the sentence pattern complexity of each general sentence pattern, and the general sentence patterns meeting a specified standard are screened out from the plurality of general sentence patterns as standard sentence patterns. Compared with the sentence pattern mining method shown in fig. 1, this embodiment obtains the standard sentence patterns by filtering the plurality of general sentence patterns according to the sentence pattern inclusion relationships among them and the sentence pattern complexity of each general sentence pattern, so as to improve the accuracy of the obtained standard sentence patterns.
Referring to fig. 5, fig. 5 is a schematic flow chart diagram illustrating a sentence mining method according to still another embodiment of the present application. As will be described in detail with respect to the flow shown in fig. 5, the sentence mining method may specifically include the following steps:
step S310: and acquiring a plurality of linguistic data to be excavated.
For details of step S310, please refer to step S110, which is not repeated herein.
Step S320: and acquiring the sequence type of each corpus to be mined in the plurality of corpora to be mined.
In some embodiments, the double-sequence comparison may include a global comparison and a local comparison. The global comparison aligns the two sequences over their entire length and is typically used when the sequence types are similar or the sequences are of approximately the same length, while the local comparison aligns only the most similar local regions and is more suitable when the sequence types are less similar.
In this embodiment, in order to select a more suitable method from the global comparison and the local comparison to perform the double-sequence comparison on the multiple corpora to be mined, the sequence type of each corpus to be mined in the multiple corpora to be mined may be obtained.
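A minimal word-level sketch of how a double-sequence comparison can derive a general sentence pattern from two corpora; a longest-common-subsequence dynamic program is used here as a simplified stand-in for a full global alignment, and the '*' slot notation is an illustrative assumption:

```python
def general_pattern(a, b):
    """Derive a general sentence pattern from two corpora by a word-level
    global comparison: words shared in order are kept, and every maximal run
    of mismatched words is collapsed into a single '*' slot."""
    x, y = a.split(), b.split()
    m, n = len(x), len(y)
    # Standard longest-common-subsequence dynamic-programming table.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m - 1, -1, -1):
        for j in range(n - 1, -1, -1):
            if x[i] == y[j]:
                dp[i][j] = dp[i + 1][j + 1] + 1
            else:
                dp[i][j] = max(dp[i + 1][j], dp[i][j + 1])
    # Walk the table, emitting shared words and '*' for skipped stretches.
    out, i, j, gap = [], 0, 0, False
    while i < m and j < n:
        if x[i] == y[j]:
            if gap:
                out.append("*")
                gap = False
            out.append(x[i])
            i, j = i + 1, j + 1
        elif dp[i + 1][j] >= dp[i][j + 1]:
            i, gap = i + 1, True
        else:
            j, gap = j + 1, True
    if gap or i < m or j < n:
        out.append("*")
    return " ".join(out)
```

For example, comparing "how do I close my account" with "how do I update my account" yields the general sentence pattern "how do I * my account".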
Step S330: and determining a processing mode of performing double-sequence comparison on the plurality of linguistic data to be mined based on the sequence type of each linguistic data to be mined.
In this embodiment, after the sequence type of each corpus to be mined is obtained, a processing mode for performing double-sequence comparison on a plurality of corpora to be mined may be determined based on the sequence type of each corpus to be mined. In some embodiments, after the sequence type of each corpus to be mined is obtained, a processing mode of performing double-sequence comparison on a plurality of corpuses to be mined may be determined from the global comparison and the local comparison based on the sequence type of each corpus to be mined.
Referring to fig. 6, fig. 6 is a flowchart illustrating step S330 of the sentence mining method illustrated in fig. 5 of the present application. As will be explained in detail with respect to the flow shown in fig. 6, the method may specifically include the following steps:
step S331: and acquiring the sequence similarity among the corpora to be excavated based on the sequence type of each corpus to be excavated.
In some embodiments, after the sequence type of each corpus to be mined is obtained, the sequence similarity among the plurality of corpora to be mined may be obtained based on the sequence type of each corpus to be mined. In one approach, after the sequence type of each corpus to be mined is obtained, the sequence types of the plurality of corpora to be mined can be matched against one another to obtain the sequence similarity among the plurality of corpora to be mined.
Step S332: and determining a processing mode of performing double-sequence comparison on the plurality of linguistic data to be mined from the global comparison and the local comparison based on the sequence similarity among the plurality of linguistic data to be mined.
In some embodiments, after the sequence similarity among the plurality of corpora to be mined is obtained, the processing mode for performing double-sequence comparison on the plurality of corpora to be mined may be determined from the global comparison and the local comparison based on that sequence similarity; that is, based on the sequence similarity among the plurality of corpora to be mined, either the global comparison or the local comparison is determined as the processing mode for performing double-sequence comparison on the plurality of corpora to be mined.
Referring to fig. 7, fig. 7 is a flowchart illustrating step S332 of the sentence mining method illustrated in fig. 6 of the present application. As will be described in detail with respect to the flow shown in fig. 7, the method may specifically include the following steps:
step S3321: and when the sequence similarity among the corpora to be mined is greater than the specified similarity, determining the global comparison as a processing mode of performing double-sequence comparison on the corpora to be mined.
In the present embodiment, when the sequence similarity between the corpora to be mined is greater than the specified similarity, the global alignment may be determined as a processing manner for performing double-sequence alignment on the corpora to be mined.
Step S3322: and when the sequence similarity among the corpora to be mined is not more than the specified similarity, determining the local comparison as a processing mode of performing double-sequence comparison on the corpora to be mined.
In this embodiment, when the sequence similarity between the corpora to be mined is not greater than the specified similarity, the local comparison may be determined as a processing mode of performing double-sequence comparison on the corpora to be mined.
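The dispatch in steps S3321–S3322 can be sketched as follows; since the text does not fix a concrete similarity measure, the Jaccard word-overlap metric and the 0.5 default threshold are illustrative assumptions:

```python
def choose_comparison(corpus_a, corpus_b, specified_similarity=0.5):
    # Sequence similarity is approximated by the Jaccard overlap of the two
    # word sets (an assumption; the embodiment leaves the measure open).
    a, b = set(corpus_a.split()), set(corpus_b.split())
    similarity = len(a & b) / len(a | b)
    # Greater than the specified similarity -> global comparison (S3321);
    # otherwise -> local comparison (S3322).
    return "global" if similarity > specified_similarity else "local"
```
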
Step S340: and performing double-sequence comparison on the plurality of linguistic data to be mined based on the processing mode to obtain a plurality of general sentence patterns corresponding to the plurality of linguistic data to be mined.
Step S350: and filtering the plurality of general sentence patterns, and screening out the general sentence patterns meeting the specified standard from the plurality of general sentence patterns as standard sentence patterns.
For detailed description of steps S340 to S350, please refer to steps S120 to S130, which are not described herein again.
In another embodiment of the sentence pattern mining method provided in this application, a plurality of corpora to be mined are obtained, the sequence type of each corpus to be mined in the plurality of corpora to be mined is obtained, a processing mode for performing double-sequence comparison on the plurality of corpora to be mined is determined based on the sequence type of each corpus to be mined, double-sequence comparison is performed on the plurality of corpora to be mined based on the processing mode to obtain a plurality of general sentence patterns corresponding to the plurality of corpora to be mined, the plurality of general sentence patterns are filtered, and the general sentence patterns meeting a specified standard are screened out from the plurality of general sentence patterns as standard sentence patterns. Compared with the sentence pattern mining method shown in fig. 1, this embodiment determines the adopted double-sequence comparison method based on the sequence type of each corpus to be mined, so as to improve the accuracy of the obtained general sentence patterns.
Referring to fig. 8, fig. 8 is a schematic flow chart illustrating a sentence mining method according to another embodiment of the present application. As will be described in detail with respect to the flow shown in fig. 8, the sentence mining method may specifically include the following steps:
step S410: and acquiring a plurality of linguistic data to be excavated.
Step S420: and performing double-sequence comparison on the plurality of linguistic data to be mined to obtain a plurality of general sentence patterns corresponding to the plurality of linguistic data to be mined.
Step S430: and filtering the plurality of general sentence patterns, and screening out the general sentence patterns meeting the specified standard from the plurality of general sentence patterns as standard sentence patterns.
For detailed description of steps S410 to S430, please refer to steps S110 to S130, which are not described herein again.
Step S440: and outputting the standard sentence pattern.
In some embodiments, after the standard sentence pattern is obtained, it may be output to serve subsequent NLP downstream tasks. On this basis, the present embodiment may be used to assist intent recognition: high-frequency question sentences/phrasings are automatically mined from users' historical question-answer data, helping analysts/product managers quickly understand user intent and saving labor cost. This embodiment may also be used to improve the effect of a text classification model: in short-text classification tasks, some sentence patterns, matched together with entity information, can effectively handle entity-dependent classification texts and be embedded into the model as prior/external knowledge. This embodiment may further be used for answer templates in community question-answering tasks: in NLP question-answering, the high-frequency phrasings of users' questions are found, and answer template sentence patterns are then prepared in a targeted manner (answers to some questions in sensitive vertical domains, such as financial customer service, need to be limited to certain sentence patterns); alternatively, the sentence patterns of Q and A are mined from large-scale community (Q, A) pairs, and the pattern of A is arranged as the answer template for Q.
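As a hedged illustration of the answer-template use case, mined question patterns can be mapped to answer templates and matched against an incoming question; the pattern, template, and slot-substitution scheme below are all hypothetical examples, not taken from the filing:

```python
import re

# Hypothetical mined mapping from a standard question pattern to an answer
# template; the text captured by the '*' slot is substituted into the answer.
TEMPLATES = {
    "how do I * my account": "To {} your account, visit the account settings page.",
}

def answer(question):
    # Try each mined pattern; on a match, fill the answer template with the
    # words captured by the '*' slot.
    for pattern, template in TEMPLATES.items():
        regex = "^" + " ".join(
            "(.+)" if tok == "*" else re.escape(tok) for tok in pattern.split()
        ) + "$"
        match = re.match(regex, question)
        if match:
            return template.format(match.group(1))
    return None  # no mined pattern applies
```
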
Referring to fig. 9, fig. 9 is a flowchart illustrating step S440 of the sentence mining method illustrated in fig. 8 of the present application. As will be described in detail with respect to the flow shown in fig. 9, the method may specifically include the following steps:
step S441: when the standard sentence pattern is a query sentence pattern, a standard reply sentence pattern is obtained based on the standard sentence pattern.
In some embodiments, a sentence pattern format of the determined standard sentence pattern may be identified, wherein the sentence pattern format may include a statement sentence pattern, a query sentence pattern, and the like, and in this embodiment, when the standard sentence pattern is identified as the query sentence pattern, a standard reply sentence pattern corresponding to the standard sentence pattern may be obtained based on the standard sentence pattern, wherein one standard sentence pattern may correspond to one standard reply sentence pattern, may correspond to a plurality of standard reply sentence patterns, and the like, which is not limited herein.
Step S442: and outputting the standard sentence pattern and the standard reply sentence pattern.
In some embodiments, after the standard sentence pattern and the standard reply sentence pattern are obtained, the standard sentence pattern and the standard reply sentence pattern may be output.
In another embodiment of the present application, a sentence pattern mining method is provided, in which a plurality of corpora to be mined are obtained, double-sequence comparison is performed on the plurality of corpora to be mined to obtain a plurality of general sentence patterns corresponding to the plurality of corpora to be mined, the plurality of general sentence patterns are filtered, the general sentence patterns meeting a specified standard are screened out from the plurality of general sentence patterns as standard sentence patterns, and the standard sentence patterns are output. Compared with the sentence pattern mining method shown in fig. 1, this embodiment also outputs the standard sentence patterns for the corresponding downstream tasks to use, so as to improve the response accuracy of those downstream tasks.
Referring to fig. 10, fig. 10 is a schematic flow chart diagram illustrating a sentence mining method according to yet another embodiment of the present application. As will be described in detail with respect to the flow shown in fig. 10, the sentence mining method may specifically include the following steps:
step S510: a training data set is obtained, wherein the training data set comprises a plurality of corpora and a plurality of standard sentence patterns.
The embodiment of the present application further includes a training method for the sentence pattern mining model: the sentence pattern mining model may be trained in advance on the acquired training data set, and subsequently each sentence pattern mining task can be processed by the trained model, without the model having to be retrained every time sentence pattern mining is performed.
In some implementations, a training data set can be collected, where the training data set includes a plurality of corpora and a plurality of standard sentence patterns.
Step S520: and based on the training data set, taking each corpus as input data and each standard sentence pattern as output data, and training through a machine learning algorithm to obtain a trained sentence pattern mining model.
In the embodiment of the present application, a machine learning algorithm may be used to train on the training data set to obtain the sentence pattern mining model. The machine learning algorithm adopted may include: a neural network, a Long Short-Term Memory (LSTM) network, a gated recurrent unit, a simple recurrent unit, an autoencoder, a decision tree, a random forest, feature mean classification, a classification and regression tree, a hidden Markov model, a K-Nearest Neighbor (KNN) algorithm, a logistic regression model, a Bayesian model, a Gaussian model, Kullback-Leibler (KL) divergence, and the like. The specific machine learning algorithm is not limited herein.
The training of the initial model based on the training data set is described below using a neural network as an example.
The corpus in a set of data in the training dataset is used as an input sample (input data) of the neural network, and the standard sentence pattern in a set of data is used as an output sample (output data) of the neural network. The neurons in the input layer are fully connected with the neurons in the hidden layer, and the neurons in the hidden layer are fully connected with the neurons in the output layer, so that potential features with different granularities can be effectively extracted. And the number of hidden layers can be multiple, so that the nonlinear relation can be better fitted, and the sentence pattern mining model obtained through training is more accurate.
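As a toy stand-in for the trained sentence pattern mining model described above (the embodiment itself uses a neural network or another machine learning algorithm), the "model" below simply memorizes the training pairs and predicts the standard sentence pattern of the training corpus with the highest word overlap; it illustrates the input/output contract only, not the actual training procedure:

```python
def train(pairs):
    # `pairs`: training data set of (corpus, standard sentence pattern),
    # i.e. each corpus is the input sample and each standard sentence
    # pattern is the output sample.
    def predict(corpus):
        # Return the standard sentence pattern of the most word-overlapping
        # training corpus (a nearest-neighbor stand-in for the neural model).
        words = set(corpus.split())
        best_corpus, best_pattern = max(
            pairs, key=lambda pair: len(words & set(pair[0].split()))
        )
        return best_pattern
    return predict
```
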
It is understood that the training process of the sentence pattern mining model may or may not be performed by the electronic device. When the training process is not performed by the electronic device, the electronic device may act only as a direct user of the trained model, or as an indirect user.
In some embodiments, the sentence mining model may be trained and updated by periodically or aperiodically acquiring new training data.
Step S530: and acquiring a plurality of linguistic data to be excavated.
Step S540: and performing double-sequence comparison on the plurality of linguistic data to be mined to obtain a plurality of general sentence patterns corresponding to the plurality of linguistic data to be mined.
Step S550: and filtering the plurality of general sentence patterns, and screening out the general sentence patterns meeting the specified standard from the plurality of general sentence patterns as standard sentence patterns.
For the detailed description of steps S530 to S550, please refer to steps S110 to S130, which are not described herein again.
In another embodiment of the present application, a sentence pattern mining method is provided: a training data set including a plurality of corpora and a plurality of standard sentence patterns is obtained; based on the training data set, each corpus is used as input data and each standard sentence pattern as output data, and a trained sentence pattern mining model is obtained through a machine learning algorithm; a plurality of corpora to be mined are obtained, double-sequence comparison is performed on the plurality of corpora to be mined to obtain a plurality of general sentence patterns corresponding to the plurality of corpora to be mined, the plurality of general sentence patterns are filtered, and the general sentence patterns meeting a specified standard are screened out from the plurality of general sentence patterns as standard sentence patterns. Compared with the sentence pattern mining method shown in fig. 1, this embodiment further collects a training data set and trains a sentence pattern mining model to perform standard sentence pattern mining on the corpora, so as to improve the accuracy of the obtained standard sentence patterns.
Referring to fig. 11, fig. 11 shows a block diagram of a sentence mining apparatus 200 according to an embodiment of the present application, and the following description will be made with respect to the block diagram shown in fig. 11, where the sentence mining apparatus 200 includes: a corpus acquiring module 210 to be mined, a general sentence pattern acquiring module 220, and a standard sentence pattern acquiring module 230, wherein:
the corpus to be mined obtaining module 210 is configured to obtain multiple corpora to be mined.
A general sentence pattern obtaining module 220, configured to perform double-sequence comparison on the multiple corpora to be mined, so as to obtain multiple general sentence patterns corresponding to the multiple corpora to be mined.
Further, the general sentence pattern obtaining module 220 includes: a sequence type obtaining submodule, a processing mode determining submodule and a general sentence pattern obtaining submodule, wherein:
and the sequence type obtaining submodule is used for obtaining the sequence type of each to-be-mined corpus in the plurality of to-be-mined corpora.
And the processing mode determining submodule is used for determining a processing mode for performing double-sequence comparison on the plurality of linguistic data to be mined based on the sequence type of each linguistic data to be mined.
Further, the processing mode determining sub-module includes: a processing manner determination unit, wherein:
and the processing mode determining unit is used for determining a processing mode for performing double-sequence comparison on the plurality of linguistic data to be mined from the global comparison and the local comparison based on the sequence type of each linguistic data to be mined.
Further, the processing manner determining unit includes: a sequence similarity obtaining subunit and a processing mode determining subunit, wherein:
and the sequence similarity obtaining subunit is configured to obtain the sequence similarity between the multiple corpora to be mined based on the sequence type of each corpus to be mined.
And the processing mode determining subunit is configured to determine, from the global comparison and the local comparison, a processing mode for performing a double-sequence comparison on the plurality of corpuses to be mined based on the sequence similarity between the plurality of corpuses to be mined.
Further, the processing manner determining subunit includes: a first processing mode determining subunit and a second processing mode determining subunit, wherein:
and the first processing mode determining subunit is used for determining the global comparison as a processing mode for performing double-sequence comparison on the plurality of linguistic data to be mined when the sequence similarity between the plurality of linguistic data to be mined is greater than a specified similarity.
And the second processing mode determining subunit is used for determining the local comparison as a processing mode for performing double-sequence comparison on the plurality of linguistic data to be excavated when the sequence similarity between the plurality of linguistic data to be excavated is not greater than the specified similarity.
And the general sentence pattern obtaining submodule is used for carrying out double-sequence comparison on the plurality of linguistic data to be excavated based on the processing mode to obtain a plurality of general sentence patterns corresponding to the plurality of linguistic data to be excavated.
A standard sentence pattern obtaining module 230, configured to filter the multiple general sentence patterns, and screen out a general sentence pattern meeting a specified standard from the multiple general sentence patterns as a standard sentence pattern.
Further, the standard schema obtaining module 230 includes: an information acquisition submodule and a standard sentence pattern acquisition submodule, wherein:
and the information acquisition submodule is used for acquiring the sentence pattern inclusion relation among the plurality of general sentence patterns and acquiring the sentence pattern complexity of each general sentence pattern in the plurality of general sentence patterns.
Further, the information acquisition sub-module includes: a sentence complexity obtaining unit, wherein:
a sentence pattern complexity obtaining unit for obtaining a complexity based on
Figure PCTCN2020084769-APPB-000035
A sentence complexity of each of the plurality of general sentences is obtained, where n denotes the number of times the general sentence is divided and t denotes the number of words of each divided segment in the general sentence.
And the standard sentence pattern obtaining submodule is used for filtering the plurality of general sentence patterns based on the sentence pattern inclusion relationship among the plurality of general sentence patterns and the sentence pattern complexity of each general sentence pattern, and screening out the general sentence patterns meeting the specified standard from the plurality of general sentence patterns as the standard sentence patterns.
Further, the standard sentence pattern obtaining submodule includes: a first standard sentence pattern obtaining unit, wherein:
and the first standard sentence pattern obtaining unit is used for screening out the general sentence patterns which satisfy the first specified standard and the second specified standard in sentence pattern complexity from the plurality of general sentence patterns as the standard sentence patterns.
Further, the standard sentence pattern obtaining submodule includes: an in-degree obtaining unit and a second standard sentence pattern obtaining unit, wherein:
An in-degree obtaining unit, configured to obtain the in-degree of each general sentence pattern in the plurality of general sentence patterns based on the sentence pattern inclusion relationships among the plurality of general sentence patterns.
And the second standard sentence pattern obtaining unit is used for filtering the plurality of general sentence patterns based on the in-degree of each general sentence pattern and the sentence pattern complexity of each general sentence pattern, and screening out the general sentence patterns meeting the specified standard from the plurality of general sentence patterns as standard sentence patterns.
Further, the second standard sentence pattern obtaining unit includes: a standard sentence pattern obtaining subunit, wherein:
and the standard sentence pattern obtaining subunit is used for screening out a universal sentence pattern from the plurality of universal sentence patterns, wherein the drawing in degree meets a third specified standard, and the complexity of the sentence pattern meets a second specified standard, and the universal sentence pattern is used as the standard sentence pattern.
Further, the standard sentence pattern obtaining subunit includes: a standard sentence pattern obtaining subunit, wherein:
And the standard sentence pattern obtaining subunit is used for screening out, from the plurality of general sentence patterns, the general sentence patterns whose in-degree is greater than the specified in-degree and whose sentence pattern complexity is greater than the specified complexity, as the standard sentence patterns.
Further, the sentence pattern digging device 200 further includes: a standard sentence pattern output module, wherein:
and the standard sentence pattern output module is used for outputting the standard sentence pattern.
Further, the standard sentence pattern output module comprises: a standard reply sentence pattern acquisition submodule and a standard sentence pattern output submodule, wherein:
and a standard reply sentence pattern obtaining sub-module for obtaining a standard reply sentence pattern based on the standard sentence pattern when the standard sentence pattern is the query sentence pattern.
And the standard sentence pattern output submodule is used for outputting the standard sentence pattern and the standard reply sentence pattern.
Further, the sentence pattern digging device 200 further includes: training data set acquisition module and sentence pattern excavation model training module, wherein:
and the training data set acquisition module is used for acquiring a training data set, and the training data set comprises a plurality of corpora and a plurality of standard sentence patterns.
And the sentence pattern mining model training module is used for training, based on the training data set, through a machine learning algorithm with each corpus as input data and each standard sentence pattern as output data, to obtain a trained sentence pattern mining model.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the specific working processes of the above-described devices and modules may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, the coupling between the modules may be electrical, mechanical or other type of coupling.
In addition, functional modules in the embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode.
Referring to fig. 12, a block diagram of an electronic device 100 according to an embodiment of the present application is shown. The electronic device 100 may be a smart phone, a tablet computer, an electronic book reader, or another electronic device capable of running an application. The electronic device 100 in the present application may include one or more of the following components: a processor 110, a memory 120, and one or more applications, wherein the one or more applications may be stored in the memory 120 and configured to be executed by the one or more processors 110, and the one or more applications are configured to perform the methods described in the foregoing method embodiments.
The processor 110 may include one or more processing cores. The processor 110 connects various parts within the entire electronic device 100 using various interfaces and lines, and performs various functions of the electronic device 100 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 120 and calling data stored in the memory 120. Optionally, the processor 110 may be implemented in hardware using at least one of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 110 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is used for rendering and drawing the content to be displayed; and the modem is used to handle wireless communication. It is understood that the modem may not be integrated into the processor 110 but may instead be implemented by a separate communication chip.
The memory 120 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory 120 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 120 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, or an image playing function), instructions for implementing the various method embodiments described above, and the like. The data storage area may store data created by the electronic device 100 during use (e.g., a phone book, audio and video data, and chat log data), and the like.
Referring to fig. 13, a block diagram of a computer-readable storage medium according to an embodiment of the present application is shown. The computer-readable medium 300 has stored therein a program code that can be called by a processor to execute the method described in the above-described method embodiments.
The computer-readable storage medium 300 may be an electronic memory such as a flash memory, an EEPROM (Electrically Erasable Programmable Read-Only Memory), an EPROM, a hard disk, or a ROM. Optionally, the computer-readable storage medium 300 includes a non-volatile computer-readable storage medium. The computer-readable storage medium 300 has storage space for program code 310 for performing any of the method steps of the methods described above. The program code can be read from or written into one or more computer program products. The program code 310 may, for example, be compressed in a suitable form.
To sum up, the sentence pattern mining method, apparatus, electronic device, and storage medium provided in the embodiments of the present application acquire a plurality of corpora to be mined, perform double-sequence comparison on the corpora to be mined to obtain a plurality of general sentence patterns corresponding to them, and then filter the plurality of general sentence patterns to screen out, as standard sentence patterns, the general sentence patterns meeting a specified standard. In this way, general sentence patterns are obtained by double-sequence comparison of the corpora to be mined and are then filtered into standard sentence patterns, so that standard sentence patterns can be obtained from the corpora to be mined quickly and conveniently for further processing.
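The pipeline summarized above — pairwise ("double-sequence") comparison of corpora to produce general sentence patterns, followed by filtering — can be illustrated with a minimal sketch. This is an assumption-laden illustration, not the patent's implementation: it works on word-level sequences, uses Python's `difflib.SequenceMatcher` as a stand-in for a full Needleman-Wunsch or Smith-Waterman alignment, and filters candidates with a crude fixed-word threshold rather than the in-degree and complexity criteria of the embodiments; `SLOT`, `align_pattern`, and `mine_patterns` are illustrative names.

```python
# Illustrative sketch only: mining general sentence patterns by pairwise
# (double-sequence) alignment of word sequences, then filtering candidates.
# difflib.SequenceMatcher stands in for a Needleman-Wunsch global alignment;
# the fixed-word threshold is a simplified stand-in for the in-degree and
# complexity filtering described in the embodiments.
from difflib import SequenceMatcher
from itertools import combinations

SLOT = "<*>"  # wildcard marking the variable (non-matching) segments

def align_pattern(a, b):
    """Align two token sequences; keep shared words, mark gaps with SLOT."""
    sm = SequenceMatcher(a=a, b=b, autojunk=False)
    out, ia, ib = [], 0, 0
    for m in sm.get_matching_blocks():  # ends with a dummy zero-size block
        if m.a > ia or m.b > ib:        # skipped region in either sequence
            out.append(SLOT)
        out.extend(a[m.a:m.a + m.size])
        ia, ib = m.a + m.size, m.b + m.size
    return out

def mine_patterns(corpora, min_fixed_words=2):
    """Align every corpus pair and keep patterns with enough fixed words."""
    patterns = set()
    for s1, s2 in combinations(corpora, 2):
        pat = align_pattern(s1.split(), s2.split())
        fixed = sum(1 for w in pat if w != SLOT)
        if SLOT in pat and fixed >= min_fixed_words:
            patterns.add(" ".join(pat))
    return patterns

corpora = ["book a flight to paris",
           "book a flight to tokyo",
           "book a hotel in tokyo"]
print(sorted(mine_patterns(corpora)))
```

Aligning the first two corpora yields the general sentence pattern "book a flight to <*>"; comparing all three pairs also yields the broader "book a <*>" and "book a <*> tokyo", which is where the subsequent inclusion-relation and complexity filtering of the embodiments would come in.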
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not necessarily depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (20)

  1. A sentence pattern mining method is characterized by comprising the following steps:
    acquiring a plurality of linguistic data to be mined;
    performing double-sequence comparison on the plurality of linguistic data to be mined to obtain a plurality of general sentence patterns corresponding to the plurality of linguistic data to be mined;
    and filtering the plurality of general sentence patterns, and screening out the general sentence patterns meeting the specified standard from the plurality of general sentence patterns as standard sentence patterns.
  2. The method of claim 1, wherein said filtering said plurality of general sentence patterns and screening out a general sentence pattern meeting a specified standard from said plurality of general sentence patterns as a standard sentence pattern comprises:
    acquiring sentence pattern inclusion relations among the plurality of general sentence patterns, and acquiring sentence pattern complexity of each general sentence pattern in the plurality of general sentence patterns;
    and filtering the plurality of general sentence patterns based on the sentence pattern inclusion relationship among the plurality of general sentence patterns and the sentence pattern complexity of each general sentence pattern, and screening out the general sentence patterns meeting the specified standard from the plurality of general sentence patterns as standard sentence patterns.
  3. The method of claim 2, wherein said screening out a general sentence pattern meeting a specified standard from said plurality of general sentence patterns as a standard sentence pattern comprises:
    and screening out, from the plurality of general sentence patterns, the general sentence patterns whose sentence pattern inclusion relation with the other general sentence patterns satisfies a first specified standard and whose sentence pattern complexity satisfies a second specified standard as standard sentence patterns.
  4. The method of claim 2, wherein said filtering the plurality of general sentence patterns based on the sentence pattern inclusion relationship between the plurality of general sentence patterns and the sentence pattern complexity of each general sentence pattern, and screening out the general sentence patterns meeting a specified standard from the plurality of general sentence patterns as standard sentence patterns comprises:
    acquiring a graph in-degree of each general sentence pattern in the plurality of general sentence patterns based on the sentence pattern inclusion relation among the plurality of general sentence patterns;
    and filtering the plurality of general sentence patterns based on the graph in-degree of each general sentence pattern and the sentence pattern complexity of each general sentence pattern, and screening out the general sentence patterns meeting the specified standard from the plurality of general sentence patterns as standard sentence patterns.
  5. The method of claim 4, wherein said screening out a general sentence pattern meeting a specified standard from said plurality of general sentence patterns as a standard sentence pattern comprises:
    screening out, from the plurality of general sentence patterns, the general sentence patterns whose graph in-degree meets a third specified standard and whose sentence pattern complexity meets a second specified standard as standard sentence patterns.
  6. The method of claim 5, wherein said screening out, from the plurality of general sentence patterns, the general sentence patterns whose graph in-degree meets a third specified standard and whose sentence pattern complexity meets a second specified standard as standard sentence patterns comprises:
    screening out, from the plurality of general sentence patterns, the general sentence patterns whose graph in-degree is greater than a specified in-degree and whose sentence pattern complexity is greater than a specified complexity as standard sentence patterns.
  7. The method of any of claims 2-6, wherein said obtaining the sentence pattern complexity of each of the plurality of general sentence patterns comprises:
    obtaining, based on the formula shown in
    Figure PCTCN2020084769-APPB-100001
    the sentence pattern complexity of each of the plurality of general sentence patterns, where n denotes the number of times the general sentence pattern is divided and t denotes the number of words of each divided segment in the general sentence pattern.
  8. The method according to any one of claims 1 to 7, wherein said performing a double-sequence comparison on said plurality of corpora to be mined to obtain a plurality of general sentence patterns corresponding to said plurality of corpora to be mined comprises:
    acquiring the sequence type of each corpus to be mined in the plurality of corpora to be mined;
    determining a processing mode of performing double-sequence comparison on the plurality of linguistic data to be mined based on the sequence type of each linguistic data to be mined;
    and performing double-sequence comparison on the plurality of linguistic data to be mined based on the processing mode to obtain a plurality of general sentence patterns corresponding to the plurality of linguistic data to be mined.
  9. The method according to claim 8, wherein the determining a processing manner for performing a double sequence alignment on the corpora to be mined based on the sequence type of each corpus to be mined comprises:
    and determining a processing mode for performing double-sequence comparison on the plurality of linguistic data to be mined from the global comparison and the local comparison based on the sequence type of each linguistic data to be mined.
  10. The method according to claim 9, wherein the determining a processing manner for performing a double sequence alignment on the corpora to be mined from the global alignment and the local alignment based on the sequence type of each corpus to be mined comprises:
    acquiring sequence similarity among the plurality of linguistic data to be mined based on the sequence type of each linguistic data to be mined;
    and determining a processing mode for performing double-sequence comparison on the plurality of linguistic data to be mined from the global comparison and the local comparison based on the sequence similarity among the plurality of linguistic data to be mined.
  11. The method according to claim 10, wherein the determining a processing manner for performing a double-sequence comparison on the plurality of corpora to be mined from the global comparison and the local comparison based on the sequence similarity between the plurality of corpora to be mined comprises:
    when the sequence similarity among the corpora to be mined is greater than the specified similarity, determining the global comparison as a processing mode of performing double-sequence comparison on the corpora to be mined;
    and when the sequence similarity among the corpora to be mined is not greater than the specified similarity, determining the local comparison as a processing mode of performing double-sequence comparison on the corpora to be mined.
  12. The method of any one of claims 9 to 11, wherein the global alignment comprises a Needleman-Wunsch algorithm.
  13. The method of any one of claims 9 to 12, wherein the local alignment comprises the Smith-Waterman algorithm.
  14. The method according to any one of claims 1 to 13, wherein after said filtering said plurality of general sentence patterns and screening out a general sentence pattern meeting a specified standard from said plurality of general sentence patterns as a standard sentence pattern, the method further comprises:
    outputting the standard sentence pattern.
  15. The method of claim 14, wherein outputting the standard sentence pattern comprises:
    when the standard sentence pattern is a query sentence pattern, acquiring a standard reply sentence pattern based on the standard sentence pattern;
    and outputting the standard sentence pattern and the standard reply sentence pattern.
  16. The method according to any one of claims 1 to 15, wherein before said acquiring a plurality of corpora to be mined, the method further comprises:
    acquiring a training data set, wherein the training data set comprises a plurality of corpora and a plurality of standard sentence patterns;
    and based on the training data set, taking each corpus as input data and each standard sentence pattern as output data, and training through a machine learning algorithm to obtain a trained sentence pattern mining model.
  17. A sentence pattern mining apparatus, comprising:
    a corpus acquiring module, used for acquiring a plurality of corpora to be mined;
    the general sentence pattern obtaining module is used for carrying out double-sequence comparison on the plurality of linguistic data to be mined to obtain a plurality of general sentence patterns corresponding to the plurality of linguistic data to be mined;
    and the standard sentence pattern obtaining module is used for filtering the plurality of general sentence patterns and screening out the general sentence patterns meeting the specified standard from the plurality of general sentence patterns as the standard sentence patterns.
  18. The apparatus of claim 17, wherein the standard sentence pattern obtaining module comprises:
    the information acquisition submodule is used for acquiring sentence pattern inclusion relations among the plurality of general sentence patterns and acquiring the sentence pattern complexity of each general sentence pattern in the plurality of general sentence patterns;
    and the standard sentence pattern obtaining submodule is used for filtering the plurality of general sentence patterns based on the sentence pattern inclusion relationship among the plurality of general sentence patterns and the sentence pattern complexity of each general sentence pattern, and screening out the general sentence patterns meeting the specified standard from the plurality of general sentence patterns as the standard sentence patterns.
  19. An electronic device, comprising a memory and a processor, the memory being coupled to the processor and storing instructions which, when executed by the processor, cause the processor to perform the method of any one of claims 1 to 16.
  20. A computer-readable storage medium having program code stored therein, the program code being invoked by a processor to perform the method of any one of claims 1 to 16.
CN202080094177.6A 2020-04-14 2020-04-14 Sentence pattern mining method, sentence pattern mining device, electronic equipment and storage medium Pending CN115039105A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/084769 WO2021207939A1 (en) 2020-04-14 2020-04-14 Sentence pattern mining method and apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
CN115039105A true CN115039105A (en) 2022-09-09

Family

ID=78083707

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080094177.6A Pending CN115039105A (en) 2020-04-14 2020-04-14 Sentence pattern mining method, sentence pattern mining device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN115039105A (en)
WO (1) WO2021207939A1 (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101221558A (en) * 2008-01-22 2008-07-16 安徽科大讯飞信息科技股份有限公司 Method for automatically extracting sentence template
CN107038163A (en) * 2016-02-03 2017-08-11 常州普适信息科技有限公司 A kind of text semantic modeling method towards magnanimity internet information
CN106649294B (en) * 2016-12-29 2020-11-06 北京奇虎科技有限公司 Classification model training and clause recognition method and device

Also Published As

Publication number Publication date
WO2021207939A1 (en) 2021-10-21

Similar Documents

Publication Publication Date Title
CN111753060B (en) Information retrieval method, apparatus, device and computer readable storage medium
CN107797984B (en) Intelligent interaction method, equipment and storage medium
WO2021169842A1 (en) Method and apparatus for updating data, electronic device, and computer readable storage medium
CN111274797A (en) Intention recognition method, device and equipment for terminal and storage medium
CN112528637A (en) Text processing model training method and device, computer equipment and storage medium
CN116501960B (en) Content retrieval method, device, equipment and medium
US20230244862A1 (en) Form processing method and apparatus, device, and storage medium
CN112579733A (en) Rule matching method, rule matching device, storage medium and electronic equipment
CN116821301A (en) Knowledge graph-based problem response method, device, medium and computer equipment
CN115795030A (en) Text classification method and device, computer equipment and storage medium
CN115730597A (en) Multi-level semantic intention recognition method and related equipment thereof
CN113392205A (en) User portrait construction method, device and equipment and storage medium
CN117493491A (en) Natural language processing method and system based on machine learning
CN113515593A (en) Topic detection method and device based on clustering model and computer equipment
CN112307754A (en) Statement acquisition method and device
CN116680401A (en) Document processing method, document processing device, apparatus and storage medium
CN113869049B (en) Fact extraction method and device with legal attribute based on legal consultation problem
CN116090450A (en) Text processing method and computing device
CN115039105A (en) Sentence pattern mining method, sentence pattern mining device, electronic equipment and storage medium
CN114461749A (en) Data processing method and device for conversation content, electronic equipment and medium
CN113076468A (en) Nested event extraction method based on domain pre-training
CN112632229A (en) Text clustering method and device
CN117236347B (en) Interactive text translation method, interactive text display method and related device
CN114117034B (en) Method and device for pushing texts of different styles based on intelligent model
CN114548083B (en) Title generation method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20230621

Address after: 1301, Office Building T2, Qianhai China Resources Financial Center, No. 55 Guiwan 4th Road, Nanshan Street, Qianhai Shenzhen-Hong Kong Cooperation Zone, Shenzhen, Guangdong Province, 518035

Applicant after: Shenzhen Hefei Technology Co.,Ltd.

Address before: 518052 2501, office building T2, Qianhai China Resources Financial Center, 55 guiwan 4th Road, Nanshan street, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen City, Guangdong Province

Applicant before: Shenzhen Huantai Digital Technology Co.,Ltd.
