CN113392631A - Corpus expansion method and related device - Google Patents

Corpus expansion method and related device

Info

Publication number
CN113392631A
CN113392631A (application CN202011393376.5A)
Authority
CN
China
Prior art keywords
corpus
component
determining
target
expansion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202011393376.5A
Other languages
Chinese (zh)
Other versions
CN113392631B (en)
Inventor
周辉阳
闫昭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202011393376.5A, granted as CN113392631B
Priority claimed from CN202011393376.5A
Publication of CN113392631A
Application granted
Publication of CN113392631B
Active legal status
Anticipated expiration legal status

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F40/00 Handling natural language data
                    • G06F40/20 Natural language analysis
                        • G06F40/205 Parsing
                            • G06F40/216 Parsing using statistical methods
                    • G06F40/30 Semantic analysis

Abstract

The application discloses a corpus expansion method and a related apparatus, applied to artificial-intelligence natural language processing. A target corpus is obtained; a first-granularity component and a second-granularity component are then extracted from the target corpus; structural corpora associated with the first-granularity component are determined to obtain a structural corpus set; object corpora associated with the second-granularity component are determined to obtain an object corpus set; the corpora in the structural corpus set and the corpora in the object corpus set are then combined to obtain an expanded corpus set. This realizes an efficient expansion process for associated corpora: corpus components are analyzed from the compositional structure of the corpus and decomposed at different granularities, so that similar components can be substituted and recombined, improving both the efficiency and the accuracy of corpus expansion.

Description

Corpus expansion method and related device
Technical Field
The present application relates to the field of computer technologies, and in particular, to a corpus expansion method and a related apparatus.
Background
Natural Language Processing (NLP) is an important direction in computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics; research in this field involves natural language, that is, the language people use every day, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robot question answering, and knowledge graphs. Natural language processing requires the support of a large amount of corpus data; for example, a robot question-answering task requires a large number of question-answer pairs to ensure that questions can be answered normally.
Corpora in a given field are generally expanded by manual input. For example, in a robot question-answering task, questions that the robot cannot answer frequently arise, and the corpus then needs to be supplemented by manually entering question-answer pairs.
In practice, however, the data volume is huge, so manual expansion has a long cycle and a heavy development burden, which limits the efficiency of corpus expansion.
Disclosure of Invention
In view of this, the present application provides a corpus expansion method that can effectively improve the efficiency and accuracy of corpus expansion.
A first aspect of the present application provides a corpus expansion method, which may be applied to a system or program with a corpus expansion function in a terminal device, and which specifically includes:
acquiring a target corpus;
extracting a first granularity component and a second granularity component in the target corpus, wherein the first granularity component is used for indicating a corpus structure of the target corpus, the second granularity component is used for indicating a corpus object of the target corpus, and the combination of the corpus structure and the corpus object is the target corpus;
determining a structural corpus associated with the first granular component to obtain a structural corpus set;
determining an object corpus associated with the second granularity component to obtain an object corpus set;
and combining the corpora in the structural corpus set and the corpora in the object corpus set to obtain an expanded corpus set.
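The five steps above amount to: decompose the target corpus, look up similar pieces for each granularity, then take the cross product. A minimal Python sketch, in which the lookup tables, the `X` placeholder, and all function names are illustrative assumptions rather than anything fixed by the patent:

```python
from itertools import product

# Toy stand-ins for the mined mapping tables (assumptions for illustration).
STRUCTURE_SYNONYMS = {"how to get to X": ["how do I reach X", "what is the way to X"]}
OBJECT_SYNONYMS = {"the classroom": ["the lecture hall", "room 101"]}

def expand(structure: str, obj: str) -> list:
    """Pair every similar structure with every similar object (steps 3-5)."""
    structures = [structure] + STRUCTURE_SYNONYMS.get(structure, [])
    objects = [obj] + OBJECT_SYNONYMS.get(obj, [])
    return [s.replace("X", o) for s, o in product(structures, objects)]

expanded = expand("how to get to X", "the classroom")
```

With three structure variants and three object variants this yields nine expanded corpora, illustrating how combination multiplies a single input corpus.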
Optionally, in some possible implementation manners of the present application, the extracting a first granular component and a second granular component in the target corpus includes:
inputting the target corpus into a granularity classifier to obtain a classification label sequence of the target corpus;
and determining a classification result of each component in the target corpus based on the classification tag sequence to obtain the first granularity component and the second granularity component.
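As a sketch of how a per-token classification tag sequence separates the two granularities, assuming a two-label scheme (`S` for structure, `O` for object) that the patent itself does not fix:

```python
def split_by_tags(tokens, tags):
    """Group tokens into a structure component ('S') and an object component ('O')."""
    structure = [t for t, g in zip(tokens, tags) if g == "S"]
    objects = [t for t, g in zip(tokens, tags) if g == "O"]
    return " ".join(structure), " ".join(objects)

# Hypothetical classifier output for one target corpus.
tokens = ["how", "to", "get", "to", "the", "classroom"]
tags = ["S", "S", "S", "S", "O", "O"]
structure, obj = split_by_tags(tokens, tags)
```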
Optionally, in some possible implementations of the present application, the method further includes:
acquiring training data containing similar sentence pairs;
performing word alignment on the similar sentence pairs to obtain an alignment result;
counting word frequency information in the alignment result;
determining training object components and training structure components in the similar sentence pairs based on the word frequency information;
and carrying out classification labeling on the training object components and the training structure components so as to train the granularity classifier.
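The intuition behind the word-frequency step is that words shared across many similar sentence pairs tend to belong to the structure, while words that vary between the paired sentences are object candidates. A toy frequency count (whitespace tokenization and the sample pairs are assumptions):

```python
from collections import Counter

def word_frequencies(similar_pairs):
    """Count word occurrences across aligned similar sentence pairs."""
    freq = Counter()
    for a, b in similar_pairs:
        freq.update(a.split())
        freq.update(b.split())
    return freq

pairs = [
    ("how to get to the classroom", "how to get to the library"),
    ("how to get to the canteen", "how to get to the gym"),
]
freq = word_frequencies(pairs)
```

Here the invariant words ("how", "to", "get", "the") dominate the counts, while each location word occurs only once.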
Optionally, in some possible implementation manners of the present application, the determining, based on the word frequency information, a training object component and a training structure component in the similar sentence pair includes:
acquiring a preset hyper-parameter;
determining quantity information of the training object components based on the preset hyper-parameter;
determining the training object components based on the word frequency information and the quantity information;
and taking the part of the similar sentence pair except the training object component as the training structure component.
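A sketch of this selection rule: the preset hyper-parameter fixes how many object components to take, and the lowest-frequency words are chosen as object components (the tie-breaking and tokenization here are assumptions):

```python
from collections import Counter

def split_components(sentence, freq, k):
    """Pick the k lowest-frequency words as object components; the rest is the structure."""
    words = sentence.split()
    object_words = set(sorted(set(words), key=lambda w: freq[w])[:k])
    objects = [w for w in words if w in object_words]
    structure = [w for w in words if w not in object_words]
    return objects, structure

# Frequencies as might be counted from similar sentence pairs (toy values).
freq = Counter({"how": 4, "to": 8, "get": 4, "the": 4, "classroom": 1})
objects, structure = split_components("how to get to the classroom", freq, k=1)
```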
Optionally, in some possible implementations of the present application, the determining the structural corpus associated with the first granular component to obtain a structural corpus set includes:
acquiring a preset question-answer pair of at least one information source;
using each corpus in the preset question-answer pairs as a template corpus to determine corresponding similar corpora;
determining a first mapping table based on the template corpus and the similar corpus;
and determining the structural corpora associated with the first granularity component according to the first mapping table, to obtain the structural corpus set.
Optionally, in some possible implementation manners of the present application, the determining, by using each corpus in the preset question-answer pair as a template corpus, a corresponding similar corpus includes:
determining object components and structural components corresponding to each corpus in the preset question-answer pair;
converting the object components into target identifiers to obtain the template corpus;
and determining the corresponding similar corpus based on the template corpus.
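One way to read these three steps as code, where the placeholder token `<OBJ>` and the toy similar template are assumptions:

```python
def make_template(corpus, object_component, marker="<OBJ>"):
    """Replace the object component with a target identifier, yielding a template corpus."""
    return corpus.replace(object_component, marker)

# A similar corpus found for the template is recorded in the first mapping table (toy data).
template = make_template("how to get to the classroom", "the classroom")
similar_template = "what is the way to <OBJ>"
first_mapping = {template: [similar_template]}
```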
Optionally, in some possible implementation manners of the present application, the determining the object corpus associated with the second granularity component to obtain an object corpus set includes:
acquiring preset semantic data;
extracting similar object pairs in the preset semantic data to establish a second mapping table;
and determining the object corpus associated with the second granularity component based on the second mapping table to obtain the object corpus set.
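A minimal second mapping table built from similar object pairs; the symmetric insertion is an assumption, since the patent only requires that object corpora associated with a component be retrievable from the table:

```python
def build_object_table(similar_object_pairs):
    """Index each object to the set of objects it is paired with."""
    table = {}
    for a, b in similar_object_pairs:
        table.setdefault(a, set()).add(b)
        table.setdefault(b, set()).add(a)
    return table

table = build_object_table([("classroom", "lecture hall"), ("classroom", "schoolroom")])
```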
Optionally, in some possible implementations of the present application, the extracting similar object pairs in the preset semantic data to establish a second mapping table includes:
extracting the similar object pairs in the preset semantic data;
converting the similar object pair based on the language category to obtain a converted object pair, wherein the data volume corresponding to the converted object pair is larger than that corresponding to the similar object pair;
and establishing the second mapping table according to the conversion object pair.
Optionally, in some possible implementations of the present application, the combining the corpora in the structural corpus set and the corpora in the object corpus set to obtain an expanded corpus set includes:
determining a structural expansion item based on the corpora in the structural corpus set;
determining at least one object expansion item based on the corpora in the object corpus set;
and arranging and combining the structure expansion items and the object expansion items to obtain the expansion corpus set.
Optionally, in some possible implementations of the present application, the method further includes:
inputting the extended corpus set into a preset scoring model to obtain an extended corpus score;
and screening the expanded corpora based on a preset threshold applied to the scores, so as to update the expanded corpus set.
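A sketch of the screening step; the scoring model is replaced here by a trivial stand-in, and both the length-based scorer and the 0.5 threshold are assumptions:

```python
def screen(candidates, score_fn, threshold=0.5):
    """Keep only expanded corpora whose score reaches the threshold."""
    return [c for c in candidates if score_fn(c) >= threshold]

# Stand-in for the preset scoring model (assumption): more words, higher score.
toy_score = lambda c: min(len(c.split()) / 5.0, 1.0)
kept = screen(["ok?", "what is the way to the lecture hall"], toy_score)
```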
Optionally, in some possible implementations of the present application, the method further includes:
determining reply information corresponding to the target corpus;
and associating the reply information with each item in the expanded corpus set to obtain question-answer pair information, wherein the question-answer pair information is used to feed back a reply corpus in response to an input question corpus.
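The association step can be as simple as fanning one reply out over every expanded question; the dictionary representation and the sample strings are assumptions:

```python
def build_qa_pairs(expanded_questions, reply):
    """Associate the reply of the target corpus with every expanded question."""
    return {q: reply for q in expanded_questions}

qa = build_qa_pairs(
    ["how do I reach the classroom", "what is the way to the classroom"],
    "Take the east stairs to the second floor.",
)
```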
Optionally, in some possible implementations of the present application, the first granularity component is a phrase-granularity component, the second granularity component is a sentence-granularity component, the target corpus is question information corresponding to a question-answering task in machine question answering, and the expanded corpus set is used to expand the question information.
The second aspect of the present application provides a corpus expanding device, including:
the acquisition unit is used for acquiring the target corpus;
an extracting unit, configured to extract a first granular component and a second granular component in the target corpus, where the first granular component is used to indicate a corpus structure of the target corpus, the second granular component is used to indicate a corpus object of the target corpus, and a combination of the corpus structure and the corpus object is the target corpus;
a determining unit, configured to determine a structural corpus associated with the first granular component to obtain a structural corpus set;
the determining unit is further configured to determine an object corpus associated with the second granularity component to obtain an object corpus set;
and the expansion unit is used for combining the corpora in the structure corpus set and the corpora in the object corpus set to obtain an expanded corpus set.
Optionally, in some possible implementation manners of the present application, the extracting unit is specifically configured to input the target corpus into a granularity classifier to obtain a classification tag sequence of the target corpus;
the extracting unit is specifically configured to determine a classification result of each component in the target corpus based on the classification tag sequence to obtain the first granular component and the second granular component.
Optionally, in some possible implementations of the present application, the extracting unit is specifically configured to obtain training data including similar sentence pairs;
the extracting unit is specifically configured to perform word alignment on the similar sentence pairs to obtain an alignment result;
the extraction unit is specifically configured to count word frequency information in the alignment result;
the extracting unit is specifically configured to determine a training object component and a training structure component in the similar sentence pair based on the word frequency information;
the extracting unit is specifically configured to perform classification labeling on the training object components and the training structure components to train the granularity classifier.
Optionally, in some possible implementation manners of the present application, the determining unit is specifically configured to obtain a preset hyper-parameter;
the determining unit is specifically configured to determine quantity information of the training object components based on the preset hyper-parameter;
the determining unit is specifically configured to determine the training object component based on the word frequency information and the quantity information;
the determining unit is specifically configured to use a part of the similar sentence pair excluding the training object component as the training structure component.
Optionally, in some possible implementation manners of the present application, the determining unit is specifically configured to obtain a preset question-answer pair of at least one information source;
the determining unit is specifically configured to use each corpus in the preset question-answer pair as a template corpus to determine corresponding similar corpuses;
the determining unit is specifically configured to determine a first mapping table based on the template corpus and the similar corpus;
the determining unit is specifically configured to determine, according to the first mapping table, a structural corpus associated with the first granular component, so as to obtain the structural corpus set.
Optionally, in some possible implementation manners of the present application, the determining unit is specifically configured to determine an object component and a structure component corresponding to each corpus in the preset question-answer pair;
the determining unit is specifically configured to convert the object component into a target identifier, so as to obtain the template corpus;
the determining unit is specifically configured to determine the corresponding similar corpus based on the template corpus.
Optionally, in some possible implementation manners of the present application, the determining unit is specifically configured to obtain preset semantic data;
the determining unit is specifically configured to extract a similar object pair in the preset semantic data to establish a second mapping table;
the determining unit is specifically configured to determine, based on the second mapping table, an object corpus associated with the second granular component, so as to obtain the object corpus set.
Optionally, in some possible implementation manners of the present application, the determining unit is specifically configured to extract the similar object pair in the preset semantic data;
the determining unit is specifically configured to convert the similar object pair based on a language category to obtain a conversion object pair, where a data amount corresponding to the conversion object pair is greater than a data amount corresponding to the similar object pair;
the determining unit is specifically configured to establish the second mapping table according to the conversion object pair.
Optionally, in some possible implementation manners of the present application, the extension unit is specifically configured to determine a structural extension item based on the corpus in the structural corpus set;
the extension unit is specifically configured to determine at least one object extension item based on the corpus in the object corpus set;
the extension unit is specifically configured to perform permutation and combination on the structure extension item and the object extension item to obtain the extension corpus set.
Optionally, in some possible implementation manners of the present application, the extension unit is specifically configured to input the extended corpus set into a preset scoring model to obtain an extended corpus score;
the extension unit is specifically configured to screen the extension corpus based on a preset threshold value, so as to update the extension corpus set.
Optionally, in some possible implementation manners of the present application, the extension unit is specifically configured to determine reply information corresponding to the target corpus;
the extension unit is specifically configured to associate the reply information with each item in the extension corpus set to obtain question-answer pair information, where the question-answer pair information is used to respond to an input feedback reply corpus of the question corpus.
A third aspect of the present application provides a computer device comprising: a memory, a processor, and a bus system; the memory is used for storing program codes; the processor is configured to execute the corpus expansion method according to any one of the first aspect or the first aspect according to an instruction in the program code.
A fourth aspect of the present application provides a computer-readable storage medium, having stored therein instructions, which, when run on a computer, cause the computer to perform the method for corpus expansion according to any one of the first aspect or the first aspect.
According to an aspect of the application, a computer program product or computer program is provided, comprising computer instructions, the computer instructions being stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the computer device performs the method for corpus expansion provided in the first aspect or the various alternative implementations of the first aspect.
According to the technical scheme, the embodiment of the application has the following advantages:
A target corpus is obtained; a first-granularity component and a second-granularity component are then extracted from it, where the first-granularity component indicates the corpus structure of the target corpus, the second-granularity component indicates the corpus object of the target corpus, and the combination of the corpus structure and the corpus object is the target corpus; structural corpora associated with the first-granularity component are determined to obtain a structural corpus set; object corpora associated with the second-granularity component are further determined to obtain an object corpus set; and the corpora in the structural corpus set and the corpora in the object corpus set are then combined to obtain an expanded corpus set. This realizes an efficient expansion process for associated corpora: corpus components are analyzed from the compositional structure of the corpus and decomposed at different granularities, so that similar components can be substituted and recombined, improving both the efficiency and the accuracy of corpus expansion.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show only embodiments of the present application; for those skilled in the art, other drawings can be obtained from the provided drawings without creative effort.
FIG. 1 is a network architecture diagram showing the operation of a corpus expansion system;
FIG. 2 is a flowchart of corpus expansion according to an embodiment of the present application;
FIG. 3 is a flowchart of a corpus expansion method according to an embodiment of the present application;
fig. 4 is a schematic scene diagram of a corpus expansion method according to an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a scenario of another corpus expansion method according to an embodiment of the present application;
FIG. 6 is a schematic diagram illustrating a scenario of another corpus expansion method according to an embodiment of the present application;
FIG. 7 is a flowchart of another corpus expansion method according to an embodiment of the present application;
FIG. 8 is a schematic diagram illustrating a scenario of another corpus expansion method according to an embodiment of the present application;
FIG. 9 is a flowchart of another corpus expansion method according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a corpus expansion apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a terminal device according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The embodiments of the present application provide a corpus expansion method and a related apparatus, which can be applied to a system or program with a corpus expansion function in a terminal device. A target corpus is obtained; a first-granularity component and a second-granularity component are then extracted from it, where the first-granularity component indicates the corpus structure of the target corpus, the second-granularity component indicates the corpus object of the target corpus, and the combination of the corpus structure and the corpus object is the target corpus; structural corpora associated with the first-granularity component are determined to obtain a structural corpus set; object corpora associated with the second-granularity component are further determined to obtain an object corpus set; and the corpora in the two sets are then combined to obtain an expanded corpus set. This realizes an efficient expansion process for associated corpora: corpus components are analyzed from the compositional structure of the corpus and decomposed at different granularities, so that similar components can be substituted and recombined, improving both the efficiency and the accuracy of corpus expansion.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are, for example, capable of operation in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "corresponding" and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
First, some nouns that may appear in the embodiments of the present application are explained.
A semantic classifier: the predicted linguistic data learned by deep learning belongs to a classifier of a certain field and intention.
It should be understood that the corpus expansion method provided by the present application may be applied to a system or program with a corpus expansion function in a terminal device, for example an intelligent assistant. Specifically, the corpus expansion system may operate in the network architecture shown in fig. 1, which is a network architecture diagram of the corpus expansion system. As the figure shows, the system can provide corpus expansion from multiple information sources: corpora input on the terminal side are expanded by the service, and each expanded corpus is associated with response information, thereby implementing an intelligent question-answering process. It can be understood that fig. 1 shows several kinds of terminal devices, which may be computer devices; in an actual scenario, more or fewer types of terminal devices may participate in corpus expansion, the specific number and types depending on the actual scenario and not being limited here. In addition, fig. 1 shows one server, but multiple servers may also participate in an actual scenario, the specific number of servers depending on the actual scenario.
In this embodiment, the server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smartphone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, or a smart watch. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and may be connected to form a blockchain network, which is not limited here.
It is understood that the corpus expansion system described above may run on a personal mobile terminal, for example as an application such as an intelligent assistant; it may also run on a server, or on a third-party device, to provide corpus expansion and obtain the corpus expansion result for an information source. The system may run on the above devices in the form of a program, as a system component, or as one of several cloud service programs; the specific operation mode depends on the actual scenario and is not limited here.
To address the above problem, the present application provides a corpus expansion method applied to the corpus expansion flow framework shown in fig. 2. In the framework provided in this embodiment of the present application, a user inputs a target corpus through an interactive operation on a terminal; the server divides the target corpus by granularity based on similar sentence pairs to obtain coarse-granularity and fine-granularity corpus components, expands each kind of component separately, and then permutes and combines the expanded components to obtain multiple corpora similar to the target corpus, which the server can invoke in response to terminal input.
It can be understood that the method provided by the present application may be implemented as a program constituting the processing logic of a hardware system, or as a corpus expansion apparatus implementing that processing logic in an integrated or external manner. As one implementation, the corpus expansion apparatus obtains a target corpus; it then extracts a first-granularity component and a second-granularity component from the target corpus, where the first-granularity component indicates the corpus structure of the target corpus, the second-granularity component indicates the corpus object of the target corpus, and the combination of the corpus structure and the corpus object is the target corpus; it determines structural corpora associated with the first-granularity component to obtain a structural corpus set; it further determines object corpora associated with the second-granularity component to obtain an object corpus set; and it then combines the corpora in the two sets to obtain an expanded corpus set. This realizes an efficient expansion process for associated corpora and improves both the efficiency and the accuracy of corpus expansion.
The scheme provided by the embodiment of the application relates to an artificial intelligence natural language processing technology, and is specifically explained by the following embodiment:
With reference to the above flow architecture, the corpus expansion method of the present application is described below. Please refer to fig. 3, where fig. 3 is a flowchart of a corpus expansion method provided in an embodiment of the present application. The method may be executed by a terminal device, by a server, or by the terminal device and the server together, and the embodiment of the present application at least includes the following steps:
301. acquiring a target corpus;
In this embodiment, the target corpus is the corpus to be expanded and has an associated answer corpus; after the target corpus is expanded, a correspondence needs to be established between the expanded corpora and the associated answer corpus, so as to facilitate invocation of the answer corpus.
In a possible scenario, the target corpus may be obtained in response to the triggering of a specific answer. For example, in a robot question-and-answer scenario, if the robot receives a question for which no answer is configured, a fixed answer mode may be adopted. As shown in fig. 4, fig. 4 is a scene schematic diagram of a corpus expansion method according to an embodiment of the present application. The user enters the target corpus A1 "how can the student not go to the classroom?"; since the robot does not store an answer mode corresponding to this question, the robot responds with the fixed answer A2. At this time, in response to the issuance of the fixed answer, the corpus that triggered the fixed answer may be selected as the target corpus.
In another possible scenario, the target corpus may also be triggered in response to a manual upload by the user during the interaction of the robot with the user, such as the user being dissatisfied with the response. As shown in fig. 5, fig. 5 is a schematic view of a scene of another corpus expansion method according to an embodiment of the present application; the figure shows that the user is dissatisfied with the response information B1 "please visit the classroom official website query" of the robot, so that the user can press the dialog box for a long time to make a complaint with the "unsolved problem" B2, and in response to the generation of the complaint, select the problem corpus corresponding to the complaint as the target corpus. The specific complaint mode depends on the actual scene, and is not limited herein.
It can be understood that the above-mentioned scenarios for obtaining the target corpus are only examples; any specific natural language processing scenario applied to corpora may employ the method provided in the present application.
302. And extracting a first granularity component and a second granularity component in the target corpus.
In this embodiment, the first granularity component is used to indicate a corpus structure of the target corpus, the second granularity component is used to indicate a corpus object of the target corpus, and the corpus structure and the corpus object combine into the target corpus. The corpus structure is the part of the target corpus without substantive meaning, for example: prepositions, auxiliary words, and the like; the corpus object is an entity included in the target corpus, for example: a noun. In "Beijing to Chengdu logistics company", "Beijing" and "Chengdu" are corpus objects, i.e., second granularity components, and the rest is the corpus structure, i.e., the first granularity component.
It can be understood that the target corpus can be obtained by combining the first granularity component and the second granularity component, that is, the dimension of expanding the corpus in this embodiment is performed based on the corpus as a whole, and compared with the single replacement of the similar meaning word, the corpus has a more complete structure and a more similar semantic meaning.
In one possible scenario, the first granularity component may be a sentence component, i.e., a coarse-grained component, and the second granularity component may be a phrase component, i.e., a fine-grained component. The specific component division rule depends on the specific syntactic structure; any division mode satisfying the above granularity relationship and component relationship is applicable to the present embodiment, and is not limited herein.
Optionally, the process of extracting the first granularity component and the second granularity component in the target corpus may be performed based on a granularity classifier, that is, the target corpus is input into the granularity classifier to obtain a classification tag sequence of the target corpus; and then determining the classification result of each component in the target corpus based on the classification label sequence to obtain a first granularity component and a second granularity component. The granularity classifier can rapidly identify classification labels (a first granularity component and a second granularity component) corresponding to all granularity components of the target corpus, so that further expansion operation is carried out, and the efficiency and the accuracy of component division are improved.
Specifically, the granularity classifier may be obtained by training on training data containing similar sentence pairs; that is, the granularity classifier learns the granularity component characteristics of the corpora in the similar sentence pairs, thereby improving the classification accuracy. First, training data containing similar sentence pairs is acquired; then word alignment is performed on the similar sentence pairs to obtain an alignment result; word frequency information in the alignment result is counted; training object components and training structure components in the similar sentence pairs are determined based on the word frequency information; and the training object components and the training structure components are classified and labeled so as to train the granularity classifier. The classification accuracy of the granularity classifier is thereby ensured.
Optionally, a hyper-parameter, i.e., the number of training object components that exist at most in one sentence, may be set for the training object components. Specifically, a preset hyper-parameter is obtained firstly; then determining quantity information of the training object components based on the preset hyper-parameters; determining a training object component based on the word frequency information and the quantity information; and then the part of the similar sentence pair except the training object component is used as a training structure component. Thereby ensuring the accuracy of the ingredient division.
In one possible scenario, for the similar sentence pair "Beijing to Chengdu logistics company" and "Beijing to Chengdu freight company", word alignment yields: Beijing-Beijing; to-to; Chengdu-Chengdu; logistics-freight; company-company. Word frequency statistics on the word alignment result then give: (Chengdu+Chengdu)/2 < (Beijing+Beijing)/2 < (logistics+freight)/2 < (company+company)/2. Since the set hyper-parameter specifies 2 training object components per sentence, the two lowest-frequency aligned pairs, Beijing and Chengdu, are taken as training object components, and the remaining higher-frequency words belong to the training structure components. The classification labeling of the above example is therefore: [ "Beijing to Chengdu logistics company": [0,0,1,0,0,1,1,1,1,1]; "Beijing to Chengdu freight company": [0,0,1,0,0,1,1,1,1,1] ], which is input into the granularity classifier for training.
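The labeling procedure above can be sketched as follows — a minimal illustration assuming the word alignment and corpus word frequencies are already available; the function name `label_pair`, the `freq` table, and the hyper-parameter name `k` are hypothetical, not part of the original scheme:

```python
def label_pair(tokens_a, tokens_b, alignment, freq, k=2):
    """Label each character of sentence A: 0 = object component, 1 = structure.

    tokens_a / tokens_b: token lists of the two similar sentences;
    alignment: list of (i, j) token-index pairs from a word aligner;
    freq: corpus frequency of each token;
    k: hyper-parameter, the maximum number of object components per sentence.
    """
    # rank aligned pairs by the average frequency of the two aligned tokens
    scored = sorted(
        alignment,
        key=lambda p: (freq[tokens_a[p[0]]] + freq[tokens_b[p[1]]]) / 2,
    )
    # the k lowest-frequency pairs are taken as object components
    object_idx = {i for i, _ in scored[:k]}
    labels = []
    for i, tok in enumerate(tokens_a):
        labels.extend([0 if i in object_idx else 1] * len(tok))
    return labels
```

With hypothetical frequencies in which 成都 (Chengdu) and 北京 (Beijing) are the two rarest tokens, the function reproduces the [0,0,1,0,0,1,1,1,1,1] labeling of the example.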
303. Determining a structural corpus associated with the first granular component to obtain a set of structural corpora.
In this embodiment, determining the structure corpus associated with the first granularity component is the expansion of the first granularity component. Since the first granularity component is obtained by coarse-grained division and needs whole sentence patterns for its expansion, the expansion can be performed quickly by constructing a first mapping table.
Specifically, the first mapping table may be a sentence-level mapping table, and may first obtain a preset question-answer pair of at least one information source; then, each corpus in a preset question-answer pair is used as a template corpus to determine corresponding similar corpora; determining a first mapping table based on the template corpus and the similar corpus; and determining the structural language data associated with the first granularity component according to the first mapping table to obtain a structural language data set. The information source may be a local similar statement pair, a similar statement for network retrieval, or the like, and the specific information source composition depends on the actual scene.
In a possible scenario, since a and B are similar corpora to each other and a and C are similar corpora to each other, the first mapping table is established as { "a": [ B, C ], "B": [ A, C ], "C": [ A, B ] }, namely for the first granularity component A, B and C can be called quickly, and the component expansion efficiency is improved.
Optionally, in the process of determining similar corpora, in order to avoid interference of object components, the object components may be replaced with special symbols, that is, the object components and the structural components corresponding to each corpus in a preset question-answer pair are determined first; then converting the object components into target identifiers to obtain template corpora; and then determining corresponding similar corpora based on the template corpora. Therefore, the interference of the object components to the similar corpus determining process is eliminated, and the accuracy of the first mapping table is improved.
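The template masking and first-mapping-table construction described above can be sketched as follows — a simplified illustration assuming the similar templates have already been grouped; `to_template` and `build_sentence_mapping` are hypothetical helper names:

```python
def to_template(tokens, object_indices, mask="@"):
    """Replace object components with a special symbol to form a template,
    eliminating object interference when matching similar corpora."""
    return "".join(mask if i in object_indices else tok
                   for i, tok in enumerate(tokens))

def build_sentence_mapping(similar_template_groups):
    """Build the sentence-level (first) mapping table: each template maps
    to every other template in its similarity group."""
    table = {}
    for group in similar_template_groups:
        for t in group:
            table.setdefault(t, []).extend(x for x in group if x != t)
    return table
```

For mutually similar templates A, B, C this yields { "A": [B, C], "B": [A, C], "C": [A, B] }, matching the mapping described in the embodiment.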
304. And determining the object corpora associated with the second granularity component to obtain an object corpus set.
In this embodiment, the process of determining the object corpus associated with the second granularity component is to expand the second granularity component, and the second granularity component is obtained by fine-grained division and needs to be expanded by a specific phrase object, so that rapid expansion can be performed by constructing the second mapping table.
Specifically, the second mapping table may be a phrase-level mapping table, and may first obtain preset semantic data, such as a near-synonym database; then extracting similar object pairs in the preset semantic data to establish a second mapping table; and then determining the object corpus associated with the second granularity component based on the second mapping table to obtain an object corpus set, so that the efficiency of granularity component expansion is improved.
Optionally, because fine-grained phrases may be expressed in different language categories, further expansion may be performed based on language type dimensions. Firstly, extracting similar object pairs in preset semantic data; converting the similar object pairs based on the language category to obtain converted object pairs, wherein the data volume corresponding to the converted object pairs is larger than that corresponding to the similar object pairs, for example, similar words in English translation are extracted; and a second mapping table is established according to the conversion object pair, so that the generalization degree of component expansion is further improved.
In a possible scenario, as shown in fig. 6, fig. 6 is a schematic view of another scenario of a corpus expansion method according to an embodiment of the present application; the figure shows the process of converting similar object pairs based on language category, i.e. the process of inter-English-Chinese translation, and after words are aligned and translated into English, the translation software also gives other Chinese expression C1 corresponding to the translation, so that the translation software can expand the words. For example, phrase A gets multiple synonyms at the translation level: a1, a2, A3 … …. Phrase B can also do this, and can also get synonyms for B at multiple translation levels: b1, B2, B3 … …. Further obtaining a mapping relation: { "A": [ B, A1, A2, A3, … …, B1, B2, B3, … … ], "B": [ A, A1, A2, A3, … …, B1, B2, B3, … … ] }, thereby increasing the dimension of phrase expansion.
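The phrase-level table construction can be sketched as follows — `back_translate` stands in for the translation-software round trip (translate to another language, translate back, collect recommended synonyms) and is a hypothetical callable, not a real API:

```python
def build_phrase_mapping(synonym_pairs, back_translate):
    """Build the phrase-level (second) mapping table.

    synonym_pairs: iterable of (A, B) synonym pairs from a semantic database;
    back_translate(word): assumed to return extra same-language synonyms
    obtained via bilingual inter-translation.
    """
    table = {}
    for a, b in synonym_pairs:
        extra = back_translate(a) + back_translate(b)  # A1, A2, ..., B1, B2, ...
        table[a] = [b] + extra
        table[b] = [a] + extra
    return table
```

For a synonym pair (A, B) this yields { "A": [B, A1, A2, …, B1, B2, …], "B": [A, A1, A2, …, B1, B2, …] }, matching the mapping relation in the embodiment.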
305. And combining the corpora in the structure corpus set and the corpora in the object corpus set to obtain an expanded corpus set.
In this embodiment, if the number of corpora in the structural corpus set is m and the number of corpora in the object corpus set is n, m × n expanded corpora can be obtained by combining, and the expanded corpora are similar to the target corpora, so that the expanded corpora can be triggered for the reply of the target corpora, and the workload of reply setting is greatly reduced.
Optionally, the object corpus set may include a plurality of object corpus sets corresponding to different object corpuses, so that the structural expansion item may be determined based on the corpuses in the structural corpus set; then determining at least one object expansion item based on the corpora in the object corpus set; and then arranging and combining the structure expansion items and the object expansion items to obtain an expansion corpus set. For example, the number of the structural expansion items is m, and the object expansion items include n expansion items corresponding to the object 1 and p expansion items corresponding to the object 2, so that the obtained expansion corpus is m × n × p, and the range similar to the target corpus is further expanded.
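The permutation-and-combination step can be sketched as follows, assuming each sentence template marks its object slots with the @ symbol as described above; the function name `combine` is hypothetical:

```python
from itertools import product

def combine(templates, *phrase_sets):
    """Fill each sentence template's '@' slots with every combination of
    phrase expansions, yielding len(templates) * len(set1) * len(set2) * ...
    expanded corpora."""
    results = []
    for tpl in templates:
        for phrases in product(*phrase_sets):
            s = tpl
            for p in phrases:
                s = s.replace("@", p, 1)  # fill slots left to right
            results.append(s)
    return results
```

With m templates and two object expansion sets of sizes n and p, this produces the m × n × p expanded corpora described in the embodiment.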
Optionally, in order to ensure the reliability of the expanded corpora, a corpus scoring process may be performed: the expanded corpus set is input into a preset scoring model to obtain a score for each expanded corpus, and the expanded corpora are then screened by score against a preset threshold so as to update the expanded corpus set. The preset scoring model may adopt a BERT model, whose goal is to learn, from large-scale unlabeled corpora, text representations containing rich semantic information; the semantic representation of the text is then fine-tuned for a specific NLP task and finally applied to that NLP task. For example, if the preset threshold is 0.8, all expansion items whose score is less than 0.8 are screened out, thereby ensuring the accuracy of the expanded corpora.
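The scoring-and-screening step can be sketched as follows; `score` stands in for the BERT-based similarity model and is an assumed callable, not specified by the present application:

```python
def filter_by_score(target, candidates, score, threshold=0.8):
    """Keep only expanded corpora whose semantic-similarity score against
    the target corpus reaches the preset threshold.

    score(a, b): assumed BERT-style similarity model returning a value
    in [0, 1]; threshold: the preset screening threshold."""
    return [c for c in candidates if score(target, c) >= threshold]
```

In practice `score` would encode both sentences with the fine-tuned model and compare their semantic representations; here it is left abstract.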
It can be understood that the expanded corpus set in the embodiment may be applied to expansion of question information in a robot question-and-answer scenario, that is, first, answer information corresponding to a target corpus (question information) is determined; and then associating the reply information with each item in the expanded corpus set to obtain question-answer pair information, wherein the question-answer pair information is used for responding to the input of the next question corpus and feeding back the reply corpus, so that redundant reply association processes are avoided.
It can be understood that the above corpus expansion process can also be applied to the expansion of response information, that is, one question can correspond to different answers, so as to improve the richness of the answers, and the specific expansion object depends on the actual scene.
With the above embodiment, the target corpus is obtained; then extracting a first granularity component and a second granularity component in the target corpus, wherein the first granularity component is used for indicating the corpus structure of the target corpus, the second granularity component is used for indicating the corpus object of the target corpus, and the corpus structure and the corpus object are combined into the target corpus; determining a structural language material associated with the first granularity component to obtain a structural language material set; further determining an object corpus associated with the second granularity component to obtain an object corpus set; and then combining the corpora in the structure corpus set and the corpora in the object corpus set to obtain an expanded corpus set. Therefore, the efficient expansion process of the associated corpora is realized, and the corpora components are analyzed from the composition structure of the corpora and are decomposed based on different granularities, so that the similar components are replaced and combined, and the efficiency and the accuracy of the corpora expansion are improved.
The following description is made in conjunction with a specific scenario in which the first granularity component is a sentence-level component and the second granularity component is a phrase-level component. Referring to fig. 7, fig. 7 is a flowchart of another corpus expanding method according to an embodiment of the present application, where the embodiment of the present application at least includes the following steps:
701. and acquiring a question and answer task corpus.
In this embodiment, the question-and-answer task corpus may be unanswered question corpora collected over a period of time, question corpora output in real time, collected answer corpora, and the like; the specific corpus form is determined by the actual scene.
702. And training a granularity classifier.
In this embodiment, the training process of the granular classifier includes obtaining training data, that is, the training data is a pair of similar sentences, for example, two sentences form a sentence pair, and the meaning and intention of the two sentences are the same, for example: (Beijing to Chengdu Logistics, Beijing to Chengdu freight Corp).
Word alignment is then performed on pairs of similar sentences, which can be obtained, for example, using the open source tool berkeley aligner: Beijing-Beijing; to; Chengdu-Chengdu; logistics-freight transportation; company-company.
Further, word frequency statistics are performed on all the word alignment results of the previous step, yielding: (Chengdu+Chengdu)/2 < (Beijing+Beijing)/2 < (logistics+freight)/2 < (company+company)/2. Since the set hyper-parameter is 2, Beijing and Chengdu are obtained as the phrase-level components, and the remaining parts form the sentence-level components, so the labeling of the similar sentence pair is: [ "Beijing to Chengdu logistics company": [0,0,1,0,0,1,1,1,1,1]; "Beijing to Chengdu freight company": [0,0,1,0,0,1,1,1,1,1] ].
And further training a granularity classifier by using the labeled corpora obtained in the last step to automatically identify the content of sentences and phrases in a sentence.
703. Sentence level expansion is performed.
In this embodiment, similar question pairs are obtained from different data sources (for example, similar question pairs of a local platform, similar question pairs from network open sources, positive-sample similar question pairs of a semantic similarity task, and the like). The granularity classifier trained in step 702 is then used for prediction, and the phrase-level entities are replaced with the @ symbol, yielding: "@ to @ logistics company" and "@ to @ freight company". A mapping is then created for each template: the key is the template itself, and the value is the list of all templates similar to it. If the A and B templates are similar and the A and C templates are also similar, the resulting mapping is { "A": [ B, C ], "B": [ A, C ], "C": [ A, B ] }, and so on for the remaining cases, thereby establishing the sentence-level mapping relation for all similar question pairs.
Further, sentence expansion is performed based on the sentence-level mapping relationship, that is, first, the query is subjected to granularity classification by using a model of a granularity classifier, for example, the following results are obtained: [ "Beijing to Chengdu Logistics company" ]: [0,0,1,0,0,1,1,1,1,1], further, the phrase-level components can be replaced with @ symbols, and the sentence-level components can be degenerated into templates: and the @ to the @ logistics company is expanded at a sentence level, a sentence-level mapping table is searched to obtain [ "@ to @ freight company", "@ to @ freight special line", "@ to @ logistics", "@ to @ express", and … … ], and a plurality of similar sentence templates are obtained through a sentence template and a mapping relation.
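The query-side template degeneration and sentence-level lookup can be sketched as follows — a simplified variant that takes the classifier's per-token 0/1 labels; the function name `expand_sentence_level` is hypothetical:

```python
def expand_sentence_level(tokens, labels, sentence_table, mask="@"):
    """Degrade a classified query into its sentence template (object
    components -> '@') and look up similar templates.

    tokens: query tokens; labels: per-token classification from the
    granularity classifier (0 = phrase-level object component,
    1 = sentence-level structure component);
    sentence_table: the sentence-level mapping table."""
    template = "".join(mask if lab == 0 else tok
                       for tok, lab in zip(tokens, labels))
    objects = [tok for tok, lab in zip(tokens, labels) if lab == 0]
    return template, objects, sentence_table.get(template, [])
```

The retrieved similar templates are then refilled with phrase-level expansions of the extracted objects in the subsequent combination step.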
704. Phrase level expansion is performed.
In this embodiment, the database that can be used is a chinese semantic similarity database. Further, the following extension can be made by bilingual inter-translation: assuming that a and B are synonyms, a is translated into another language (such as english), then the obtained english is translated into chinese, and the translator recommends a plurality of synonyms in the process of translating into chinese, and the specific language type depends on the actual scene, so a obtains a plurality of synonyms at the translation level: a1, a2, A3 … …. Similarly, B can also do this and get synonyms for B at multiple translation levels: b1, B2, B3 … …. Therefore, the mapping relationship is obtained finally: { "A": [ B, A1, A2, A3, … …, B1, B2, B3, … … ], "B": [ A, A1, A2, A3, … …, B1, B2, B3, … … ] }, and the mapping word list relationship of phrase level is greatly expanded through the expansion in the above mode.
Phrase-level extension is then performed based on the mapping word list, that is, each entity in the sentence-level template is expanded respectively. For example, in "@ to @ logistics company", the first @ refers to Beijing and the second @ refers to Chengdu. The phrase-level mapping of Beijing can be generalized as: { "Beijing": [ Yanjing, Beijing City, capital, … ] }, and the phrase-level mapping of Chengdu can be generalized as: { "Chengdu": [ Hillchwa, Jincheng, Chengdu City, Sichuan Chengdu, … ] }.
705. And carrying out permutation and combination.
In this embodiment, for the target template "@ to @ logistics company": assuming the sentence-level generalization expansion yields m results, the first phrase generalizes to n results, and the second phrase generalizes to p results, the final generalization yields m × n × p results, so a large number of similar corpora are obtained by generalizing a single question.
706. And (4) scoring and filtering the combined result.
In this embodiment, since some poorly generalized cases are inevitable, the generalized results may first be automatically filtered with a BERT scoring model: all generalized similar corpora whose similarity to the source corpus is smaller than a threshold (0.65) are filtered out. Bad results are thus automatically filtered and good results retained, eliminating the labor cost of manual review.
In another possible scenario, an interaction process of the terminal device and the server for corpus expansion is shown in fig. 8, and fig. 8 is a schematic view of a scenario of another corpus expansion method provided in this embodiment of the present application. After the user triggers "error report response" D1, the server receives the question information, and further performs the following steps:
801. and the server performs question and answer configuration.
In this embodiment, the question-answer configuration is to configure appropriate response information for the question information.
802. And the server expands the linguistic data.
In this embodiment, the process of corpus expansion refers to the process shown in fig. 3 or fig. 7, which is not described herein again.
803. The server updates the question-answer pair data.
In this embodiment, by updating the question-answer pair data, on one hand, when the user raises the question again, a suitable response can be made; on the other hand, when a similar question is received, a corresponding response can also be made. The "configuration completed" dialog box D2 can therefore be generated to remind the user, thereby improving the user experience.
In a possible scenario, the corpus expansion process may be performed by the server after collecting an actual piece of data, which is described below. Referring to fig. 9, fig. 9 is a flowchart of another corpus expanding method according to an embodiment of the present application, where the embodiment of the present application at least includes the following steps:
901. and acquiring abnormal question information in the robot question-answering process.
In this embodiment, the abnormal question information may correspond to a certain number of abnormal questions, for example, 10 unanswered or complained questions; the abnormal question information may also correspond to collected information of questions that have not been answered or complained for a certain period of time, and the specific information form depends on the actual scene, which is not limited herein.
902. And setting response information aiming at the abnormal problem information.
903. And performing corpus expansion based on the abnormal problem information.
In this embodiment, the process of corpus expansion refers to the process shown in fig. 3 or fig. 7, which is not described herein again.
904. And associating the expanded abnormal problem information with the response information.
In this embodiment, the expanded abnormal question information is associated with the response information, so that the relevant question is properly solved, and the recognition range of the question is expanded.
The method and the device can be applied to corpus supplement work in various fields such as intelligent assistants. When corpora need to be supplemented and expanded, inputting them into the model and adding the outputs can guarantee high availability of the corpora in each field and diversity of query phrasings, so that the experience of AI products in each field becomes increasingly accurate and user satisfaction is improved. For example, for the input "the student cannot go into my classroom", the model outputs: "students are prohibited from entering the classroom", "students cannot enter the classroom and are prohibited from entering", "students are prohibited from entering the Tencent classroom", "students cannot enter the Tencent classroom", "students cannot find my classroom", …, and corresponding responses can then be obtained for these questions.
The present application divides a corpus into the sentence level (coarse granularity) and the phrase level (fine granularity), and then performs rewriting and replacement at the sentence level and the phrase level respectively. Assuming there are m alternatives at the sentence level and n alternatives at the phrase level, the number of rewrites finally obtainable for a question (query) is m × n; when m and n are both large, the query generalization capability is very powerful. Moreover, the queries generated in this way are all of relatively high quality, and they can afterwards be filtered through the BERT scoring mechanism, so that some queries with poor reliability are automatically filtered out, saving the cost of manual review and improving the accuracy of expansion.
In order to better implement the above-mentioned aspects of the embodiments of the present application, the following also provides related apparatuses for implementing the above-mentioned aspects. Referring to fig. 10, fig. 10 is a schematic structural diagram of a corpus expanding device according to an embodiment of the present application, where the corpus expanding device 1000 includes:
an obtaining unit 1001 configured to obtain a target corpus;
an extracting unit 1002, configured to extract a first granular component and a second granular component in the target corpus, where the first granular component is used to indicate a corpus structure of the target corpus, the second granular component is used to indicate a corpus object of the target corpus, and a combination of the corpus structure and the corpus object is the target corpus;
a determining unit 1003, configured to determine a structural corpus associated with the first granular component, so as to obtain a structural corpus set;
the determining unit 1003 is further configured to determine an object corpus associated with the second granularity component, so as to obtain an object corpus set;
an expanding unit 1004, configured to combine the corpora in the structure corpus set and the corpora in the object corpus set to obtain an expanded corpus set.
Optionally, in some possible implementation manners of the present application, the extracting unit 1002 is specifically configured to input the target corpus into a granularity classifier to obtain a classification tag sequence of the target corpus;
the extracting unit 1002 is specifically configured to determine a classification result of each component in the target corpus based on the classification tag sequence, so as to obtain the first granularity component and the second granularity component.
Optionally, in some possible implementation manners of the present application, the extracting unit 1002 is specifically configured to obtain training data including similar statement pairs;
the extracting unit 1002 is specifically configured to perform word alignment on the similar sentence pairs to obtain an alignment result;
the extracting unit 1002 is specifically configured to count word frequency information in the alignment result;
the extracting unit 1002 is specifically configured to determine a training object component and a training structure component in the similar sentence pair based on the word frequency information;
the extracting unit 1002 is specifically configured to perform classification and labeling on the training object components and the training structure components, so as to train the granularity classifier.
Optionally, in some possible implementation manners of the present application, the determining unit 1003 is specifically configured to obtain a preset hyper-parameter;
the determining unit 1003 is specifically configured to determine quantity information of the training object components based on the preset hyper-parameter;
the determining unit 1003 is specifically configured to determine the training object component based on the word frequency information and the quantity information;
the determining unit 1003 is specifically configured to use a part of the similar sentence pair excluding the training object component as the training structure component.
Optionally, in some possible implementation manners of the present application, the determining unit 1003 is specifically configured to obtain a preset question-answer pair of at least one information source;
the determining unit 1003 is specifically configured to use each corpus in the preset question-answer pair as a template corpus to determine corresponding similar corpuses;
the determining unit 1003 is specifically configured to determine a first mapping table based on the template corpus and the similar corpus;
the determining unit 1003 is specifically configured to determine, according to the first mapping table, a structural corpus associated with the first granular component, so as to obtain the structural corpus set.
Optionally, in some possible implementation manners of the present application, the determining unit 1003 is specifically configured to determine an object component and a structure component corresponding to each corpus in the preset question-answer pair;
the determining unit 1003 is specifically configured to convert the object component into a target identifier, so as to obtain the template corpus;
the determining unit 1003 is specifically configured to determine the corresponding similar corpus based on the template corpus.
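A minimal sketch of the template-corpus construction and the first mapping table described above. The `<OBJ>` placeholder token and the plain dictionary representation are illustrative assumptions, not the patent's actual target identifier or storage format.

```python
def to_template(corpus, object_components, slot="<OBJ>"):
    """Replace each object component with a target identifier,
    yielding the template corpus."""
    return " ".join(slot if tok in object_components else tok
                    for tok in corpus.split())

# First mapping table: template corpus -> associated structural corpora.
first_mapping = {}

def register_similar(corpus, object_components, similar_corpora):
    """Register the similar corpora determined for one template corpus."""
    template = to_template(corpus, object_components)
    first_mapping.setdefault(template, []).extend(similar_corpora)
    return template

template = register_similar("how tall is Yao", {"Yao"},
                            ["what is the height of <OBJ>",
                             "<OBJ> is how tall"])
```

Looking up a template in `first_mapping` then yields the structural corpora associated with the first granularity component.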
Optionally, in some possible implementation manners of the present application, the determining unit 1003 is specifically configured to obtain preset semantic data;
the determining unit 1003 is specifically configured to extract a similar object pair in the preset semantic data to establish a second mapping table;
the determining unit 1003 is specifically configured to determine, based on the second mapping table, an object corpus associated with the second granularity component, so as to obtain the object corpus set.
Optionally, in some possible implementation manners of the present application, the determining unit 1003 is specifically configured to extract the similar object pair in the preset semantic data;
the determining unit 1003 is specifically configured to convert the similar object pair based on a language category to obtain a conversion object pair, where a data amount corresponding to the conversion object pair is greater than a data amount corresponding to the similar object pair;
the determining unit 1003 is specifically configured to establish the second mapping table according to the conversion object pair.
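The construction of the second mapping table can be illustrated as below. The sample object pair and the "language-category conversion" step (shown here as a trivial `str.upper`, standing in for e.g. script conversion or transliteration) are assumptions; the point is only that conversion enlarges the pair set before the table is built.

```python
def build_second_mapping(similar_object_pairs, converters):
    """Build the second mapping table (object -> similar objects).

    Each similar-object pair is first expanded by language-category
    conversion, so the converted pair set is larger than the original
    similar-object pair set, as the embodiment above requires.
    """
    converted = list(similar_object_pairs)
    for a, b in similar_object_pairs:
        for convert in converters:
            converted.append((convert(a), convert(b)))
    table = {}
    for a, b in converted:
        table.setdefault(a, set()).add(b)
        table.setdefault(b, set()).add(a)
    return table

# str.upper stands in for a real language-category conversion.
table = build_second_mapping([("Yao Ming", "Yao")], [str.upper])
```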
Optionally, in some possible implementations of the present application, the extension unit 1004 is specifically configured to determine a structural extension item based on a corpus in the structural corpus set;
the extension unit 1004 is specifically configured to determine at least one object extension item based on the corpus in the object corpus set;
the extension unit 1004 is specifically configured to perform permutation and combination on the structure extension item and the object extension item to obtain the extension corpus set.
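The permutation-and-combination step amounts to a cartesian product of the structural expansion items and the object expansion items. A sketch, with the `<OBJ>` slot token as an assumption:

```python
from itertools import product

def expand(structure_items, object_items, slot="<OBJ>"):
    """Permute and combine structural expansion items with object
    expansion items to obtain the expanded corpus set."""
    return [s.replace(slot, o)
            for s, o in product(structure_items, object_items)]

expanded = expand(["how tall is <OBJ>", "what is the height of <OBJ>"],
                  ["Yao Ming", "Yao"])
# 2 structural items x 2 object items -> 4 expanded corpora
```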
Optionally, in some possible implementation manners of the present application, the extension unit 1004 is specifically configured to input the extended corpus set into a preset scoring model to obtain an extended corpus score;
the extension unit 1004 is specifically configured to screen the extension corpus based on a preset threshold, so as to update the extension corpus set.
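The scoring-and-screening step can be sketched as follows. The scoring function below is a deliberate placeholder for the preset scoring model (which would estimate e.g. fluency or semantic fidelity), and the threshold value is an assumption.

```python
def screen(expanded_corpora, score_fn, threshold=0.2):
    """Score each expanded corpus with a preset scoring model and keep
    only those whose score reaches the threshold, thereby updating the
    expanded corpus set."""
    return [c for c in expanded_corpora if score_fn(c) >= threshold]

# Placeholder scorer: shorter candidates score higher (illustration only).
kept = screen(["how tall is Yao", "Yao how is tall tall tall"],
              score_fn=lambda c: 1.0 / len(c.split()))
```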
Optionally, in some possible implementations of the present application, the extension unit 1004 is specifically configured to determine reply information corresponding to the target corpus;
the extension unit 1004 is specifically configured to associate the reply information with each item in the extension corpus set to obtain question-answer pair information, where the question-answer pair information is used for responding to an input feedback reply corpus of the question corpus.
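Associating the reply information with every expanded question yields the question-answer pair information used at answer time; a minimal sketch (the dictionary store and the fallback reply are assumptions):

```python
def build_qa_pairs(expanded_corpora, reply):
    """Associate the reply information with every item of the expanded
    corpus set, producing question-answer pair information."""
    return {question: reply for question in expanded_corpora}

qa = build_qa_pairs(["how tall is Yao Ming",
                     "what is the height of Yao Ming"],
                    "Yao Ming is 2.26 m tall.")

def answer(question, fallback="Sorry, I don't know."):
    """Feed back the reply corpus in response to an input question corpus."""
    return qa.get(question, fallback)
```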
A target corpus is obtained; then a first granularity component and a second granularity component are extracted from the target corpus, where the first granularity component indicates the corpus structure of the target corpus, the second granularity component indicates the corpus object of the target corpus, and the corpus structure and the corpus object combine into the target corpus. A structural corpus associated with the first granularity component is determined to obtain a structural corpus set; an object corpus associated with the second granularity component is further determined to obtain an object corpus set; the corpora in the structural corpus set and the corpora in the object corpus set are then combined to obtain an expanded corpus set. An efficient expansion process for associated corpora is thus realized: the corpus components are analyzed from the compositional structure of the corpus and decomposed at different granularities, so that similar components can be replaced and recombined, improving both the efficiency and the accuracy of corpus expansion.
An embodiment of the present application further provides a terminal device. Fig. 11 is a schematic structural diagram of another terminal device provided in an embodiment of the present application; for convenience of description, only the portions related to the embodiment are shown. For undisclosed technical details, please refer to the method portion of the embodiments of the present application. The terminal may be any terminal device, including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point-of-sale (POS) terminal, a vehicle-mounted computer, and the like. The following takes a mobile phone as an example:
Fig. 11 is a block diagram illustrating a partial structure of a mobile phone related to the terminal provided in an embodiment of the present application. Referring to fig. 11, the mobile phone includes: a radio frequency (RF) circuit 1110, a memory 1120, an input unit 1130, a display unit 1140, a sensor 1150, an audio circuit 1160, a wireless fidelity (WiFi) module 1170, a processor 1180, and a power supply 1190. Those skilled in the art will appreciate that the mobile phone structure shown in fig. 11 is not limiting; the phone may include more or fewer components than those shown, combine some components, or use a different arrangement of components.
The following describes each component of the mobile phone in detail with reference to fig. 11:
The RF circuit 1110 may be used for receiving and transmitting signals during a message transmission or a call; in particular, it receives downlink information from a base station and forwards it to the processor 1180 for processing, and transmits uplink data to the base station. In general, the RF circuit 1110 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 1110 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), long term evolution (LTE), email, short message service (SMS), and the like.
The memory 1120 may be used to store software programs and modules, and the processor 1180 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1120. The memory 1120 may mainly include a program storage area and a data storage area: the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function or an image playing function), and the like; the data storage area may store data created according to the use of the mobile phone (such as audio data and a phone book). Further, the memory 1120 may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
The input unit 1130 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 1130 may include a touch panel 1131 and other input devices 1132. The touch panel 1131, also referred to as a touch screen, can collect touch operations of a user on or near it (for example, operations performed by the user on or near the touch panel 1131 with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection device according to a preset program. Optionally, the touch panel 1131 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch position of the user, detects the signal generated by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends them to the processor 1180, and can also receive and execute commands sent by the processor 1180. In addition, the touch panel 1131 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 1131, the input unit 1130 may include other input devices 1132, which may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and a power key), a trackball, a mouse, a joystick, and the like.
The display unit 1140 may be used to display information input by the user or provided to the user, as well as various menus of the mobile phone. The display unit 1140 may include a display panel 1141, which may optionally be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. Further, the touch panel 1131 may cover the display panel 1141; when the touch panel 1131 detects a touch operation on or near it, the operation is transmitted to the processor 1180 to determine the type of the touch event, and the processor 1180 then provides a corresponding visual output on the display panel 1141 according to that type. Although in fig. 11 the touch panel 1131 and the display panel 1141 are two independent components implementing the input and output functions of the mobile phone, in some embodiments the touch panel 1131 and the display panel 1141 may be integrated to implement both functions.
The handset may also include at least one sensor 1150, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, wherein the ambient light sensor may adjust the brightness of the display panel 1141 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 1141 and/or the backlight when the mobile phone moves to the ear. As one of the motion sensors, the accelerometer sensor can detect the magnitude of acceleration in each direction (generally, three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications of recognizing the posture of a mobile phone (such as horizontal and vertical screen switching, related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer and tapping), and the like; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which can be configured on the mobile phone, further description is omitted here.
The audio circuit 1160, the speaker 1161, and the microphone 1162 may provide an audio interface between the user and the mobile phone. On one hand, the audio circuit 1160 may transmit the electrical signal converted from received audio data to the speaker 1161, which converts it into a sound signal for output; on the other hand, the microphone 1162 converts collected sound signals into electrical signals, which the audio circuit 1160 receives and converts into audio data. The audio data is then processed by the processor 1180 and transmitted, for example, to another mobile phone via the RF circuit 1110, or output to the memory 1120 for further processing.
WiFi is a short-range wireless transmission technology. Through the WiFi module 1170, the mobile phone can help the user receive and send e-mails, browse web pages, access streaming media, and the like, providing the user with wireless broadband internet access. Although fig. 11 shows the WiFi module 1170, it is understood that it is not an essential component of the mobile phone and may be omitted as needed without changing the essence of the invention.
The processor 1180 is a control center of the mobile phone, and is connected to various parts of the whole mobile phone through various interfaces and lines, and executes various functions of the mobile phone and processes data by operating or executing software programs and/or modules stored in the memory 1120 and calling data stored in the memory 1120, thereby performing overall monitoring of the mobile phone. Optionally, processor 1180 may include one or more processing units; optionally, the processor 1180 may integrate an application processor and a modem processor, wherein the application processor mainly handles operating systems, user interfaces, application programs, and the like, and the modem processor mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated within processor 1180.
The mobile phone further includes a power supply 1190 (e.g., a battery) for supplying power to each component, and optionally, the power supply may be logically connected to the processor 1180 through a power management system, so that functions of managing charging, discharging, power consumption management, and the like are implemented through the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which are not described herein.
In the embodiment of the present application, the processor 1180 included in the terminal further has the function of executing the steps of the corpus expansion method described above.
Referring to fig. 12, fig. 12 is a schematic structural diagram of a server according to an embodiment of the present application. The server 1200 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 1222 (e.g., one or more processors), a memory 1232, and one or more storage media 1230 (e.g., one or more mass storage devices) storing an application program 1242 or data 1244. The memory 1232 and the storage medium 1230 may be transient or persistent storage. The program stored in the storage medium 1230 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Still further, the central processing unit 1222 may be configured to communicate with the storage medium 1230 and to execute the series of instruction operations in the storage medium 1230 on the server 1200.
The server 1200 may also include one or more power supplies 1226, one or more wired or wireless network interfaces 1250, one or more input/output interfaces 1258, and/or one or more operating systems 1241, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, etc.
The steps performed by the management apparatus in the above-described embodiment may be based on the server configuration shown in fig. 12.
An embodiment of the present application further provides a computer-readable storage medium, in which instructions for corpus expansion are stored, and when the instructions are executed on a computer, the computer is enabled to execute the steps performed by the corpus expansion apparatus in the method described in the foregoing embodiments shown in fig. 3 to 9.
The embodiment of the present application further provides a computer program product including instructions for corpus expansion, which, when running on a computer, causes the computer to perform the steps performed by the corpus expansion apparatus in the method described in the foregoing embodiments shown in fig. 3 to 9.
The embodiment of the present application further provides a corpus expanding system, where the corpus expanding system may include a corpus expanding device in the embodiment described in fig. 10, a terminal device in the embodiment described in fig. 11, or a server described in fig. 12.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be substantially implemented or contributed to by the prior art, or all or part of the technical solution may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a corpus expansion device, or a network device) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a read-only memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (15)

1. A corpus expansion method, comprising:
acquiring a target corpus;
extracting a first granularity component and a second granularity component in the target corpus, wherein the first granularity component is used for indicating a corpus structure of the target corpus, the second granularity component is used for indicating a corpus object of the target corpus, and the combination of the corpus structure and the corpus object is the target corpus;
determining a structural corpus associated with the first granular component to obtain a structural corpus set;
determining the object linguistic data associated with the second granularity component to obtain an object linguistic data set;
and combining the linguistic data in the structure linguistic data set and the linguistic data in the object linguistic data set to obtain an expanded linguistic data set.
2. The method according to claim 1, wherein the extracting the first granular component and the second granular component in the target corpus comprises:
inputting the target corpus into a granularity classifier to obtain a classification label sequence of the target corpus;
and determining a classification result of each component in the target corpus based on the classification tag sequence to obtain the first granularity component and the second granularity component.
3. The method of claim 2, further comprising:
acquiring training data containing similar sentence pairs;
performing word alignment on the similar sentence pairs to obtain an alignment result;
counting word frequency information in the alignment result;
determining training object components and training structure components in the similar sentence pairs based on the word frequency information;
and carrying out classification labeling on the training object components and the training structure components so as to train the granularity classifier.
4. The method of claim 3, wherein the determining training object components and training structure components in the similar sentence pairs based on the word frequency information comprises:
acquiring a preset hyper-parameter;
determining quantity information of the training object components based on the preset hyper-parameter;
determining the training object components based on the word frequency information and the quantity information;
and taking the part of the similar sentence pair except the training object component as the training structure component.
5. The method according to claim 1, wherein the determining the structural corpus associated with the first granular component to obtain a set of structural corpora comprises:
acquiring a preset question-answer pair of at least one information source;
taking each corpus in the preset question-answer pair as a template corpus to determine corresponding similar corpora;
determining a first mapping table based on the template corpus and the similar corpus;
and determining the structural language material associated with the first granularity component according to the first mapping table to obtain the structural language material set.
6. The method according to claim 5, wherein the determining the corresponding similar corpora using each corpus in the preset question-answer pair as a template corpus comprises:
determining object components and structural components corresponding to each corpus in the preset question-answer pair;
converting the object components into target identifiers to obtain the template corpus;
and determining the corresponding similar corpus based on the template corpus.
7. The method according to claim 1, wherein the determining the object corpus associated with the second granular component to obtain an object corpus set comprises:
acquiring preset semantic data;
extracting similar object pairs in the preset semantic data to establish a second mapping table;
and determining the object corpus associated with the second granularity component based on the second mapping table to obtain the object corpus set.
8. The method according to claim 7, wherein the extracting similar objects in the preset semantic data to establish a second mapping table comprises:
extracting the similar object pairs in the preset semantic data;
converting the similar object pair based on the language category to obtain a converted object pair, wherein the data volume corresponding to the converted object pair is larger than that corresponding to the similar object pair;
and establishing the second mapping table according to the conversion object pair.
9. The method according to claim 1, wherein the combining based on the corpus in the structural corpus set and the corpus in the object corpus set to obtain an extended corpus set comprises:
determining a structural expansion item based on the corpora in the structural corpus set;
determining at least one object expansion item based on the corpora in the object corpus set;
and arranging and combining the structure expansion items and the object expansion items to obtain the expansion corpus set.
10. The method according to any one of claims 1-9, further comprising:
inputting the extended corpus set into a preset scoring model to obtain an extended corpus score;
and screening the expanded corpora based on a preset threshold so as to update the expanded corpus set.
11. The method according to any one of claims 1-9, further comprising:
determining reply information corresponding to the target corpus;
and associating the reply information with each item in the extended corpus set to obtain question-answer pair information, wherein the question-answer pair information is used for responding to input feedback reply corpuses of question corpuses.
12. The method according to claim 1, wherein the first granular component is a phrase granular component, the second granular component is a sentence granular component, the target corpus is question information corresponding to a question-answering task in machine question answering, and the expanded corpus set is used for expanding the question information.
13. A corpus expanding device, comprising:
the acquisition unit is used for acquiring the target corpus;
an extracting unit, configured to extract a first granular component and a second granular component in the target corpus, where the first granular component is used to indicate a corpus structure of the target corpus, the second granular component is used to indicate a corpus object of the target corpus, and a combination of the corpus structure and the corpus object is the target corpus;
a determining unit, configured to determine a structural corpus associated with the first granular component to obtain a structural corpus set;
the determining unit is further configured to determine an object corpus associated with the second granularity component to obtain an object corpus set;
and the expansion unit is used for combining the corpora in the structure corpus set and the corpora in the object corpus set to obtain an expanded corpus set.
14. A computer device, the computer device comprising a processor and a memory:
the memory is used for storing program codes; the processor is configured to execute the method for corpus expansion according to any one of claims 1 to 12 according to instructions in the program code.
15. A computer-readable storage medium having stored therein instructions which, when run on a computer, cause the computer to perform the method of corpus expansion according to any one of the preceding claims 1 to 12.
CN202011393376.5A 2020-12-02 Corpus expansion method and related device Active CN113392631B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011393376.5A CN113392631B (en) 2020-12-02 Corpus expansion method and related device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011393376.5A CN113392631B (en) 2020-12-02 Corpus expansion method and related device

Publications (2)

Publication Number Publication Date
CN113392631A true CN113392631A (en) 2021-09-14
CN113392631B CN113392631B (en) 2024-04-26

Family


Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070073533A1 (en) * 2005-09-23 2007-03-29 Fuji Xerox Co., Ltd. Systems and methods for structural indexing of natural language text
US20110257963A1 (en) * 2006-10-10 2011-10-20 Konstantin Zuev Method and system for semantic searching
US20170091312A1 (en) * 2015-09-24 2017-03-30 International Business Machines Corporation Generating natural language dialog using a questions corpus
WO2017111835A1 (en) * 2015-12-26 2017-06-29 Intel Corporation Binary linear classification
CN109214006A (en) * 2018-09-18 2019-01-15 中国科学技术大学 The natural language inference method that the hierarchical semantic of image enhancement indicates
CN110852109A (en) * 2019-11-11 2020-02-28 腾讯科技(深圳)有限公司 Corpus generating method, corpus generating device, and storage medium
US20200089768A1 (en) * 2018-09-19 2020-03-19 42 Maru Inc. Method, system, and computer program for artificial intelligence answer
CN111125323A (en) * 2019-11-21 2020-05-08 腾讯科技(深圳)有限公司 Chat corpus labeling method and device, electronic equipment and storage medium
CN111708873A (en) * 2020-06-15 2020-09-25 腾讯科技(深圳)有限公司 Intelligent question answering method and device, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LIU Peng; ZONG Chengqing: "A Statistical Translation Method Based on Phrase Fuzzy Matching and Sentence Expansion", Journal of Chinese Information Processing, no. 05, 15 September 2009 (2009-09-15), pages 40 - 46 *

Similar Documents

Publication Publication Date Title
KR102534721B1 (en) Method, apparatus, device and storage medium for training model
CN109344291B (en) Video generation method and device
CN107210035B (en) Generation of language understanding systems and methods
CN110162770A (en) A kind of word extended method, device, equipment and medium
CN111931501B (en) Text mining method based on artificial intelligence, related device and equipment
CN104217717A (en) Language model constructing method and device
CN110795528A (en) Data query method and device, electronic equipment and storage medium
CN110334347A (en) Information processing method, relevant device and storage medium based on natural language recognition
CN111597804B (en) Method and related device for training entity recognition model
CN110795538B (en) Text scoring method and related equipment based on artificial intelligence
CN110852109A (en) Corpus generating method, corpus generating device, and storage medium
CN112214605A (en) Text classification method and related device
CN108776677B (en) Parallel sentence library creating method and device and computer readable storage medium
CN110321559A (en) Answer generation method, device and the storage medium of natural language problem
CN114328852A (en) Text processing method, related device and equipment
CN114117056B (en) Training data processing method and device and storage medium
CN110597957B (en) Text information retrieval method and related device
CN113704008A (en) Anomaly detection method, problem diagnosis method and related products
CN113392631B (en) Corpus expansion method and related device
CN112862021B (en) Content labeling method and related device
CN112036135B (en) Text processing method and related device
CN113392631A (en) Corpus expansion method and related device
CN113569043A (en) Text category determination method and related device
CN110781274A (en) Question-answer pair generation method and device
CN112232048A (en) Table processing method based on neural network and related device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40052299

Country of ref document: HK

SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant