CN115391539A - Corpus data processing method and device and electronic equipment - Google Patents
Corpus data processing method and device and electronic equipment
- Publication number
- CN115391539A (application CN202211052774.XA)
- Authority
- CN
- China
- Prior art keywords
- corpus
- group
- target
- linguistic data
- similarity
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/01—Customer relationship services
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/02—Banking, e.g. interest calculation or account maintenance
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- General Engineering & Computer Science (AREA)
- Accounting & Taxation (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Databases & Information Systems (AREA)
- Finance (AREA)
- General Business, Economics & Management (AREA)
- Development Economics (AREA)
- Economics (AREA)
- Marketing (AREA)
- Strategic Management (AREA)
- Mathematical Physics (AREA)
- Human Computer Interaction (AREA)
- Technology Law (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application discloses a corpus data processing method, a corpus data processing apparatus and an electronic device, applied to the field of big data. The method comprises: obtaining a system database corresponding to an application system, where the application system can obtain, for a query request, the corpus corresponding to that query request from the system database; the system database comprises at least a plurality of corpus groups, and each corpus group comprises at least one target corpus; for each corpus group, dividing the target corpora into a first corpus set and a second corpus set according to the corpus hit rate of each target corpus, where the first corpus set comprises at least one first corpus, the second corpus set comprises at least one second corpus, and the corpus hit rate of the first corpus is greater than that of the second corpus; and adding or deleting corpora in the second corpus set corresponding to a second corpus group by using the corpora in the second corpus set corresponding to a first corpus group, where the first corpus group and the second corpus group have an association relationship.
Description
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a corpus data processing method and apparatus, and an electronic device.
Background
In a bank customer service system, after a question raised by a customer is received, suitable response corpora are screened out by means of corpus tags, so that a consultation service can be provided to the customer.
However, these response corpora require manual labeling, which results in poor user experience of the bank customer service system.
Disclosure of Invention
In view of this, the present application provides a corpus data processing method, apparatus and electronic device, so as to solve the technical problem that the use experience of the application system corresponding to the current system database is poor.
A corpus data processing method, the method comprising:
obtaining a system database corresponding to an application system, where the application system can obtain, for a query request, the corpus corresponding to the query request from the system database; the system database comprises at least a plurality of corpus groups, and each corpus group corresponds to a service type; each corpus group comprises at least one target corpus;
for each corpus group, dividing the target corpora into a first corpus set and a second corpus set according to the corpus hit rate of each target corpus; the first corpus set comprises at least one first corpus, the second corpus set comprises at least one second corpus, and the corpus hit rate of the first corpus is greater than that of the second corpus;
and adding or deleting corpora in the second corpus set corresponding to a second corpus group by using the corpora in the second corpus set corresponding to a first corpus group, where the first corpus group and the second corpus group have an association relationship.
Preferably, in the method, the associating relationship between the first corpus group and the second corpus group includes:
the corpus group similarity between the first corpus group and the second corpus group is greater than the corpus group similarity between the first corpus group and other corpus groups in the plurality of corpus groups;
and the corpus group similarity between the first corpus group and the second corpus group is greater than the corpus group similarity between the second corpus group and other corpus groups in the plurality of corpus groups.
Preferably, in the method, the corpus group similarity between the first corpus group and the second corpus group is:
an overall similarity obtained as the weighted average of a first set similarity and a second set similarity, each weighted by its corresponding weight;
wherein the first set similarity is the similarity between the first corpus set in the first corpus group and the first corpus set in the second corpus group; the second set similarity is the similarity between the second corpus set in the first corpus group and the second corpus set in the second corpus group.
Preferably, the adding the corpus in the second corpus set corresponding to the second corpus group by using the corpus in the second corpus set corresponding to the first corpus group includes:
acquiring, from the second corpora contained in the second corpus set corresponding to the first corpus group, a third corpus whose corpus hit rate is greater than or equal to a first threshold; and adding the third corpus to the second corpus set corresponding to the second corpus group.
Preferably, the deleting the corpus in the second corpus set corresponding to the second corpus group by using the corpus in the second corpus set corresponding to the first corpus group includes:
deleting a fourth corpus from a second corpus set corresponding to a second corpus group, where the fourth corpus is a corpus added to the second corpus group from the second corpus set corresponding to the first corpus group under the condition that a corpus hit rate is greater than or equal to a first threshold, and the corpus hit rate of the fourth corpus in the second corpus set corresponding to the first corpus group is reduced from being greater than or equal to the first threshold to being less than the first threshold.
The above method, preferably, further comprises:
acquiring a first associated corpus from the second corpus set corresponding to the first corpus group;
acquiring a second associated corpus from the second corpus set corresponding to the second corpus group; the second associated corpus and the first associated corpus are derived from a target source document, and, among the plurality of source documents corresponding to the system database, the number of target corpora generated from the target source document satisfies a target screening condition;
moving the first associated corpus to the second corpus set corresponding to the second corpus group;
and moving the second associated corpus to the second corpus set corresponding to the first corpus group.
The above method, preferably, further comprises:
acquiring at least one new corpus;
performing word segmentation on the new corpus to obtain corpus keywords of each new corpus;
acquiring keyword repetition degrees of the corpus keywords in a first corpus set and a second corpus set corresponding to the corpus group corresponding to the new corpus respectively;
and adding the new corpus to the first corpus set or the second corpus set of the corpus group corresponding to the new corpus, according to the keyword repetition degree.
The above method, preferably, further comprises:
acquiring a target query request, wherein the target query request at least comprises query keywords;
performing corpus queries in the first corpus set and the second corpus set corresponding to a target corpus group, respectively, by using the query keyword, to obtain a first query result and a second query result; the target corpus group is the corpus group whose service type corresponds to the query keyword;
sorting the corpora in the first query result and the corpora in the second query result according to corpus similarity to obtain a sorting result;
and outputting the corpora in the first query result and the corpora in the second query result according to the sorting result.
A corpus data processing apparatus, comprising:
a data obtaining unit, configured to obtain a system database corresponding to an application system, where the application system can obtain, for a query request, the corpus corresponding to the query request from the system database; the system database comprises at least a plurality of corpus groups, and each corpus group corresponds to a service type; each corpus group comprises at least one target corpus;
a corpus dividing unit, configured to divide the target corpus into a first corpus set and a second corpus set according to the corpus hit rate of the target corpus for each corpus group; the first corpus set comprises at least one first corpus, the second corpus set comprises at least one second corpus, and the corpus hit rate of the first corpus is greater than that of the second corpus;
and the corpus processing unit is used for adding or deleting the corpus in the second corpus set corresponding to the second corpus group by using the corpus in the second corpus set corresponding to the first corpus group, and the first corpus group and the second corpus group have an association relation.
An electronic device, comprising:
a memory for storing a computer program and data generated by the execution of the computer program;
a processor for executing the computer program to implement: obtaining a system database corresponding to an application system, where the application system can obtain, for a query request, the corpus corresponding to the query request from the system database; the system database comprises at least a plurality of corpus groups, and each corpus group corresponds to a service type; each corpus group comprises at least one target corpus; for each corpus group, dividing the target corpora into a first corpus set and a second corpus set according to the corpus hit rate of each target corpus; the first corpus set comprises at least one first corpus, the second corpus set comprises at least one second corpus, and the corpus hit rate of the first corpus is greater than that of the second corpus; and adding or deleting corpora in the second corpus set corresponding to a second corpus group by using the corpora in the second corpus set corresponding to a first corpus group, where the first corpus group and the second corpus group have an association relationship.
According to the above scheme, in the corpus data processing method, apparatus and electronic device provided by the present application, the target corpora of each corpus group in the system database are first divided into two corpus sets according to corpus hit rate, and the corpora in one of these corpus sets are then used to add or delete corpora in the corresponding corpus set of the corpus group for an associated service type. The corpora in the corpus group corresponding to each service type in the system database are thereby adjusted, which improves the use experience of the application system corresponding to the system database.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for processing corpus data according to an embodiment of the present application;
Fig. 2, Fig. 3 and Fig. 4 are partial flowcharts of a method for processing corpus data according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of a corpus data processing apparatus according to a second embodiment of the present application;
Fig. 6 and Fig. 7 are schematic structural diagrams of a corpus data processing apparatus according to a second embodiment of the present application;
Fig. 8 is a schematic structural diagram of an electronic device according to a third embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Referring to fig. 1, a flowchart of an implementation of a corpus data processing method provided in an embodiment of the present application is shown, where the method may be applied to an electronic device capable of performing data processing, such as a computer or a server. The technical scheme in the embodiment is mainly used for improving the use experience of the application system corresponding to the system database.
Specifically, the method in this embodiment may include the following steps:
Step 101: obtain a system database corresponding to the application system.
For a query request, the application system can obtain the corpus corresponding to the query request from the system database. For example, the keyword in the query request is used to query the system database for corpora that satisfy the query condition with respect to the keyword, and the matching corpora are output.
Specifically, the system database at least includes a plurality of corpus groups, and each corpus group corresponds to a service type, such as a mobile banking type, a loan type, a personal online banking type, and an account management type. Each corpus group comprises at least one target corpus.
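As a minimal sketch of the data layout described above, the following Python models a system database as a mapping from service type to corpus group. The names `Corpus`, `CorpusGroup`, `hit_count` and `source_doc`, and the sample entries, are illustrative assumptions and do not come from the application.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Corpus:
    """One target corpus (a response text) with its historical hit count."""
    text: str
    hit_count: int = 0       # assumed: number of query requests that matched this corpus
    source_doc: str = ""     # assumed: identifier of the document the corpus came from


@dataclass
class CorpusGroup:
    """All target corpora that belong to one service type."""
    service_type: str
    corpora: List[Corpus] = field(default_factory=list)


# Hypothetical system database: service type -> corpus group.
system_database: Dict[str, CorpusGroup] = {
    "loan": CorpusGroup("loan", [Corpus("How to apply for a loan", 42)]),
    "mobile_banking": CorpusGroup("mobile_banking", [Corpus("Reset app password", 7)]),
}
```

Keying the database by service type keeps the lookup of the corpus group for a given query to a single dictionary access.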
Step 102: for each corpus group, divide the target corpora into a first corpus set and a second corpus set according to the corpus hit rate of each target corpus.
In this embodiment, the corpus hit rate of each target corpus may be counted in the historical usage process of the system database, where the corpus hit rate refers to the number of query requests for which the keywords and the target corpus satisfy the query condition, that is, the number of times that the target corpus is queried to satisfy the query condition. Based on this, in this embodiment, a corresponding first corpus set and a corresponding second corpus set are created for each corpus group according to the corpus hit rate of each target corpus, so that the first corpus set includes at least one first corpus, the second corpus set includes at least one second corpus, and the corpus hit rate of the first corpus is greater than the corpus hit rate of the second corpus.
Specifically, in this embodiment, for each corpus group, the target corpus with the corpus hit rate greater than or equal to the hit threshold is divided into the first corpus set, and the target corpus with the corpus hit rate less than the hit threshold is divided into the second corpus set.
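The threshold-based division can be sketched as follows. This is an illustration only: the function name `split_by_hit_rate`, the dictionary representation of a corpus group and the sample hit counts are assumptions.

```python
def split_by_hit_rate(hit_counts, hit_threshold):
    """Split one corpus group into (first_set, second_set) by hit count.

    hit_counts maps each target corpus to its historical hit count.
    """
    first_set = {c for c, hits in hit_counts.items() if hits >= hit_threshold}
    second_set = {c for c, hits in hit_counts.items() if hits < hit_threshold}
    return first_set, second_set


# Hypothetical hit statistics for a "loan" corpus group.
loan_hits = {"apply for loan": 120, "loan interest rate": 85, "early repayment fee": 12}
first_set, second_set = split_by_hit_rate(loan_hits, hit_threshold=50)
```

With a threshold of 50, the two frequently hit corpora land in the first corpus set and the rarely hit one lands in the second corpus set.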
Step 103: add or delete corpora in the second corpus set corresponding to the second corpus group by using the corpora in the second corpus set corresponding to the first corpus group.
The first corpus group and the second corpus group have an association relationship. That is to say, in this embodiment, corpora are added or deleted between the second corpus sets of corpus groups that have an association relationship, so as to enrich the corpora in the corpus sets.
It can be seen from the foregoing solution that, in the corpus data processing method provided in this embodiment of the present application, the target corpora of each corpus group in the system database are first divided into two corpus sets according to corpus hit rate, and the corpora in one of these corpus sets are then used to add or delete corpora in the corresponding corpus set of the corpus group for an associated service type. The corpora in the corpus group corresponding to each service type in the system database are thereby adjusted, which improves the use experience of the application system corresponding to the system database.
In an implementation manner, the first corpus group and the second corpus group have an association relationship, which may specifically be:
the corpus group similarity between the first corpus group and the second corpus group is greater than the corpus group similarity between the first corpus group and other corpus groups in the plurality of corpus groups;
and the corpus group similarity between the first corpus group and the second corpus group is greater than the corpus group similarity between the second corpus group and other corpus groups in the plurality of corpus groups.
Specifically, in this embodiment, the corpus group similarity between any two corpus groups in all corpus groups is counted, then the corpus groups are sorted according to the corpus group similarity, and an association relationship is established between corpus groups whose corpus group similarity satisfies a similarity condition, for example, the corpus group similarity is greater than or equal to a similarity threshold.
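One way to read the mutual condition above is that two corpus groups are associated when each is the other's most similar group. The sketch below implements that reading under stated assumptions: the function name, the `frozenset`-keyed similarity table and the sample scores are all hypothetical, and ties are resolved by whichever group comes first alphabetically.

```python
def mutually_most_similar(similarity):
    """Return pairs {a, b} where b is a's most similar group and vice versa.

    similarity maps frozenset({a, b}) -> similarity score for every pair of groups.
    """
    groups = sorted({g for pair in similarity for g in pair})

    def best_partner(g):
        others = [o for o in groups if o != g]
        return max(others, key=lambda o: similarity[frozenset((g, o))])

    pairs = set()
    for g in groups:
        partner = best_partner(g)
        if best_partner(partner) == g:
            pairs.add(frozenset((g, partner)))
    return pairs


# Hypothetical pairwise corpus group similarities.
sim = {
    frozenset(("loan", "account")): 0.2,
    frozenset(("loan", "mobile")): 0.1,
    frozenset(("account", "mobile")): 0.6,
}
associated = mutually_most_similar(sim)
```

Here only the account and mobile groups are associated: the loan group prefers the account group, but the account group's best match is the mobile group.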
The corpus group similarity between the first corpus group and the second corpus group may specifically be an overall similarity obtained as the weighted average of the first set similarity and the second set similarity, each weighted by its corresponding weight.
The similarity of the first set is the similarity between the first corpus set in the first corpus group and the first corpus set in the second corpus group; the second set similarity is a similarity between a second corpus set in the first corpus group and a second corpus set in the second corpus group.
It should be noted that the weight corresponding to the first corpus set and the weight corresponding to the second corpus set may be set according to requirements.
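The weighted average can be illustrated with a short sketch. The application does not state how the similarity between two corpus sets is measured, so Jaccard similarity over keyword sets is used here purely as an assumed stand-in; the function names and the default weights 0.6 and 0.4 are likewise hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_similarity(first_a, second_a, first_b, second_b, w_first=0.6, w_second=0.4):
    """Weighted average of the first set similarity and the second set similarity."""
    s1 = jaccard(first_a, first_b)      # first set similarity
    s2 = jaccard(second_a, second_b)    # second set similarity
    return (w_first * s1 + w_second * s2) / (w_first + w_second)


# Hypothetical keyword sets for two corpus groups A and B.
score = group_similarity(
    first_a={"loan", "rate", "apply"}, second_a={"fee", "penalty"},
    first_b={"loan", "rate", "card"}, second_b={"fee", "limit"},
)
```

Dividing by the sum of the weights keeps the result a true average even when the chosen weights do not add up to 1.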
In an implementation manner, when the corpora in the second corpus set corresponding to the first corpus group are used in step 103 to add corpora to the second corpus set corresponding to the second corpus group, this may be implemented as follows:
acquiring, from the second corpora contained in the second corpus set corresponding to the first corpus group, a third corpus whose corpus hit rate is greater than or equal to a first threshold; and adding the third corpus to the second corpus set corresponding to the second corpus group.
That is to say, in this embodiment, the corpus hit rate of each second corpus in the second corpus set corresponding to each corpus group is counted, the third corpora whose corpus hit rate is greater than or equal to the first threshold are screened out, and these third corpora are then added to the second corpus set of the corpus group that has the association relationship, so as to enrich its corpora.
It should be noted that, in this embodiment, a corpus tag corresponding to the first corpus group may be set on each third corpus added to the second corpus set corresponding to the second corpus group, to indicate that the third corpus originates from the first corpus group. This allows the third corpus to be deleted from the second corpus set corresponding to the second corpus group when its corpus hit rate in the second corpus set corresponding to the first corpus group falls below the first threshold.
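The addition step with an origin tag might look like the following sketch. The function name `propagate`, the dictionary layouts and the sample data are assumptions made for illustration.

```python
def propagate(first_group_second_set, second_group_second_set, first_threshold, origin):
    """Copy sufficiently hit corpora from one second corpus set into an associated one.

    first_group_second_set:  corpus -> hit count (second set of the first group)
    second_group_second_set: corpus -> origin tag (None for the group's own corpora)
    """
    for corpus, hits in first_group_second_set.items():
        if hits >= first_threshold and corpus not in second_group_second_set:
            # The tag records which corpus group the borrowed corpus came from.
            second_group_second_set[corpus] = origin
    return second_group_second_set


# Hypothetical second corpus sets of two associated corpus groups.
loan_second = {"early repayment fee": 12, "collateral types": 3}
account_second = {"close an account": None}
propagate(loan_second, account_second, first_threshold=10, origin="loan")
```

Only "early repayment fee" reaches the first threshold, so it is the only corpus copied, and it carries the tag "loan" so that it can be located and removed later.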
In an implementation manner, when the corpora in the second corpus set corresponding to the first corpus group are used in step 103 to delete corpora from the second corpus set corresponding to the second corpus group, this may be implemented as follows:
deleting a fourth corpus from the second corpus set corresponding to the second corpus group, where the fourth corpus is a corpus that was added to the second corpus group from the second corpus set corresponding to the first corpus group when its corpus hit rate was greater than or equal to the first threshold, and whose corpus hit rate in the second corpus set corresponding to the first corpus group has since dropped from greater than or equal to the first threshold to less than the first threshold.
That is to say, in this embodiment, after the fourth corpus has been added to the second corpus group because its corpus hit rate in the second corpus set corresponding to the first corpus group was greater than or equal to the first threshold, its corpus hit rate in the second corpus set corresponding to the first corpus group continues to be tracked. If that corpus hit rate falls below the first threshold, the fourth corpus may be deleted from the second corpus set corresponding to the second corpus group.
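A sketch of the deletion step follows. It assumes that the hit rate is re-measured over some recent window (so that it can fall), that borrowed corpora carry an origin tag, and that a corpus no longer present in the originating set counts as having zero hits; the function name `retract` and the sample data are hypothetical.

```python
def retract(first_group_second_set, second_group_second_set, first_threshold, origin):
    """Remove corpora borrowed from `origin` whose hit count fell below the threshold.

    first_group_second_set:  corpus -> current hit count in the originating set
    second_group_second_set: corpus -> origin tag (None for the group's own corpora)
    """
    stale = [
        corpus for corpus, tag in second_group_second_set.items()
        # Assumption: a corpus missing from the originating set is treated as 0 hits.
        if tag == origin and first_group_second_set.get(corpus, 0) < first_threshold
    ]
    for corpus in stale:
        del second_group_second_set[corpus]
    return stale


# Hypothetical state after hit counts were re-measured.
loan_second = {"early repayment fee": 4, "collateral types": 15}
account_second = {
    "close an account": None,
    "early repayment fee": "loan",
    "collateral types": "loan",
}
removed = retract(loan_second, account_second, first_threshold=10, origin="loan")
```

The account group's own corpus is never touched because its tag is `None`; only borrowed corpora are candidates for removal.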
In one implementation, the method in this embodiment may further include the following steps, as shown in fig. 2:
Step 201: acquire a first associated corpus from the second corpus set corresponding to the first corpus group, and acquire a second associated corpus from the second corpus set corresponding to the second corpus group.
The second associated corpus and the first associated corpus are derived from a target source document. Among the plurality of source documents corresponding to the system database, the number of target corpora generated from the target source document satisfies a target screening condition.
The target screening condition here may be that the target source document generated the largest number of target corpora among the plurality of source documents corresponding to the system database. Alternatively, the target screening condition may be that the target source document ranks within the top Q of those source documents when they are sorted by number of generated target corpora from largest to smallest, where Q is a positive integer greater than or equal to 2.
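Both screening conditions reduce to counting corpora per source document and taking the top entries. A minimal sketch, assuming each corpus is tagged with a source document identifier (the function name, file names and data are hypothetical):

```python
from collections import Counter


def target_source_documents(corpus_sources, q=1):
    """Return the q source documents that produced the most target corpora.

    corpus_sources maps each target corpus to its source document identifier.
    q=1 corresponds to the "largest number" condition; q>=2 to the top-Q condition.
    """
    counts = Counter(corpus_sources.values())
    return [doc for doc, _ in counts.most_common(q)]


# Hypothetical mapping from corpus to source document.
corpus_sources = {
    "apply for loan": "loan_manual.pdf",
    "loan interest rate": "loan_manual.pdf",
    "early repayment fee": "loan_manual.pdf",
    "close an account": "account_faq.pdf",
    "reset app password": "app_guide.pdf",
    "bind a bank card": "app_guide.pdf",
}
top_docs = target_source_documents(corpus_sources, q=2)
```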
Step 202: move the first associated corpus to the second corpus set corresponding to the second corpus group, and move the second associated corpus to the second corpus set corresponding to the first corpus group.
That is to say, in this embodiment, the second corpora in each second corpus set are compared one by one with the second corpora in the second corpus set of the associated corpus group to identify second corpora that come from the same source document. The source documents are then counted to find the target source document that contributed the largest number of second corpora, and the corpora corresponding to that target source document are exchanged between the two associated second corpus sets.
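The exchange in steps 201 and 202 can be sketched as below, assuming each second corpus set is stored as a mapping from corpus text to source document identifier. The function name `swap_associated` and the sample data are illustrative assumptions.

```python
def swap_associated(second_set_a, second_set_b, target_doc):
    """Exchange corpora that come from `target_doc` between two associated second sets.

    Each set maps corpus text -> source document identifier.
    """
    from_a = {c: d for c, d in second_set_a.items() if d == target_doc}
    from_b = {c: d for c, d in second_set_b.items() if d == target_doc}
    for c in from_a:
        del second_set_a[c]
    for c in from_b:
        del second_set_b[c]
    second_set_a.update(from_b)   # second associated corpora move to group A
    second_set_b.update(from_a)   # first associated corpora move to group B


# Hypothetical second corpus sets of two associated corpus groups.
loan_second = {"early repayment fee": "manual.pdf", "collateral types": "loan_faq.pdf"}
account_second = {"dormant account fee": "manual.pdf", "close an account": "acct_faq.pdf"}
swap_associated(loan_second, account_second, target_doc="manual.pdf")
```

Corpora from other source documents stay where they are; only those derived from the target source document change sides.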
In one implementation, the method in this embodiment may further include the following steps, as shown in fig. 3:
Step 301: acquire at least one new corpus.
The new corpus may be a newly obtained corpus or a corpus exceeding a preset collection space of the first corpus collection and the second corpus collection.
Step 302: perform word segmentation on the new corpus to obtain the corpus keywords of each new corpus.
Specifically, in this embodiment, a word segmentation algorithm may be used to perform word segmentation processing on the new corpus to obtain corpus keywords.
Step 303: acquire the keyword repetition degrees of the corpus keywords in the first corpus set and in the second corpus set, respectively, of the corpus group corresponding to the new corpus.
Specifically, in this embodiment, the keyword repetition degree of the corpus keywords in the first corpus set corresponding to the service type to which they belong is calculated. For example, the keyword similarity between the corpus keywords and the first corpora in that first corpus set is compared, and the number of words whose keyword similarity is greater than or equal to the corresponding threshold is used as the keyword repetition degree.
In addition, in this embodiment, the keyword repetition degree of the corpus keywords in the second corpus set corresponding to the service type to which they belong is calculated. For example, the keyword similarity between the corpus keywords and the second corpora in that second corpus set is compared, and the number of words whose keyword similarity is greater than or equal to the corresponding threshold is used as the keyword repetition degree. The keyword repetition degree can be understood as the frequency with which the keywords occur.
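The calculation in steps 302 and 303 can be sketched as follows. The application names neither a segmentation algorithm nor a keyword similarity measure, so this sketch assumes a simple regular-expression tokenizer in place of real word segmentation and uses the standard library's `difflib.SequenceMatcher` ratio as the keyword similarity; the function names, threshold and sample texts are hypothetical.

```python
import re
from difflib import SequenceMatcher


def segment(text):
    """Stand-in word segmentation: lowercase alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def repetition_degree(new_corpus, corpus_set, sim_threshold=0.8):
    """Count keywords of `new_corpus` that closely match some word in `corpus_set`."""
    vocabulary = {w for corpus in corpus_set for w in segment(corpus)}
    degree = 0
    for keyword in set(segment(new_corpus)):
        if any(SequenceMatcher(None, keyword, w).ratio() >= sim_threshold
               for w in vocabulary):
            degree += 1
    return degree


# Hypothetical first and second corpus sets of one corpus group.
first_set = ["apply for a loan online", "loan interest rates"]
second_set = ["early repayment fee"]
new_corpus = "loan interest schedule"
deg_first = repetition_degree(new_corpus, first_set)
deg_second = repetition_degree(new_corpus, second_set)
```

For Chinese text, a dedicated word segmentation library would replace `segment`; the counting logic would stay the same.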
Step 304: add the new corpus to the first corpus set or the second corpus set of the corpus group corresponding to the new corpus, according to the keyword repetition degree.
For example, a new corpus whose keyword repetition degree is greater than or equal to a repetition degree threshold is added to the first corpus set corresponding to the service type to which the new corpus belongs, and a new corpus whose keyword repetition degree is less than the repetition degree threshold is added to the second corpus set corresponding to the service type to which the new corpus belongs.
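The routing rule in step 304 can be sketched as a single comparison. The text does not say which of the two repetition degrees is compared against the threshold, so this sketch assumes a single precomputed degree is passed in; the function name and sample values are hypothetical.

```python
def route_new_corpus(new_corpus, degree, first_set, second_set, repetition_threshold):
    """Append `new_corpus` to the first set if its degree reaches the threshold,
    otherwise to the second set; return which set received it."""
    target = first_set if degree >= repetition_threshold else second_set
    target.append(new_corpus)
    return "first" if target is first_set else "second"


# Hypothetical corpus sets and precomputed keyword repetition degrees.
first_set = ["apply for a loan online"]
second_set = ["early repayment fee"]
placed_a = route_new_corpus("loan interest schedule", 2, first_set, second_set,
                            repetition_threshold=2)
placed_b = route_new_corpus("collateral appraisal", 0, first_set, second_set,
                            repetition_threshold=2)
```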
In one implementation, the method in this embodiment may further include the following steps, as shown in fig. 4:
Step 401: acquire a target query request, where the target query request includes at least a query keyword.
Step 402: perform corpus queries in the first corpus set and the second corpus set corresponding to the target corpus group, respectively, by using the query keyword, to obtain a first query result and a second query result.
The target corpus group is the corpus group whose service type corresponds to the query keyword. For example, the query keyword is used to query the first corpus set of the target corpus group for corpora whose similarity to the query keyword is greater than or equal to a corresponding threshold, yielding a first query result that contains one or more corpora; likewise, the query keyword is used to query the second corpus set of the target corpus group for corpora whose similarity to the query keyword is greater than or equal to a corresponding threshold, yielding a second query result that contains one or more corpora.
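Step 402 can be sketched as running the same thresholded query against each of the two corpus sets. The similarity used here, the fraction of query keywords that appear in a corpus, is an assumption chosen for simplicity; the function name `query_set`, the threshold and the sample data are hypothetical.

```python
def query_set(query_keywords, corpus_set, sim_threshold=0.5):
    """Return (corpus, score) pairs whose keyword overlap reaches the threshold."""
    query = set(query_keywords)
    results = []
    for corpus in corpus_set:
        words = set(corpus.lower().split())
        # Assumed similarity: share of query keywords found in the corpus.
        score = len(query & words) / len(query) if query else 0.0
        if score >= sim_threshold:
            results.append((corpus, score))
    return results


# Hypothetical first and second corpus sets of the target corpus group.
first_set = ["apply for a loan online", "loan interest rates"]
second_set = ["early loan repayment fee", "collateral types"]
first_result = query_set(["loan", "interest"], first_set)
second_result = query_set(["loan", "interest"], second_set)
```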
Step 403: sort the corpora in the first query result and the corpora in the second query result according to corpus similarity to obtain a sorting result.
Specifically, in this embodiment, the corpora in the first query result may first be placed ahead of the corpora in the second query result. Within the first query result, the corpora are sorted from largest to smallest keyword similarity between each corpus and the query keyword, and the corpora in the second query result are sorted in the same way. Then, taking the corpora in the first query result as a reference, any corpus in the second query result whose corpus similarity to a corpus in the first query result is greater than or equal to a corresponding threshold is moved to the position adjacent to that corpus in the first query result. In other words, corpora in the second query result that are highly similar to corpora in the first query result are moved forward in the ordering.
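The ordering rule in step 403 can be sketched as follows. The corpus similarity measure is not specified in the application, so word-level Jaccard similarity is assumed; the function names, the adjacency threshold and the sample results are hypothetical.

```python
def rank_results(first_result, second_result, corpus_similarity, adjacency_threshold=0.5):
    """Merge two scored result lists into one ordered list of corpora."""
    first_sorted = [c for c, _ in sorted(first_result, key=lambda r: -r[1])]
    second_sorted = [c for c, _ in sorted(second_result, key=lambda r: -r[1])]

    ranking, promoted = [], set()
    for anchor in first_sorted:
        ranking.append(anchor)
        for candidate in second_sorted:
            if (candidate not in promoted
                    and corpus_similarity(anchor, candidate) >= adjacency_threshold):
                ranking.append(candidate)   # placed right after the similar first-result corpus
                promoted.add(candidate)
    # Second-result corpora that were not promoted keep their order at the end.
    ranking.extend(c for c in second_sorted if c not in promoted)
    return ranking


def word_jaccard(a, b):
    """Assumed corpus similarity: Jaccard overlap of whitespace-separated words."""
    wa, wb = set(a.split()), set(b.split())
    union = wa | wb
    return len(wa & wb) / len(union) if union else 0.0


# Hypothetical scored query results: (corpus, keyword similarity).
first_result = [("loan interest rates", 1.0), ("apply for a loan online", 0.5)]
second_result = [("collateral types", 0.5), ("loan interest rates history", 0.4)]
ordered = rank_results(first_result, second_result, word_jaccard)
```

In this example "loan interest rates history" is pulled up next to "loan interest rates", even though it had the lowest keyword similarity in the second query result.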
Step 404: and outputting the language material in the first query result and the language material in the second query result according to the sequencing result.
For example, the corpora in the first query result and the corpora in the second query result are output in the order from front to back in the ranking result, so as to facilitate use.
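Steps 402 to 404 can be illustrated with a minimal Python sketch. The text does not specify a similarity measure, so the token-overlap function `jaccard`, the function name `rank_results`, and the default value of `adjacency_threshold` are all assumptions made only for illustration.

```python
def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; a stand-in for the unspecified measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def rank_results(query, first_result, second_result,
                 adjacency_threshold=0.75, sim=jaccard):
    """Order both query results as described for step 403 (hypothetical helper)."""
    by_query = lambda corpus: sim(corpus, query)
    # Each result is sorted by keyword similarity, from large to small.
    first_sorted = sorted(first_result, key=by_query, reverse=True)
    second_sorted = sorted(second_result, key=by_query, reverse=True)

    followers = [[] for _ in first_sorted]  # second-result corpora moved forward
    leftover = []                           # second-result corpora left at the end
    for corpus in second_sorted:
        scores = [sim(corpus, ref) for ref in first_sorted]
        if scores and max(scores) >= adjacency_threshold:
            # Place it next to the most similar corpus of the first result.
            followers[scores.index(max(scores))].append(corpus)
        else:
            leftover.append(corpus)

    ranking = []
    for ref, attached in zip(first_sorted, followers):
        ranking.append(ref)
        ranking.extend(attached)
    return ranking + leftover
```

In this sketch, a corpus from the second query result that does not reach the threshold keeps its place after all corpora of the first query result, which matches the default ordering described in the embodiment.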
Referring to fig. 5, a schematic structural diagram of an apparatus for processing corpus data according to a second embodiment of the present application is shown. The apparatus may be configured in an electronic device capable of data processing, such as a computer or a server. The technical solution in this embodiment is mainly used to improve the user experience of the application system corresponding to the system database.
Specifically, the apparatus in this embodiment may include the following units:
a data obtaining unit 501, configured to obtain a system database corresponding to an application system, where the application system can obtain, in the system database, a corpus corresponding to a query request according to the query request; the system database at least comprises a plurality of corpus groups, and each corpus group corresponds to a service type respectively; each corpus group comprises at least one target corpus;
a corpus dividing unit 502, configured to divide the target corpus into a first corpus set and a second corpus set according to the corpus hit rate of the target corpus for each corpus group; the first corpus set comprises at least one first corpus, the second corpus set comprises at least one second corpus, and the corpus hit rate of the first corpus is greater than that of the second corpus;
a corpus processing unit 503, configured to add or delete corpora in the second corpus set corresponding to a second corpus group by using the corpora in the second corpus set corresponding to a first corpus group, where the first corpus group and the second corpus group have an association relationship.
As can be seen from the foregoing solution, in the apparatus for processing corpus data provided in the second embodiment of the present application, the target corpora in the system database are first divided into two corpus sets according to corpus hit rate, and the corpora in one of the corpus sets are then used to add or delete corpora in the corresponding corpus set of the corpus group for an associated service type. This adjusts the corpora in the corpus group corresponding to each service type in the system database and thereby improves the user experience of the application system corresponding to the system database.
In one implementation, the associating relationship between the first corpus group and the second corpus group includes: the corpus group similarity between the first corpus group and the second corpus group is greater than the corpus group similarity between the first corpus group and other corpus groups in the plurality of corpus groups; and the corpus group similarity between the first corpus group and the second corpus group is greater than the corpus group similarity between the second corpus group and other corpus groups in the plurality of corpus groups.
In one implementation, the corpus group similarity between the first corpus group and the second corpus group is the overall similarity obtained as a weighted average of a first set similarity and a second set similarity, each weighted by its corresponding weight; wherein the first set similarity is the similarity between the first corpus set in the first corpus group and the first corpus set in the second corpus group, and the second set similarity is the similarity between the second corpus set in the first corpus group and the second corpus set in the second corpus group.
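The weighted average and the mutual "most similar" condition for association can be sketched as follows. This is only an illustration: the function names, the default weights, and the use of `frozenset` pairs as dictionary keys are assumptions, and the set similarities are taken as given inputs because the text does not say how they are computed.

```python
def group_similarity(first_set_sim, second_set_sim, w_first=0.5, w_second=0.5):
    """Weighted average of the two set similarities (weights are assumed values)."""
    return (w_first * first_set_sim + w_second * second_set_sim) / (w_first + w_second)


def associated_pairs(similarity):
    """Return the pairs of corpus groups that are each other's most similar group.

    `similarity` maps frozenset({group_a, group_b}) to a corpus group similarity.
    """
    groups = sorted({g for pair in similarity for g in pair})
    best = {}
    for g in groups:
        others = [o for o in groups if o != g and frozenset((g, o)) in similarity]
        if others:
            best[g] = max(others, key=lambda o: similarity[frozenset((g, o))])
    # Two groups are associated only if each one is the other's best match.
    return {frozenset((g, b)) for g, b in best.items() if best.get(b) == g}
```

The text requires the similarity to be strictly greater than that with every other group; this sketch does not treat ties specially, so equal similarities would need an explicit rule in a real system.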
In one implementation, when the corpus in the second corpus set corresponding to the second corpus group is added by using the corpus in the second corpus set corresponding to the first corpus group, the corpus processing unit 503 is specifically configured to: acquiring a third corpus of which the corpus hit rate is greater than or equal to a first threshold value from second corpuses contained in a second corpus set corresponding to the first corpus group; and adding the third corpus to a second corpus set corresponding to a second corpus group.
In an implementation manner, when the corpus in the second corpus set corresponding to the second corpus group is deleted by using the corpus in the second corpus set corresponding to the first corpus group, the corpus processing unit 503 is specifically configured to: deleting a fourth corpus from a second corpus set corresponding to a second corpus group, where the fourth corpus is a corpus added to the second corpus group from the second corpus set corresponding to the first corpus group under the condition that a corpus hit rate is greater than or equal to a first threshold, and the corpus hit rate of the fourth corpus in the second corpus set corresponding to the first corpus group is reduced from being greater than or equal to the first threshold to being less than the first threshold.
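The add and delete operations on the third and fourth corpus can be sketched together as one synchronization pass. The names `sync_second_sets` and `borrowed`, and the representation of corpora as string identifiers with a hit-rate dictionary, are assumptions for illustration only.

```python
def sync_second_sets(source_hit_rates, target_set, borrowed, first_threshold):
    """Propagate corpora from the first group's second corpus set to the second group's.

    source_hit_rates: corpus id -> hit rate in the first group's second corpus set.
    target_set:       second corpus set of the associated (second) corpus group.
    borrowed:         ids previously added to target_set by this procedure.
    """
    for corpus_id, rate in source_hit_rates.items():
        if rate >= first_threshold:
            # "Third corpus": hit rate reaches the threshold, so add it.
            if corpus_id not in target_set:
                target_set.add(corpus_id)
                borrowed.add(corpus_id)
        elif corpus_id in borrowed:
            # "Fourth corpus": it was added earlier, but its hit rate has dropped
            # below the threshold, so remove it again.
            target_set.discard(corpus_id)
            borrowed.discard(corpus_id)
    return target_set, borrowed
```

Tracking `borrowed` separately ensures that only corpora added by this mechanism are deleted later; corpora that belonged to the second group's set from the start are never touched.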
In one implementation, the apparatus in this embodiment may further include the following units, as shown in fig. 6:
a corpus moving unit 504, configured to obtain a first associated corpus from the second corpus set corresponding to the first corpus group; obtain a second associated corpus from the second corpus set corresponding to the second corpus group, where the second associated corpus and the first associated corpus are both derived from a target source document, the target source document being one of a plurality of source documents corresponding to the system database for which the number of target corpora generated from it satisfies a target screening condition; move the first associated corpus to the second corpus set corresponding to the second corpus group; and move the second associated corpus to the second corpus set corresponding to the first corpus group.
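A minimal sketch of this exchange follows. The target screening condition is not defined in the text; here it is assumed to mean "the source document that produced the most corpora across the two second corpus sets". The function name `swap_by_source` and the `source_of` mapping are hypothetical.

```python
from collections import Counter


def swap_by_source(second_set_m, second_set_n, source_of):
    """Exchange corpora that come from the dominant source document.

    source_of maps each corpus id to the id of the source document it came from.
    Returns the two updated sets; the inputs are not modified.
    """
    counts = Counter(source_of[c] for c in second_set_m | second_set_n)
    if not counts:
        return set(second_set_m), set(second_set_n)
    # Assumed screening condition: the document with the largest corpus count.
    target_doc = counts.most_common(1)[0][0]

    from_m = {c for c in second_set_m if source_of[c] == target_doc}
    from_n = {c for c in second_set_n if source_of[c] == target_doc}
    new_m = (second_set_m - from_m) | from_n
    new_n = (second_set_n - from_n) | from_m
    return new_m, new_n
```

If several source documents tie for the largest count, this sketch picks one of them arbitrarily, so a real implementation would need a tie-breaking rule.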
In one implementation, the corpus dividing unit 502 is further configured to: obtain at least one new corpus; perform word segmentation on the new corpus to obtain the corpus keywords of each new corpus; obtain the keyword repetition degree of the corpus keywords in the first corpus set and in the second corpus set of the corpus group corresponding to the new corpus, respectively; and add the new corpus to the first corpus set or the second corpus set of the corpus group corresponding to the new corpus according to the keyword repetition degree.
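The placement of a new corpus can be sketched as below. Two points are assumptions rather than statements from the text: word segmentation is replaced by simple whitespace splitting, and the new corpus is added to whichever set shows the higher keyword repetition degree (the text only says the choice depends on the repetition degree).

```python
def keyword_repetition(keywords, corpus_set):
    """Fraction of the keywords that already occur somewhere in corpus_set."""
    vocabulary = {token for corpus in corpus_set for token in corpus.lower().split()}
    if not keywords:
        return 0.0
    return sum(1 for k in keywords if k in vocabulary) / len(keywords)


def place_new_corpus(new_corpus, first_set, second_set):
    """Append new_corpus to one of the two lists and report which one was chosen."""
    keywords = set(new_corpus.lower().split())  # stand-in for word segmentation
    rep_first = keyword_repetition(keywords, first_set)
    rep_second = keyword_repetition(keywords, second_set)
    # Assumed rule: join the set whose existing keywords overlap more.
    if rep_first >= rep_second:
        first_set.append(new_corpus)
        return "first"
    second_set.append(new_corpus)
    return "second"
```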
In one implementation, the apparatus in this embodiment may further include the following units, as shown in fig. 7:
a query processing unit 505, configured to obtain a target query request, where the target query request at least includes a query keyword; perform a corpus query in the first corpus set and in the second corpus set corresponding to the target corpus group, respectively, by using the query keyword, to obtain a first query result and a second query result, where the target corpus group is the corpus group corresponding to the service type and the query keyword; sort the corpora in the first query result and the corpora in the second query result according to corpus similarity to obtain a sorting result; and output the corpora in the first query result and the corpora in the second query result according to the sorting result.
It should be noted that, for the specific implementation of each unit in the present embodiment, reference may be made to the corresponding content in the foregoing, and details are not described here.
Referring to fig. 8, a schematic structural diagram of an electronic device provided in a third embodiment of the present application is shown, where the electronic device may include:
a memory 801 for storing a computer program and data generated by the execution of the computer program;
a processor 802 for executing the computer program to implement: obtaining a system database corresponding to an application system, where the application system can obtain, in the system database, corpora corresponding to a query request; the system database at least comprises a plurality of corpus groups, and each corpus group corresponds to a service type; each corpus group comprises at least one target corpus; for each corpus group, dividing the target corpus into a first corpus set and a second corpus set according to the corpus hit rate of the target corpus; the first corpus set comprises at least one first corpus, the second corpus set comprises at least one second corpus, and the corpus hit rate of the first corpus is greater than that of the second corpus; and adding or deleting corpora in the second corpus set corresponding to a second corpus group by using the corpora in the second corpus set corresponding to a first corpus group, where the first corpus group and the second corpus group have an association relationship.
According to the above solution, in the electronic device provided in the third embodiment of the present application, the target corpora in the system database are first divided into two corpus sets according to corpus hit rate, and the corpora in one of the corpus sets are then used to add or delete corpora in the corresponding corpus set of the corpus group for an associated service type. This adjusts the corpora in the corpus group corresponding to each service type in the system database and thereby improves the user experience of the application system corresponding to the system database.
Take the customer service module of a mobile banking application as an example. To make selection easier for customers and to improve question-answering accuracy, several service modules are built in, such as mobile banking, loans, personal banking, and account management, and different question-and-answer corpora are provided under each service label. However, the corpus labels are mostly assigned manually and cannot be adjusted promptly according to actual usage, which to some extent affects the accuracy of the customer service module and the experience of using the corpora.
In view of this, the present application establishes a fuzzy-boundary corpus classification adjustment scheme for mobile banking. Specific corpora are stored in fuzzy areas, and the fuzzy areas of multiple classifications are associated with one another so that corpus classification can be adjusted; corpora are dynamically copied or reclassified according to customer usage. This improves corpus retrieval accuracy, promotes automated corpus classification, and improves the experience of using mobile banking customer service. Specifically, the present application mainly includes the following three parts:
1. Fuzzy area filling: based on the corpus labels in the database, sort the corpora by hit frequency and select specific corpora to fill the fuzzy area.
2. Fuzzy area association: associate the fuzzy areas of the corpora under multiple classification labels so that corpora can circulate among the areas.
3. Corpus optimization and adjustment: monitor the corpora of the multiple classifications, and perform cross-classification adjustment and corpus regression on them.
The specific scheme is as follows:
First, while the customer service module of the mobile bank is in service, the corpus data in the system database is filtered by classification label (service type). The resulting corpus data is divided into intervals according to the hit rate data collected by the mobile banking front end: corpus data whose hit rate is greater than the system threshold a (the hit threshold) is placed in the conventional data area (the first corpus set, also called the conventional area), and corpora whose hit rate is less than the system threshold a are stored in the fuzzy data area (the second corpus set, also called the fuzzy area).
For the corpus data of each classification, the size of the fuzzy area is set as a certain proportion of the amount of data currently in that classification. For data that exceeds the fuzzy area, the corpora are segmented into words and the keyword repetition degree with respect to the fuzzy area corpora is calculated; data with a low repetition degree is moved to the conventional data area, and the remaining data is filled into the fuzzy area.
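The division by hit threshold and the size limit on the fuzzy area can be sketched as follows. Several details are assumptions: a hit rate exactly equal to the threshold is sent to the conventional area, whitespace splitting stands in for word segmentation, and when the fuzzy area overflows, the corpora with the lowest keyword repetition are the ones moved to the conventional area. The function name and parameters are hypothetical.

```python
def partition_group(hit_rates, hit_threshold, fuzzy_ratio):
    """Split one classification's corpora into (conventional, fuzzy) lists.

    hit_rates:   corpus text -> hit rate for this classification.
    fuzzy_ratio: fuzzy area capacity as a fraction of the classification size.
    """
    conventional = [c for c, r in hit_rates.items() if r >= hit_threshold]
    fuzzy = [c for c, r in hit_rates.items() if r < hit_threshold]

    capacity = int(len(hit_rates) * fuzzy_ratio)
    if len(fuzzy) <= capacity:
        return conventional, fuzzy

    def repetition(corpus):
        # Share of this corpus's tokens that also occur in other fuzzy candidates.
        tokens = set(corpus.lower().split())
        others = {t for o in fuzzy if o != corpus for t in o.lower().split()}
        return len(tokens & others) / len(tokens) if tokens else 0.0

    ranked = sorted(fuzzy, key=repetition, reverse=True)
    # Keep the most repetitive corpora in the fuzzy area, move the rest out.
    return conventional + ranked[capacity:], ranked[:capacity]
```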
Second, an association relationship is established for each fuzzy area. For the multiple classifications, the text similarity of the conventional area data and the text similarity of the fuzzy area data are calculated, and a weighted average of the two gives the overall similarity between classification corpora. The overall similarities are sorted, and fuzzy area associations are established between adjacent classification corpora.
Data is then exchanged between associated fuzzy areas. That is, if fuzzy area M is associated with fuzzy area N, part of the data in fuzzy area M is exchanged into fuzzy area N. The specific exchange method is as follows: the corpora in the fuzzy area are compared one by one with the corpora of the associated classification, the inclusion relationship of the business source documents of the current corpora is calculated, and the data whose source documents are associated most often is exchanged between the fuzzy areas.
Finally, when a mobile banking customer performs classified data retrieval, corpora are matched according to the following steps:
1. Perform customer question-and-answer retrieval on the conventional data area of the classified corpora;
2. Perform question-and-answer retrieval on the corpora of the fuzzy area;
3. For corpora that appear in both step 1 and step 2, add their retrieval similarities together, and return the results after a second sorting;
4. Record the customer's retrieval usage. Issue an offline reminder for corpora in the conventional data area that fall below the system threshold b. For corpus data hit in the fuzzy area, record the number of uses under the multiple classifications; if the usage frequencies under the classifications are all greater than the system threshold, copy the fuzzy area data into the two associated classification areas and establish a primary/standby relationship between the two corpora in the fuzzy area. If the fuzzy area corpus condition is no longer satisfied in subsequent use, adjust the primary/standby relationship and return the corpus to its classification.
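Steps 1 to 3 of the matching procedure can be sketched as follows. The token-overlap measure `overlap`, the function name `retrieve`, and the rule that only corpora with a non-zero score count as hits are assumptions; the text does not name a retrieval similarity.

```python
def overlap(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; a stand-in for the real retrieval score."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


def retrieve(query, conventional_area, fuzzy_area, sim=overlap):
    """Search both areas, add the scores of corpora found in both, then re-sort."""
    scores = {}
    for area in (conventional_area, fuzzy_area):  # step 1, then step 2
        for corpus in area:
            score = sim(query, corpus)
            if score > 0:
                # Step 3: a corpus hit in both areas accumulates both scores.
                scores[corpus] = scores.get(corpus, 0.0) + score
    # Second sorting over the combined scores, highest first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Adding the two scores gives corpora that sit in both the conventional area and the fuzzy area a ranking boost, which is one plausible reading of "retrieval similarity addition".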
Thus, the mobile banking fuzzy-boundary corpus classification adjustment scheme sets fuzzy areas for the corpora of multiple classifications, establishes associations among the fuzzy areas, and monitors corpus usage during use so as to adjust corpus classification.
The corpus data processing method, the corpus data processing apparatus and the electronic device can be used in the big data field or in other fields, for example, in a corpus retrieval scenario in the big data field. Other fields are any fields other than the financial field, for example, the distributed computing field, the cloud computing field, the artificial intelligence field, and the Internet of Things field. The above description is only an example and does not limit the application fields of the invention.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. The device disclosed by the embodiment corresponds to the method disclosed by the embodiment, so that the description is simple, and the relevant points can be referred to the method part for description.
Those of skill would further appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that the various illustrative components and steps have been described above generally in terms of their functionality in order to clearly illustrate this interchangeability of hardware and software. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in Random Access Memory (RAM), memory, Read-Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
Claims (10)
1. A corpus data processing method, the method comprising:
obtaining a system database corresponding to an application system, wherein the application system is able to obtain, from the system database, corpora corresponding to a query request; the system database at least comprises a plurality of corpus groups, and each corpus group corresponds to a service type; each corpus group comprises at least one target corpus;
for each corpus group, dividing the target corpus into a first corpus set and a second corpus set according to the corpus hit rate of the target corpus; the first corpus set comprises at least one first corpus, the second corpus set comprises at least one second corpus, and the corpus hit rate of the first corpus is greater than that of the second corpus;
and adding or deleting corpora in the second corpus set corresponding to a second corpus group by using the corpora in the second corpus set corresponding to a first corpus group, wherein the first corpus group and the second corpus group have an association relationship.
2. The method of claim 1, wherein the first corpus group and the second corpus group have an associative relationship therebetween, comprising:
the corpus group similarity between the first corpus group and the second corpus group is greater than the corpus group similarity between the first corpus group and other corpus groups in the plurality of corpus groups;
and the corpus group similarity between the first corpus group and the second corpus group is greater than the corpus group similarity between the second corpus group and other corpus groups in the plurality of corpus groups.
3. The method of claim 2, wherein the corpus group similarity between the first corpus group and the second corpus group is:
the overall similarity obtained as a weighted average of a first set similarity and a second set similarity, each weighted by its corresponding weight;
wherein the first set similarity is the similarity between the first corpus set in the first corpus group and the first corpus set in the second corpus group; and the second set similarity is the similarity between the second corpus set in the first corpus group and the second corpus set in the second corpus group.
4. The method according to claim 1 or 2, wherein adding corpora to the second corpus set corresponding to the second corpus group by using the corpora in the second corpus set corresponding to the first corpus group comprises:
acquiring, from the second corpora contained in the second corpus set corresponding to the first corpus group, a third corpus whose corpus hit rate is greater than or equal to a first threshold; and adding the third corpus to the second corpus set corresponding to the second corpus group.
5. The method according to claim 1 or 2, wherein deleting corpora from the second corpus set corresponding to the second corpus group by using the corpora in the second corpus set corresponding to the first corpus group comprises:
deleting a fourth corpus from a second corpus set corresponding to a second corpus group, where the fourth corpus is a corpus added to the second corpus group from the second corpus set corresponding to the first corpus group under the condition that a corpus hit rate is greater than or equal to a first threshold, and the corpus hit rate of the fourth corpus in the second corpus set corresponding to the first corpus group is reduced from being greater than or equal to the first threshold to being less than the first threshold.
6. The method according to claim 1 or 2, characterized in that the method further comprises:
acquiring a first associated corpus from a second corpus set corresponding to the first corpus group;
acquiring a second associated corpus from the second corpus set corresponding to the second corpus group; wherein the second associated corpus and the first associated corpus are both derived from a target source document, the target source document being one of a plurality of source documents corresponding to the system database for which the number of target corpora generated from it satisfies a target screening condition;
moving the first associated corpus to the second corpus set corresponding to the second corpus group;
and moving the second associated corpus to the second corpus set corresponding to the first corpus group.
7. The method of claim 1 or 2, further comprising:
acquiring at least one new corpus;
performing word segmentation on the new corpus to obtain corpus keywords of each new corpus;
acquiring keyword repetition degrees of the corpus keywords in a first corpus set and a second corpus set corresponding to the corpus group corresponding to the new corpus respectively;
and adding the new corpus to the first corpus set or the second corpus set of the corpus group corresponding to the new corpus according to the keyword repetition degree.
8. The method of claim 1 or 2, further comprising:
acquiring a target query request, wherein the target query request at least comprises query keywords;
performing a corpus query in the first corpus set and in the second corpus set corresponding to a target corpus group, respectively, by using the query keyword, to obtain a first query result and a second query result; the target corpus group is the corpus group corresponding to the service type and the query keyword;
sorting the corpora in the first query result and the corpora in the second query result according to corpus similarity to obtain a sorting result;
and outputting the corpora in the first query result and the corpora in the second query result according to the sorting result.
9. A corpus data processing apparatus, comprising:
a data obtaining unit, configured to obtain a system database corresponding to an application system, wherein the application system is able to obtain, from the system database, corpora corresponding to a query request; the system database at least comprises a plurality of corpus groups, and each corpus group corresponds to a service type; each corpus group comprises at least one target corpus;
a corpus dividing unit, configured to divide the target corpus into a first corpus set and a second corpus set according to the corpus hit rate of the target corpus for each corpus group; the first corpus set comprises at least one first corpus, the second corpus set comprises at least one second corpus, and the corpus hit rate of the first corpus is greater than that of the second corpus;
and the corpus processing unit is used for adding or deleting the corpus in the second corpus set corresponding to the second corpus group by using the corpus in the second corpus set corresponding to the first corpus group, and the first corpus group and the second corpus group have an association relation.
10. An electronic device, comprising:
a memory for storing a computer program and data generated by the execution of the computer program;
a processor for executing the computer program to implement: obtaining a system database corresponding to an application system, wherein the application system is able to obtain, from the system database, corpora corresponding to a query request; the system database at least comprises a plurality of corpus groups, and each corpus group corresponds to a service type; each corpus group comprises at least one target corpus; for each corpus group, dividing the target corpus into a first corpus set and a second corpus set according to the corpus hit rate of the target corpus; the first corpus set comprises at least one first corpus, the second corpus set comprises at least one second corpus, and the corpus hit rate of the first corpus is greater than that of the second corpus; and adding or deleting corpora in the second corpus set corresponding to a second corpus group by using the corpora in the second corpus set corresponding to a first corpus group, wherein the first corpus group and the second corpus group have an association relationship.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211052774.XA CN115391539A (en) | 2022-08-31 | 2022-08-31 | Corpus data processing method and device and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211052774.XA CN115391539A (en) | 2022-08-31 | 2022-08-31 | Corpus data processing method and device and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN115391539A true CN115391539A (en) | 2022-11-25 |
Family
ID=84123967
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211052774.XA Pending CN115391539A (en) | 2022-08-31 | 2022-08-31 | Corpus data processing method and device and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115391539A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117473069A (en) * | 2023-12-26 | 2024-01-30 | 深圳市明源云客电子商务有限公司 | Business corpus generation method, device and equipment and computer readable storage medium |
CN117473069B (en) * | 2023-12-26 | 2024-04-12 | 深圳市明源云客电子商务有限公司 | Business corpus generation method, device and equipment and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||