CN111177379B - Classification method for low-precision problems, intelligent terminal and computer readable storage medium - Google Patents

Classification method for low-precision problems, intelligent terminal and computer readable storage medium

Info

Publication number
CN111177379B
CN111177379B CN201911330731.1A CN201911330731A
Authority
CN
China
Prior art keywords
candidate
question
main
target
questions
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911330731.1A
Other languages
Chinese (zh)
Other versions
CN111177379A (en)
Inventor
陈栋梁
熊友军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Youbixuan Intelligent Robot Co ltd
Original Assignee
Ubtech Robotics Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ubtech Robotics Corp filed Critical Ubtech Robotics Corp
Priority to CN201911330731.1A priority Critical patent/CN111177379B/en
Publication of CN111177379A publication Critical patent/CN111177379A/en
Application granted granted Critical
Publication of CN111177379B publication Critical patent/CN111177379B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application discloses a classification method for low-precision questions, an intelligent terminal and a computer-readable storage medium. The classification method for low-precision questions comprises the following steps: acquiring a plurality of candidate main questions matched with a target question; performing data analysis on the main similar question set table corresponding to each candidate main question, so as to amplify the differences among the plurality of candidate main questions; and acquiring the similarity between the target question and each candidate main question after the differences are amplified, so as to classify the target question under the candidate main question with the highest similarity. In this way, by amplifying the differences among the candidate main questions, a low-precision question can be accurately classified under the candidate main question with the highest similarity among the plurality of high-precision single-match main questions it matches, so that the corpus of the corresponding question-answering system can be enriched and its capability for differentiated recognition and processing can be enhanced.

Description

Classification method for low-precision problems, intelligent terminal and computer readable storage medium
Technical Field
The present disclosure relates to the field of low-precision problem classification, and in particular, to a low-precision problem classification method, an intelligent terminal, and a computer readable storage medium.
Background
A question-answering system, commonly implemented as an FAQ (Frequently Asked Questions) system, is an advanced form of information retrieval: it can understand questions posed by users in natural language more accurately and, by retrieving a question-answering knowledge base, return concise and accurate matching answers. Compared with a search engine, a question-answering system can better understand the real intent behind a user's question and therefore satisfy the user's information-retrieval needs more effectively. Question-answering systems are currently a research direction that attracts wide attention and has broad development prospects in the fields of artificial intelligence and natural language processing.
To enhance the robustness of automatic response, an FAQ system needs to add similar questions to the existing questions in its knowledge base so as to strengthen the generalization capability of those questions, and to add new question-answer pairs for user questions that none of the existing questions can match and answer.
Among these, low-precision questions contain value that cannot be used directly and must first be refined: because such a question may match a plurality of high-precision single-match main questions, with the possibility of a matching error, it needs to be accurately classified under the best-matching one of those main questions so that an accurate answer can be obtained.
Disclosure of Invention
The application provides a classification method for low-precision questions, an intelligent terminal and a computer-readable storage medium. The classification method can effectively solve the problem that a low-precision question cannot be accurately classified under the best-matching one of a plurality of high-precision single-match main questions.
To solve the above technical problem, one technical solution adopted by the application is to provide a classification method for low-precision questions, comprising the following steps: acquiring a plurality of candidate main questions matched with a target question; performing data analysis on the main similar question set table corresponding to each candidate main question, so as to amplify the differences among the plurality of candidate main questions; and acquiring the similarity between the target question and each candidate main question after the differences are amplified, so as to classify the target question under the candidate main question with the highest similarity.
Wherein, after the step of acquiring the plurality of candidate main questions matched with the target question and before the step of performing data analysis on the main similar question set table corresponding to each candidate main question so as to amplify the differences among the plurality of candidate main questions, the method further comprises: acquiring the main similar question set table of each candidate main question.
Wherein, the step of performing data analysis on the main similar question set table corresponding to each candidate main question so as to amplify the differences among the plurality of candidate main questions comprises: performing word segmentation on the main similar question set table corresponding to each candidate main question and performing word frequency statistics, so as to amplify the differences among the plurality of candidate main questions. The step of acquiring the similarity between the target question and each candidate main question after the differences are amplified, so as to classify the target question under the candidate main question with the highest similarity, comprises: acquiring, according to the word frequency statistics, the average hit amount of the word segments of each candidate main question in its main similar question set, so as to classify the target question under the candidate main question with the highest average hit amount.
Wherein, the step of acquiring, according to the word frequency statistics, the average hit amount of the word segments of each candidate main question in its main similar question set, so as to classify the target question under the candidate main question with the highest average hit amount, comprises: acquiring, according to the word frequency statistics, the average hit amount of the word segments of each candidate main question in its main similar question set and the corresponding coordinates; performing word segmentation on the target question, so as to calculate the relative coordinate value of the target question's word segments on the main similar question set of each candidate main question; and, if the average hit amount of the candidate main question with the highest average hit amount is higher than the sum of the second- and third-highest average hit amounts and the relative coordinate value lies within a preset interval threshold, classifying the target question under the candidate main question with the highest average hit amount.
Wherein, the step of acquiring, according to the word frequency statistics, the average hit amount and the corresponding coordinate distribution of the word segments of each candidate main question in its main similar question set comprises: acquiring, according to the word frequency statistics, the average hit amount of the word segments of each candidate main question in its main similar question set, as well as the mean and variance of the average hit amounts of the word segments of every question in that main similar question set, so as to further calculate the coordinates of each candidate main question's word segments in its main similar question set.
Wherein, the classification method for low-precision questions further comprises: if the average hit amount of the candidate main question with the highest average hit amount is lower than the sum of the second- and third-highest average hit amounts, or the relative coordinate value does not lie within the preset interval threshold, the target question is not classified.
Wherein, the preset interval threshold is [-2, 2].
Wherein, the step of acquiring the main similar question set table of each candidate main question comprises: acquiring the main similar question set table of each candidate main question and the identity identification number of each candidate main question. The step of acquiring the similarity between the target question and each candidate main question after the differences are amplified, so as to classify the target question under the candidate main question with the highest similarity, comprises: acquiring the similarity between the target question and each candidate main question after the differences are amplified, so as to obtain the identity identification number of the candidate main question with the highest similarity to the target question, and classifying the target question under the corresponding candidate main question according to that identity identification number.
To solve the above technical problem, another technical solution adopted by the application is to provide an intelligent terminal comprising a memory and a processor coupled to each other; the memory stores program data, and the processor is configured to execute the program data to implement the classification method for low-precision questions described in any of the above.
To solve the above technical problem, another technical solution adopted by the application is to provide a computer-readable storage medium storing program data that can be executed to implement the classification method for low-precision questions described in any of the above.
The beneficial effects of the application are as follows: in contrast to the prior art, the classification method for low-precision questions of the application performs data analysis on the main similar question set table corresponding to each candidate main question of a low-precision question so as to amplify the differences among the plurality of candidate main questions, then acquires the similarity between the target question and each candidate main question after the differences are amplified, and classifies the target question under the candidate main question with the highest similarity. In this way, a low-precision question can be accurately classified under the candidate main question with the highest similarity among the plurality of high-precision single-match main questions it matches, so that the corpus of the corresponding question-answering system can be enriched and its capability for differentiated recognition and processing can be enhanced.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:
FIG. 1 is a flow chart of a first embodiment of a method for classifying low-precision problems of the present application;
FIG. 2 is a flow chart of a second embodiment of a method for classifying low-precision problems of the present application;
FIG. 3 is a flow chart of a third embodiment of a method for classifying low-precision problems of the present application;
FIG. 4 is a flow chart of a fourth embodiment of a method for classifying low-precision problems of the present application;
FIG. 5 is a flow chart of a fifth embodiment of a method for classifying low-precision problems of the present application;
FIG. 6 is a flow chart of a sixth embodiment of a method for classifying low-precision problems of the present application;
FIG. 7 is a flow chart of a seventh embodiment of a method for classifying low-precision problems of the present application;
FIG. 8 is a flow chart of a method for classifying low-precision problems according to the present application in a specific application scenario;
FIG. 9 is a schematic structural diagram of an embodiment of a smart terminal of the present application;
FIG. 10 is a schematic diagram of an embodiment of a computer-readable storage medium of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
Referring to fig. 1, fig. 1 is a flowchart illustrating a first embodiment of a method for classifying a low-precision problem according to the present application. The embodiment comprises the following steps:
s110: and acquiring a plurality of candidate main questions matched with the target question.
In the present embodiment, a plurality of candidate main questions having a similarity to the target question to some extent are first acquired. The target questions can be segmented to be matched with each question in the knowledge base of the corresponding question-answering system one by one, and main questions similar to the target questions are selected to be used as candidate main questions of the target questions, and specifically, the main questions comprising the same segmented words can be determined to be candidate main questions. It will be appreciated that for low precision target questions, there may be multiple candidate master questions that match the target questions, as there may be multiple master questions with the same word segmentation.
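The following is a minimal Python sketch of this candidate-retrieval step, offered only as an illustration: the jieba segmenter and the knowledge_base dictionary (main-question id mapped to main-question text) are assumptions for the example, not structures defined by the application.

    # Hedged sketch: select as candidates the main questions that share at least
    # one word segment with the target question. jieba is one possible segmenter.
    import jieba

    def find_candidate_main_questions(target_question, knowledge_base):
        """knowledge_base: {main_question_id: main_question_text} (illustrative)."""
        target_segments = set(jieba.lcut(target_question))
        candidates = []
        for main_id, main_text in knowledge_base.items():
            if target_segments & set(jieba.lcut(main_text)):  # shared word segments
                candidates.append(main_id)
        return candidates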
S120: and carrying out data analysis on the major likelihood problem set table corresponding to each candidate major problem so as to amplify the difference among the plurality of candidate major problems.
In this embodiment, after a plurality of candidate main questions matched with the target question are obtained, the data analysis may be further performed on the main similar question set table corresponding to each candidate main question. For example, the method can count and obtain the mathematical distribution of the corresponding word by word segmentation on the main similar problem set table of each candidate main problem, and take the mathematical distribution of the word segmentation of the main similar problem set table of each candidate main problem as the reference data of the similarity between each candidate main problem and the target problem, so as to amplify the difference between the plurality of candidate main problems.
S130: and obtaining the similarity between the target problem and each candidate main problem after the difference is amplified so as to classify the target problem as the candidate main problem with the highest similarity.
In this embodiment, after the differences between the plurality of candidate main questions are amplified by the data analysis performed on the main likelihood question set table of each candidate main question, the similarity between the target question and each candidate main question amplified by the differences may be quantified, so that the plurality of candidate main questions are ranked according to the similarity obtained by the data analysis, so that the target question is classified into the candidate main question with the highest similarity to the target question, the answer of the candidate main question with the highest similarity to the target question is called through the knowledge base of the FAQ system to serve as the answer of the target question, and the target question is added into the corpus of the FAQ system as a class of the candidate main questions with the highest similarity to the target question, so as to enhance the differential recognition processing capability of the FAQ system.
Compared with the prior art, the method for classifying the low-precision problems in the application is characterized in that the data analysis is carried out on the main similar problem set table corresponding to each candidate main problem of the low-precision problems, so that the difference between the plurality of candidate main problems is amplified, the similarity between the target problem and each candidate main problem after the amplification of the difference is obtained, and the target problem is classified into the candidate main problem with the highest similarity, so that the low-precision problems can be accurately classified into the candidate main problem with the highest similarity in the plurality of candidate main problems matched with the high-precision single main problems, the corpus of the corresponding question-answering system can be enriched, and the differential recognition processing capacity of the corpus of the corresponding question-answering system is enhanced.
Referring to fig. 2, fig. 2 is a flowchart illustrating a second embodiment of the classification method for low-precision questions of the present application. The classification method of this embodiment is a refinement of the classification method in fig. 1 and includes the following steps:
In this embodiment, S210, S230 and S240 are the same as S110, S120 and S130 in fig. 1, respectively; refer to fig. 1 and the related description, which are not repeated here. After S210 and before S230, this embodiment further includes the following step:
S220: Acquiring the main similar question set table of each candidate main question.
In this embodiment, after the plurality of candidate main questions matched with the target question are acquired, the similar questions corresponding to each candidate main question may further be obtained from the corpus of the corresponding question-answering system, so as to construct the main similar question set table of each candidate main question.
In a specific embodiment, taking the main question "monkey-call" as an example, its similar questions may include "how to make a monkey-call", "how to make a monkey-call to other people", "can a monkey-call be made to family members", "how to make a monkey-call to a telephone number or a number", and "how to contact customer service by monkey-call"; these questions and "monkey-call" can be grouped into one category and constructed as a main similar question set table.
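As an illustration only, such a table can be represented as a simple mapping from each candidate main question to the similar questions grouped under it; the corpus format assumed below (an iterable of (main-question id, question text) pairs) is hypothetical and not part of the application.

    # Hedged sketch of S220: group the corpus questions under their main questions
    # to form one "main similar question set table" per candidate main question.
    from collections import defaultdict

    def build_similar_question_tables(corpus):
        """corpus: iterable of (main_question_id, question_text) pairs (assumed)."""
        tables = defaultdict(list)
        for main_id, question_text in corpus:
            tables[main_id].append(question_text)
        return tables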
Referring to fig. 3, fig. 3 is a flowchart illustrating a third embodiment of the classification method for low-precision questions of the present application. The classification method of this embodiment is a refinement of the classification method in fig. 2 and includes the following steps:
In this embodiment, S310 and S320 are the same as S210 and S220 in fig. 2, respectively; refer to fig. 2 and the related description, which are not repeated here. After S320, this embodiment further includes the following steps:
S330: Performing word segmentation on the main similar question set table corresponding to each candidate main question and performing word frequency statistics, so as to amplify the differences among the plurality of candidate main questions.
In this embodiment, after the main similar question set table of each candidate main question is obtained, word segmentation is performed on each table, and word frequency statistics is then performed on the word segments of each table to determine how many times each word segment occurs in the word-segment set corresponding to that table; the differences among the plurality of candidate main questions can thus be amplified according to the word frequency statistics.
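A minimal sketch of this word-frequency statistics step is given below; jieba is used only as an example segmenter, and any word library or segmenter could be substituted.

    # Hedged sketch of S330: segment every question in one main similar question
    # set table and count the occurrences of each word segment.
    import jieba
    from collections import Counter

    def word_frequency_table(similar_questions):
        """similar_questions: list of question texts in one table."""
        freq = Counter()
        for question in similar_questions:
            freq.update(jieba.lcut(question))
        return freq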
S340: Acquiring, according to the word frequency statistics, the average hit amount of the word segments of each candidate main question in its main similar question set, so as to classify the target question under the candidate main question with the highest average hit amount.
In this embodiment, after word segmentation is performed on the main similar question set table corresponding to each candidate main question and word frequency statistics is performed, the average hit amount of each candidate main question's word segments in the word-segment set corresponding to its main similar question set is calculated according to the word frequency statistics, so that the target question can be classified under the candidate main question with the highest calculated average hit amount.
In a specific embodiment, the average hit amount of a candidate main question's word segments in its main similar question set may be obtained by dividing the total hit amount of those word segments in the main similar question set by the question length of the candidate main question. Taking the candidate main question numbered 01 as an example, its question length is 3 and its total hit amount is 24; in other words, the candidate main question numbered 01 has 3 word segments, and those 3 word segments occur 24 times in total in the word-segment set corresponding to its main similar question set. Accordingly, the average hit amount of the word segments of candidate main question 01 is 8, and if the average hit amounts finally calculated for the word segments of the other candidate main questions are all smaller than 8, the target question can be classified under candidate main question 01.
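The computation in this example is straightforward; the short sketch below reproduces it, with the word segments and frequency table being purely illustrative values chosen to match the figures above (question length 3, total hit amount 24).

    # Hedged sketch: average hit amount = total hit amount / question length.
    def average_hit_amount(main_segments, freq):
        """main_segments: word segments of the candidate main question;
        freq: word-frequency table of its main similar question set."""
        total_hits = sum(freq.get(seg, 0) for seg in main_segments)
        return total_hits / len(main_segments)

    # Illustrative figures matching the example: 3 segments, 24 total hits -> 8.0
    print(average_hit_amount(["a", "b", "c"], {"a": 10, "b": 8, "c": 6}))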
Referring to fig. 4, fig. 4 is a flowchart illustrating a fourth embodiment of the classification method for low-precision questions of the present application. The classification method of this embodiment is a refinement of the classification method in fig. 3 and includes the following steps:
In this embodiment, S410, S420 and S430 are the same as S310, S320 and S330 in fig. 3, respectively; refer to fig. 3 and the related description, which are not repeated here. After S430, this embodiment further includes the following steps:
S440: Acquiring, according to the word frequency statistics, the average hit amount of the word segments of each candidate main question in its main similar question set and the corresponding coordinates.
In this embodiment, after word segmentation is performed on the main similar question set table corresponding to each candidate main question and word frequency statistics is performed, the average hit amount of each candidate main question's word segments in the word-segment set corresponding to its main similar question set, as well as the coordinate value of that average hit amount on the standard distribution generated from the main similar question set, may further be calculated according to the word frequency statistics.
S450: Performing word segmentation on the target question, so as to calculate the relative coordinate values of the target question's word segments on the main similar question set of each candidate main question.
In this embodiment, the target question is segmented into words so that the coordinate values of the target question's word segments on the standard distribution generated from each candidate main question's main similar question set can be calculated; in other embodiments, these coordinate values may further be normalized, so that the similarity between each candidate main question and the target question can be distinguished according to the normalized coordinate values.
S460: If the average hit amount of the candidate main question with the highest average hit amount is higher than the sum of the second- and third-highest average hit amounts, and the relative coordinate value lies within the preset interval threshold, classifying the target question under the candidate main question with the highest average hit amount.
In this embodiment, when it is determined that the calculated average hit amount of the candidate main question with the highest average hit amount is greater than the sum of the second- and third-highest average hit amounts in the ranking, and the coordinate value of the target question's word segments on the standard distribution generated from that candidate main question's main similar question set lies within the preset interval threshold, the target question may be classified under the candidate main question with the highest average hit amount. The preset interval threshold may be [-2, 2] or [-3, 3], for example, and may be adjusted as needed; the application is not limited in this respect.
In a specific embodiment, suppose the candidate main questions obtained for a target question from the FAQ system are the 5 knowledge-base main questions with sequence numbers [3472, 296, 870, 50, 1546]. After word segmentation of the target question, the average hit amounts of its word segments on the main similar question sets of [3472, 296, 870, 50, 1546] are [5.0, 17.5, 4.5, 2.5, 7.5], and the relative coordinates of the target question's word segments on each candidate main question after normalization are [0.84, 1.33, 0.39, -0.43, 0.63]. The candidate main question with sequence number 296 has the highest average hit amount, and since 17.5 > (7.5 + 5.0) and 1.33 ∈ [-2, 2], it can be determined that the target question has the highest similarity with the candidate main question numbered 296 and that the target question's word segments carry a relatively large weight in the word-segment set corresponding to that candidate main question's main similar question set. The candidate main question numbered 296 can therefore be selected from the 5 options as the best match for the target question, so that the FAQ system can answer the target question with the answer of that candidate main question, which directly improves the training efficiency of the FAQ system.
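The decision rule of S460 can be summarized in a few lines; the sketch below is only an illustration that reuses the figures of this example, and the function name and argument layout are assumptions rather than part of the application.

    # Hedged sketch of the S460 rule: the top average hit amount must exceed the
    # sum of the second and third, and its relative coordinate must lie in [-2, 2].
    def classify(ids, avg_hits, coords, low=-2.0, high=2.0):
        order = sorted(range(len(avg_hits)), key=lambda i: avg_hits[i], reverse=True)
        top, second, third = order[0], order[1], order[2]
        if avg_hits[top] > avg_hits[second] + avg_hits[third] and low <= coords[top] <= high:
            return ids[top]
        return None  # left unclassified, see the sixth embodiment

    ids = [3472, 296, 870, 50, 1546]
    avg_hits = [5.0, 17.5, 4.5, 2.5, 7.5]
    coords = [0.84, 1.33, 0.39, -0.43, 0.63]
    print(classify(ids, avg_hits, coords))  # 296, since 17.5 > 7.5 + 5.0 and 1.33 in [-2, 2]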
Referring to fig. 5, fig. 5 is a flowchart of a fifth embodiment of the classification method for low-precision questions of the present application. The classification method of this embodiment is a refinement of the classification method in fig. 4 and includes the following steps:
In this embodiment, S510, S520, S530, S550 and S560 are the same as S410, S420, S430, S450 and S460 in fig. 4, respectively, and are not described again here for brevity. After S530 and before S550, this embodiment further includes the following step:
S540: Acquiring, according to the word frequency statistics, the average hit amount of the word segments of each candidate main question in its main similar question set, as well as the mean and variance of the average hit amounts of the word segments of every question in that main similar question set, so as to further calculate the coordinates of each candidate main question's word segments in its main similar question set.
In this embodiment, after word segmentation is performed on the main similar question set table corresponding to each candidate main question and word frequency statistics is performed, the average hit amount of each candidate main question's word segments in its main similar question set is further obtained according to the word frequency statistics, together with the mean and variance of the average hit amounts corresponding to the word segments of the candidate main question and of its similar questions in that main similar question set, so as to further calculate the coordinate value of each candidate main question's word segments in its main similar question set.
In a specific embodiment, take as an example a candidate main question with id 8 whose main similar question set has a mean average hit amount of 7.76 and a variance of 2.35, and a question numbered 01 in that set with a question length of 3 and a total hit amount of 24, so that its average hit amount is 8. According to the standardized coordinate formula (X - mean) / std (where X is the average hit amount corresponding to one of the questions in the main similar question set, mean is the mean of the average hit amounts generated from the main similar question set, and std is the spread of that distribution, referred to above as the variance), the coordinate value of question 01 is (8 - 7.76) / 2.35 ≈ 0.102; that is, the coordinate value of the average hit amount of question 01 on the standard distribution generated from the corresponding main similar question set is approximately 0.102. The coordinate values of the other questions in the main similar question set are calculated in turn according to the same method.
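For illustration only, the sketch below reproduces this standardization; note that the worked example divides by the quantity it calls the variance (2.35), whereas a textbook z-score divides by the standard deviation, so the divisor here simply follows the example.

    # Hedged sketch of the relative-coordinate formula (X - mean) / std.
    def relative_coordinate(avg_hit, dist_mean, dist_spread):
        return (avg_hit - dist_mean) / dist_spread

    # Figures from the example: (8 - 7.76) / 2.35 ≈ 0.102
    print(round(relative_coordinate(8.0, 7.76, 2.35), 3))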
Referring to fig. 6, fig. 6 is a flowchart of a sixth embodiment of the classification method for low-precision questions of the present application. The classification method of this embodiment is another refinement of the classification method in fig. 4 and includes the following steps:
In this embodiment, S610, S620, S630, S640 and S650 are the same as S410, S420, S430, S440 and S450 in fig. 4, respectively, and are not described again here for brevity. After S650, this embodiment further includes the following step:
S660: If the average hit amount of the candidate main question with the highest average hit amount is lower than the sum of the second- and third-highest average hit amounts, or the relative coordinate value does not lie within the preset interval threshold, the target question is not classified.
In this embodiment, when it is determined that the average hit amount of the candidate main question with the highest average hit amount is smaller than the sum of the second- and third-highest average hit amounts in the ranking, or that the coordinate value of the target question's word segments on the standard distribution generated from that candidate main question's main similar question set does not lie within the preset interval threshold, no classification is performed on the target question, which waits to be incorporated as a new question in the next round of corpus mining.
Referring to fig. 7, fig. 7 is a flowchart of a seventh embodiment of the classification method for low-precision questions of the present application. The classification method of this embodiment is another refinement of the classification method in fig. 2 and includes the following steps:
In this embodiment, S710 is the same as S210 in fig. 2; refer to fig. 2 and the related description, which are not repeated here. After S710, this embodiment further includes the following steps:
S720: Acquiring the main similar question set table of each candidate main question and the identity identification number of each candidate main question.
In this embodiment, when the main similar question set table of each candidate main question is obtained, the identity identification number of each candidate main question in the corresponding FAQ system is further obtained, for example the id of each candidate main question.
S730: Performing data analysis on the main similar question set table corresponding to each candidate main question, so as to amplify the differences among the plurality of candidate main questions.
This step is the same as S230 in fig. 2; refer to S230 and the related description, which are not repeated here.
S740: Acquiring the similarity between the target question and each candidate main question after the differences are amplified, so as to obtain the identity identification number of the candidate main question with the highest similarity to the target question, and classifying the target question under the corresponding candidate main question according to that identity identification number.
In this embodiment, after the differences among the plurality of candidate main questions are amplified through the data analysis performed on the main similar question set table of each candidate main question, the similarity between the target question and each candidate main question after the differences are amplified may be quantified, the plurality of candidate main questions may be ranked according to the similarity obtained from the data analysis, and the identity identification number of the candidate main question with the highest similarity to the target question may be obtained, so that the corresponding candidate main question can be found in the FAQ system according to that identity identification number and the target question can be classified under it.
Referring to fig. 8, fig. 8 is a flow chart of a specific application scenario of the classification method of the low-precision problem of the present application.
S810: Acquiring user history data of the FAQ system and classifying the user history data.
In this embodiment, because the FAQ system encounters numerous user questions every day, it matches each user question against every question in the knowledge base one by one, and the matching results fall into 3 categories: the 1st is data that matches a single main question with high precision, called class A; the 2nd is data that matches multiple main questions with low precision (for example, five main questions), called class B, for instance low-precision data whose matching result is a list of main-question ids such as [662, 4192, 1592, 3858, 2658]; the 3rd is chit-chat data, called class C.
Class A data has relatively high accuracy, class B data contains many items that are suitable for high-precision matching once refined, and class C data has a low value density. Accordingly, 3 different processing strategies can be adopted for the 3 kinds of data: class A data is highly trusted and used, but not trusted absolutely; class B data contains value that cannot be used directly and can only be used after refinement; class C data is not used at all.
After the user data has been divided into classes A, B and C, all of the class A data is used as similar questions of the main questions in the knowledge base, and the average word frequency of each similar question in the corresponding main similar question set is generated through word segmentation of the similar questions, so that the obtained word frequencies form a mathematical distribution. The accuracy of this distribution improves as the number of similar questions grows, and the class B data can then be judged against the distribution to determine which main question it belongs to.
After the user history data of the FAQ system is acquired, the user history data may further be examined to determine its characteristics and classify it, so as to decide whether to execute S820, S830 or S840 and thereby determine how the user history data is to be processed.
S820: Low-precision multi-match class B data.
In this embodiment, when the acquired history data is determined to be low-precision multi-match class B data, it is known that the class B data contains value that cannot be used directly and must first be refined; it matches a plurality of high-precision single-match main questions in the FAQ system, and it must be further judged and classified under the main question with the highest similarity.
S821: Denoting the multi-match candidate main questions of the low-precision multi-match question as [V, W, X, Y, Z].
In this embodiment, when it is determined that the acquired history data is low-precision multi-match class B data, S822 is then performed, taking as an example that the multi-match candidate main questions of the low-precision multi-match question are [V, W, X, Y, Z].
S822: Calculating the average word frequency and the coordinates of the class B question on each of the distributions of [V, W, X, Y, Z], and sorting according to the hit amount of the average word frequency.
In this embodiment, after the multi-match candidate main questions of the low-precision multi-match question are determined to be [V, W, X, Y, Z], word segmentation is performed on the class B question corresponding to the class B data, word segmentation and word frequency statistics are performed on the main similar question set table corresponding to each main question in [V, W, X, Y, Z], the average word frequency and the coordinates of the class B question on each distribution of [V, W, X, Y, Z] are calculated, and [V, W, X, Y, Z] is sorted according to the calculated average word-frequency hit amounts of the class B question.
S823: Judging whether the highest word-frequency hit amount is higher than the sum of the second and third hit amounts in the word-frequency hit-amount ranking, and whether the coordinate of the highest word-frequency hit amount lies within [-2, 2].
In this embodiment, after the average word frequency and the coordinates of the class B question on each distribution of [V, W, X, Y, Z] have been calculated and [V, W, X, Y, Z] has been sorted according to the calculated average word-frequency hit amounts of the class B question, it is further judged whether the highest word-frequency hit amount is higher than the sum of the second and third hit amounts in the ranking and whether its corresponding coordinate lies within [-2, 2].
If the highest word-frequency hit amount is higher than the sum of the second and third hit amounts in the ranking and its corresponding coordinate lies within [-2, 2], S824 is performed; if the highest word-frequency hit amount is lower than the sum of the second and third hit amounts in the ranking, or its coordinate does not lie within [-2, 2], S825 is performed.
S824: Obtaining the candidate main question id with the highest word-frequency hit amount, classifying the low-precision multi-match question under that candidate main question as its best answer, and ending.
In this embodiment, the id of the candidate main question with the highest word-frequency hit amount is obtained so that the corresponding main question and its answer can be found in the FAQ system according to that id and used as the best answer to the low-precision multi-match class B question; the class B question is classified under that candidate main question, and the process ends.
S825: Leaving the question unprocessed so that it can be incorporated again in the next round of corpus mining, and ending.
S830: High-precision single-match class A data.
In this embodiment, when the acquired history data is determined to be high-precision single-match class A data, it is known that the class A data has relatively high accuracy and can be highly trusted and used, but not trusted absolutely.
S831: Constructing the main similar question set tables of the FAQ system knowledge base.
In this embodiment, the main similar question set tables of the FAQ system knowledge base are constructed from the main questions stored in the FAQ system knowledge base and the similar questions corresponding to those main questions.
S832: Performing word segmentation on the main similar question set tables of the FAQ system knowledge base to build a word library.
In this embodiment, word segmentation is further performed on each main similar question set table of the FAQ system knowledge base, so as to construct a word-segment library for each main similar question set table.
S833: Counting, through the word-library dictionary, the average word frequency of each main similar question after word segmentation.
In this embodiment, the average word frequency of each main question after word segmentation may be counted through the word-library dictionary of the FAQ system.
S834: Forming the average word frequencies of all the questions in each main similar question set into a mathematical distribution whose id is the id of the main question of that set.
In this embodiment, after the average word frequency of each main question has been counted according to the word-library dictionary of the FAQ system, the average word frequencies of all the questions in each main similar question set may be formed into a mathematical distribution whose id is the id of the main question of that set.
S835: Counting the mean and variance of the mathematical distribution of each main similar question set, as well as the relative coordinates of each main question's average word frequency on that distribution.
In this embodiment, the mean and variance of the mathematical distribution of each main question over its corresponding main similar question set, and the relative coordinates of each main question's average word frequency on that distribution, may be counted to form the mathematical model of the class A data.
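Purely as an illustration of S831-S835, the sketch below builds such per-set statistics; jieba stands in for the word library, the population standard deviation is used as the spread that the description calls the variance, and all names and data structures are assumptions rather than definitions from the application.

    # Hedged sketch of the offline class A model: per main similar question set,
    # compute each question's average word frequency, the set's mean and spread,
    # and each question's relative coordinate on the resulting distribution.
    import jieba
    from collections import Counter
    from statistics import mean, pstdev

    def build_class_a_model(tables):
        """tables: {main_question_id: [question texts in its set]} (assumed)."""
        model = {}
        for main_id, questions in tables.items():
            segmented = [jieba.lcut(q) for q in questions]
            freq = Counter(seg for segs in segmented for seg in segs)
            avg_hits = [sum(freq[s] for s in segs) / max(len(segs), 1) for segs in segmented]
            m = mean(avg_hits)
            spread = pstdev(avg_hits) or 1.0  # avoid dividing by a zero spread
            coords = [(h - m) / spread for h in avg_hits]
            model[main_id] = {"mean": m, "spread": spread, "coords": coords}
        return model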
S840: Chit-chat class C data, discarded because of its low value.
In this embodiment, when the acquired history data is determined to be chit-chat class C data, it has low value and is discarded entirely.
Based on the same general inventive concept, the present application further provides an intelligent terminal. Referring to fig. 9, fig. 9 is a schematic structural diagram of an embodiment of the intelligent terminal of the present application. The intelligent terminal 90 includes a memory 910 and a processor 920 coupled to each other.
In this embodiment, the memory 910 stores program data, and the processor 920 is operable to execute the program data to implement the classification method for low-precision questions described in any of the above.
The intelligent terminal 90 may be any intelligent terminal device capable of integrating an FAQ system, such as a mobile phone, a tablet computer, a personal computer, a smart watch, a smart robot or a piece of smart furniture, which is not limited in this application.
Based on the same general inventive concept, the present application further provides a computer-readable storage medium. Referring to fig. 10, fig. 10 is a schematic structural diagram of an embodiment of the computer-readable storage medium of the present application. The computer-readable storage medium 100 stores program data 1010, and the program data 1010 can be executed to implement any of the classification methods for low-precision questions described above.
In one embodiment, the computer readable storage medium 100 may be a memory chip, a hard disk or a removable hard disk in a terminal, or other readable and writable storage tools such as a flash disk, an optical disk, etc., and may also be a server, etc.
In the several embodiments provided in the present application, it should be understood that the disclosed methods and apparatuses may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative: the division into processors or memories is merely a division by logical function, and other divisions are possible in an actual implementation; for example, the functions of multiple processors and memories may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the couplings, direct couplings or connections shown or discussed between components may be indirect couplings or connections through some interfaces, devices or elements, and may be electrical, mechanical or in other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the object of the present embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
If the integrated units are implemented in the form of software functional units and sold or used as stand-alone products, they may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, which includes several instructions for causing a computer device (which may be a personal computer, a server, a network device or the like) or a processor to perform all or some of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk.
In contrast to the prior art, the classification method for low-precision questions of the application performs data analysis on the main similar question set table corresponding to each candidate main question of a low-precision question so as to amplify the differences among the plurality of candidate main questions, then acquires the similarity between the target question and each candidate main question after the differences are amplified, and classifies the target question under the candidate main question with the highest similarity. In this way, a low-precision question can be accurately classified under the candidate main question with the highest similarity among the plurality of high-precision single-match main questions it matches, so that the corpus of the corresponding question-answering system can be enriched and its capability for differentiated recognition and processing can be enhanced.
The foregoing description is only of embodiments of the present application and is not intended to limit the scope of the present application; any equivalent structure or equivalent process made using the contents of the present application, or applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present application.

Claims (6)

1. A classification method for low-precision questions, characterized by comprising the following steps:
acquiring a plurality of candidate main questions matched with a target question;
acquiring the main similar question set table of each candidate main question;
performing data analysis on the main similar question set table corresponding to each candidate main question, so as to amplify the differences among the plurality of candidate main questions;
wherein the step of performing data analysis on the main similar question set table corresponding to each candidate main question so as to amplify the differences among the candidate main questions comprises: performing word segmentation on the main similar question set table corresponding to each candidate main question and performing word frequency statistics, so as to amplify the differences among the plurality of candidate main questions;
acquiring the similarity between the target question and each candidate main question after the differences are amplified, so as to classify the target question under the candidate main question with the highest similarity, which comprises the following steps: acquiring, according to the word frequency statistics, the average hit amount of the word segments of each candidate main question in its main similar question set, as well as the mean and variance of the average hit amounts of the word segments of every question in that main similar question set, so as to further calculate the coordinates of each candidate main question's word segments in its main similar question set; wherein the average hit amount is the quotient of the total hit amount of the candidate main question's word segments in its main similar question set divided by the question length of the candidate main question;
performing word segmentation on the target question, so as to calculate the relative coordinate values of the target question's word segments on the main similar question set of each candidate main question;
and, if the average hit amount of the candidate main question with the highest average hit amount is higher than the sum of the second- and third-highest average hit amounts and the relative coordinate value lies within a preset interval threshold, classifying the target question under the candidate main question with the highest average hit amount.
2. The method of claim 1, further comprising:
and, if the average hit amount of the candidate main question with the highest average hit amount is lower than the sum of the second- and third-highest average hit amounts, or the relative coordinate value does not lie within the preset interval threshold, not classifying the target question.
3. The method of claim 1, wherein
the preset interval threshold is [-2, 2].
4. The method of claim 1, wherein the step of acquiring the main similar question set table of each candidate main question comprises:
acquiring the main similar question set table of each candidate main question and the identity identification number of each candidate main question;
and the step of acquiring the similarity between the target question and each candidate main question after the differences are amplified, so as to classify the target question under the candidate main question with the highest similarity, comprises:
acquiring the similarity between the target question and each candidate main question after the differences are amplified, so as to obtain the identity identification number of the candidate main question with the highest similarity to the target question, and classifying the target question under the corresponding candidate main question according to that identity identification number.
5. An intelligent terminal, characterized in that the intelligent terminal comprises a memory and a processor coupled to each other;
the memory stores program data;
and the processor is configured to execute the program data to implement the classification method for low-precision questions according to any one of claims 1-4.
6. A computer-readable storage medium, characterized in that it stores program data that can be executed to implement the classification method for low-precision questions according to any one of claims 1-4.
CN201911330731.1A 2019-12-20 2019-12-20 Classification method for low-precision problems, intelligent terminal and computer readable storage medium Active CN111177379B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911330731.1A CN111177379B (en) 2019-12-20 2019-12-20 Classification method for low-precision problems, intelligent terminal and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911330731.1A CN111177379B (en) 2019-12-20 2019-12-20 Classification method for low-precision problems, intelligent terminal and computer readable storage medium

Publications (2)

Publication Number Publication Date
CN111177379A CN111177379A (en) 2020-05-19
CN111177379B true CN111177379B (en) 2023-05-23

Family

ID=70654113

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911330731.1A Active CN111177379B (en) 2019-12-20 2019-12-20 Classification method for low-precision problems, intelligent terminal and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN111177379B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102411592A (en) * 2010-09-21 2012-04-11 阿里巴巴集团控股有限公司 Text classification method and device
CN104123272A (en) * 2014-05-21 2014-10-29 山东省科学院情报研究所 Document classification method based on variance
CN105260467A (en) * 2015-10-16 2016-01-20 中国联合网络通信集团有限公司 Short message classification method and apparatus
WO2016112558A1 (en) * 2015-01-15 2016-07-21 深圳市前海安测信息技术有限公司 Question matching method and system in intelligent interaction system
CN106815311A (en) * 2016-12-21 2017-06-09 杭州朗和科技有限公司 A kind of problem matching process and device
CN108595695A (en) * 2018-05-08 2018-09-28 和美(深圳)信息技术股份有限公司 Data processing method, device, computer equipment and storage medium
CN109101479A (en) * 2018-06-07 2018-12-28 苏宁易购集团股份有限公司 A kind of clustering method and device for Chinese sentence

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7603330B2 (en) * 2006-02-01 2009-10-13 Honda Motor Co., Ltd. Meta learning for question classification

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102411592A (en) * 2010-09-21 2012-04-11 阿里巴巴集团控股有限公司 Text classification method and device
CN104123272A (en) * 2014-05-21 2014-10-29 山东省科学院情报研究所 Document classification method based on variance
WO2016112558A1 (en) * 2015-01-15 2016-07-21 深圳市前海安测信息技术有限公司 Question matching method and system in intelligent interaction system
CN105260467A (en) * 2015-10-16 2016-01-20 中国联合网络通信集团有限公司 Short message classification method and apparatus
CN106815311A (en) * 2016-12-21 2017-06-09 杭州朗和科技有限公司 A kind of problem matching process and device
CN108595695A (en) * 2018-05-08 2018-09-28 和美(深圳)信息技术股份有限公司 Data processing method, device, computer equipment and storage medium
CN109101479A (en) * 2018-06-07 2018-12-28 苏宁易购集团股份有限公司 A kind of clustering method and device for Chinese sentence

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
官赛萍 et al. News comment clustering based on WMD distance and affinity propagation. 中文信息学报 (Journal of Chinese Information Processing), 2017, Vol. 31, No. 5, pp. 203-213. *

Also Published As

Publication number Publication date
CN111177379A (en) 2020-05-19

Similar Documents

Publication Publication Date Title
CN105808526B (en) Commodity short text core word extracting method and device
CN107515877B (en) Sensitive subject word set generation method and device
WO2020108608A1 (en) Search result processing method, device, terminal, electronic device, and storage medium
CN105022761B (en) Group searching method and device
US9367603B2 (en) Systems and methods for behavioral segmentation of users in a social data network
US9075865B2 (en) System for estimating age of a user based on mass data
US8510164B2 (en) Method and system for targeted advertising based on topical memes
CN103258025B (en) Generate the method for co-occurrence keyword, the method that association search word is provided and system
US10019492B2 (en) Stop word identification method and apparatus
CN104881458B (en) A kind of mask method and device of Web page subject
CN106777282B (en) The sort method and device of relevant search
CN110826605A (en) Method and device for identifying user in cross-platform manner
CN111597821A (en) Method and device for determining response probability
US20150248471A1 (en) Group forming method, data collecting method and data collecting apparatus
CN104346411B (en) The method and apparatus that multiple contributions are clustered
CN111177379B (en) Classification method for low-precision problems, intelligent terminal and computer readable storage medium
CN115374793B (en) Voice data processing method based on service scene recognition and related device
CN109462635B (en) Information pushing method, computer readable storage medium and server
CN109636378B (en) Account identification method and device and electronic equipment
CN111666749B (en) Hot article identification method
CN114650239A (en) Data brushing amount identification method, storage medium and electronic equipment
KR102078541B1 (en) Issue interest based news value evaluation apparatus and method, storage media storing the same
CN114444514A (en) Semantic matching model training method, semantic matching method and related device
CN113065025A (en) Video duplicate checking method, device, equipment and storage medium
CN111368201A (en) Hot event detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231124

Address after: Room 601, 6th Floor, Building 13, No. 3 Jinghai Fifth Road, Beijing Economic and Technological Development Zone (Tongzhou), Tongzhou District, Beijing, 100176

Patentee after: Beijing Youbixuan Intelligent Robot Co.,Ltd.

Address before: 518000 16th and 22nd Floors, C1 Building, Nanshan Zhiyuan, 1001 Xueyuan Avenue, Nanshan District, Shenzhen City, Guangdong Province

Patentee before: Shenzhen Youbixuan Technology Co.,Ltd.