CN110941719A

CN110941719A - Data classification method, test method, device and storage medium

Info

Publication number: CN110941719A
Application number: CN201911214205.9A
Authority: CN
Inventors: 杨玉; 刘华英; 刘燕; 李凤亭; 梁雨霏; 刘晓刚
Original assignee: Bank of China Ltd
Current assignee: Bank of China Ltd
Priority date: 2019-12-02
Filing date: 2019-12-02
Publication date: 2020-03-31
Anticipated expiration: 2039-12-02
Also published as: CN110941719B

Abstract

The embodiments of this specification provide a data classification method, a test method, a device, and a storage medium. The method comprises: acquiring a target data set, the target data set comprising a plurality of text data; calculating the frequency of occurrence of a plurality of preset keywords in each text data; and determining the category of each text data in the target data set according to the frequency. The data classification method provided by the embodiments of this specification classifies data according to the frequency with which a plurality of preset keywords occur in the data, which improves the accuracy of data classification, allows a large amount of data to be classified automatically, and improves the efficiency of data classification.

Description

Data classification method, test method, device and storage medium

Technical Field

The embodiment of the specification relates to the technical field of computers, in particular to a data classification method, a test method, a device and a storage medium.

Background

An intelligent customer service system collects massive voice and text data from the customer service center to capture the voice of the customer, uses big data analysis to sort and refine valuable information, and pushes that information to the business departments. Products and services are thereby continuously improved, and the customer service center is transformed from a service department into a decision-support department and from an after-sales service link into a whole-process service. The big data analysis is performed by an analysis system. Before the big data analysis system is put into use, the test line of the Bank of China software center needs to perform functional testing on it. Verifying whether functions of the analysis system, such as clustering and classification, business model establishment, high-frequency word extraction, and hot-word frequency ratio, work normally requires a large amount of supporting data, and the larger the data volume, the more thorough the test of the big data analysis system.

In existing methods for testing a big data analysis system, the required test data, that is, the sample data, is mostly obtained by manually sampling the acquired data source, and the sample data is then used to verify the relevant functions. Obtaining test data by manual sampling requires every piece of text data to be classified and screened, so when the data volume is large the process is time-consuming, inefficient, and costly. In addition, the data in the test data is not classified accurately enough during testing, which affects the accuracy and efficiency of the test result.

Disclosure of Invention

An object of the embodiments of the present specification is to provide a data classification method, a test method, an apparatus, and a storage medium, so as to improve accuracy and efficiency of data classification and accuracy and efficiency of system test.

In order to solve the above problems, embodiments of the present specification provide a data classification method, a test method, an apparatus, and a storage medium.

A method of data classification, the method comprising: acquiring a target data set; the target dataset comprises a plurality of text data; calculating the occurrence frequency of a plurality of preset keywords in each text data; and determining the category of each text data in the target data set according to the frequency.

A method of testing, the method comprising: acquiring a first test data set; the first test data set comprises at least one category of text data; inputting the first test data set into a system to be tested to obtain a first test result; and under the condition that the first test result does not completely accord with a first expected result, if the part of the first test result which does not accord with the first expected result accords with a second expected result, determining that the test is passed.

A computer readable storage medium having computer program instructions stored thereon that when executed implement: acquiring a target data set; the target dataset comprises a plurality of text data; acquiring a plurality of preset keywords, and calculating the occurrence frequency of each preset keyword in each text data; and determining the category of each text data in the target data set according to the frequency.

A computer readable storage medium having computer program instructions stored thereon that when executed implement: acquiring a first test data set; the first test data set comprises at least one category of text data; inputting the first test data set into a system to be tested to obtain a first test result; and under the condition that the first test result does not completely accord with a first expected result, if the part of the first test result which does not accord with the first expected result accords with a second expected result, determining that the test is passed.

As can be seen from the technical solutions provided in the embodiments of the present specification, the embodiments of the present specification can acquire a target data set; the target dataset comprises a plurality of text data; acquiring a plurality of preset keywords, and calculating the occurrence frequency of each preset keyword in each text data; and determining the category of each text data in the target data set according to the frequency. The data classification method provided by the embodiment of the specification can classify data according to the frequency of occurrence of a plurality of preset keywords in the data, improves the accuracy of data classification, further can automatically classify a large amount of data, and simultaneously improves the efficiency of data classification.

Embodiments of the present description may obtain a first set of test data; the first test data set comprises at least one category of text data; inputting the first test data set into a system to be tested to obtain a first test result; and under the condition that the first test result does not completely accord with a first expected result, if the part of the first test result which does not accord with the first expected result accords with a second expected result, determining that the test is passed. The test method provided by the embodiment of the specification adopts a test method combining two tests for testing the system, so that the test efficiency can be improved, and the test accuracy can be improved.

Drawings

In order to more clearly illustrate the embodiments of the present specification or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments described in the specification, and other drawings can be obtained by those skilled in the art without creative efforts.

FIG. 1 is a flow chart of a data classification method according to an embodiment of the present disclosure;

FIG. 2 is a diagram illustrating an example scenario in an embodiment of the present disclosure;

FIG. 3 is a flow chart of a testing method according to an embodiment of the present disclosure;

FIG. 4 is a diagram illustrating an example of a scenario in an embodiment of the present disclosure;

FIG. 5 is a functional block diagram of a data classification apparatus according to an embodiment of the present disclosure;

FIG. 6 is a functional block diagram of a testing apparatus according to an embodiment of the present disclosure.

Detailed Description

The technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are only a part of the embodiments of the present disclosure, and not all of the embodiments. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present specification without any creative effort shall fall within the protection scope of the present specification.

The existing text classification method can generally utilize an artificial intelligence method to establish a classification model, and data is classified through the classification model. The classification model is generally established based on Support Vector Machine (SVM), naive bayes classifier, K-nearest neighbor (KNN), decision tree, random forest and other algorithms. However, the classification of data by establishing a classification model through an artificial intelligence method usually has uncertainty, i.e., a certain false alarm rate, so that the accuracy of data classification is not sufficient. Therefore, a text classification method with higher accuracy is needed to classify the text.

In this embodiment, the data classification method may be executed by an electronic device with a logical operation function. The electronic device may be a server or a client, and the client may be a desktop computer, a tablet computer, a notebook computer, a smart phone, a workstation, or the like. Of course, the client is not limited to a physical electronic device; it may also be software running on such a device, for example program software formed through program development.

Fig. 1 is a flowchart of a data classification method according to an embodiment of the present disclosure. As shown in fig. 1, the data classification method may include the following steps.

S110: acquiring a target data set; the target data set includes a plurality of text data.

In some embodiments, the target data set may include a plurality of text data, such as text data in XML, HTML, or other formats. The target data set may also include audio data, for example in AIFF or MP3 format.

In some embodiments, the target data set may be obtained from a data source, for example by downloading it from a designated database. For example, in a banking system, audio data and/or text data may be obtained by using a database in an intelligent customer service platform as the data source. The audio data may be recordings of calls between a customer service agent and a customer; the text data may be text-type data such as records of agent interaction with customers, collected customer opinions or suggestions, and the like.

In some embodiments, the target data set obtained from the banking system typically relates to security data or some business sensitive data of the customer. In order to protect sensitive private data, data in the target data set can be used after being modified, for example, data desensitization processing is performed on personal information such as identity card numbers, mobile phone numbers, card numbers, customer numbers and the like in the target data set.
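The desensitization step could be sketched as follows. This is only an illustration: the specification does not say how masking is performed, so the regular expressions, the assumed number formats (18-character identity card number, 11-digit mobile number starting with 1, 16 to 19 digit card number) and the names `mask` and `desensitize` are all hypothetical.

```python
import re

# Hypothetical patterns for illustration; real formats depend on the data source.
ID_CARD = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")  # assumed 18-character ID card number
CARD_NO = re.compile(r"(?<!\d)\d{16,19}(?!\d)")     # assumed 16 to 19 digit card number
MOBILE = re.compile(r"(?<!\d)1\d{10}(?!\d)")        # assumed 11-digit mobile number


def mask(match: re.Match) -> str:
    """Keep the first three and last two characters and mask the rest."""
    s = match.group(0)
    return s[:3] + "*" * (len(s) - 5) + s[-2:]


def desensitize(text: str) -> str:
    """Mask personal identifiers in one piece of text data."""
    # The ID card pattern runs first, because an all-digit ID number
    # would otherwise also match the card number pattern.
    for pattern in (ID_CARD, CARD_NO, MOBILE):
        text = pattern.sub(mask, text)
    return text
```

Masking rather than deleting the identifiers keeps the surrounding sentence intact, so keyword frequencies computed later are not affected.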

In some embodiments, if audio data is included in the target data set, the audio data may also be converted to text data. Specifically, the content expressed in the audio data may be output in the form of text through a speech recognition technique.

In some embodiments, after the target data set is acquired, it may be determined whether the data in the target data set is text data or audio data. If the data in the target data set is text data, a set of text data $\{T_1, T_2, T_3, \ldots, T_m\}$ can be obtained, where $T_m$ ($m = 1, 2, 3, \ldots$) denotes text data. If the data in the target data set includes a set of audio data $\{S_1, S_2, S_3, \ldots, S_n\}$, the audio data is converted into corresponding text data to obtain a set of converted text data $\{T_{S_1}, T_{S_2}, T_{S_3}, \ldots, T_{S_n}\}$, where $S_n$ ($n = 1, 2, 3, \ldots$) represents audio data and $T_{S_n}$ represents the text data corresponding to audio data $S_n$. Further, a target data set containing only text data may be obtained: $\{T_1, T_{S_1}, T_2, T_{S_2}, T_3, T_{S_3}, \ldots, T_m, T_{S_n}\}$.
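As a minimal sketch of this step, the text-only target data set can be assembled by keeping the text data as-is and appending the transcript of each audio item. The function name `build_text_dataset` is hypothetical, and `transcribe` stands in for whatever speech recognition technique is used; the specification does not name one, so it is passed in as a parameter.

```python
from typing import Callable, Iterable, List


def build_text_dataset(
    texts: Iterable[str],
    audio_items: Iterable[bytes],
    transcribe: Callable[[bytes], str],
) -> List[str]:
    """Return a data set that contains only text data.

    `texts` plays the role of {T_1, ..., T_m}, `audio_items` the role of
    {S_1, ..., S_n}; each transcript corresponds to T_{S_n}.
    """
    dataset = list(texts)
    # Convert every audio item to text with the supplied recognizer.
    dataset.extend(transcribe(audio) for audio in audio_items)
    return dataset
```

The result is treated as a set, so the order in which original and transcribed texts appear does not matter for the later classification.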

S120: the frequency of occurrence of a plurality of preset keywords in each text data is calculated.

In some embodiments, the preset keywords may be used to characterize categories of text data in the target dataset. For example, the category name of the text data may include the preset keyword, or the category name of the text data may also correspond to the preset keyword. Specifically, if the category names of the text data in the target data set are determined as a financing category, a loan category, a deposit category, and a transfer category, the plurality of preset keywords may include "financing", "loan", "deposit", "transfer", that is, in this case, the keywords may be used as the category names of the text data. If the category names of the text data in the target data set are determined as a category, b category, c category and d category, the category a may be determined as a category corresponding to a preset keyword "financing", the category b may be determined as a category corresponding to a preset keyword "loan", the category c may be determined as a category corresponding to a preset keyword "deposit", and the category d may be determined as a category corresponding to a preset keyword "transfer".

In this embodiment, the server may obtain the plurality of preset keywords in any manner. For example, a user may input a plurality of preset keywords, and the server may receive the keywords; for another example, other electronic devices except the server may send a plurality of preset keywords to the server, and the server may receive the keywords.

In some embodiments, taking preset keywords that include "financing", "loan", and "deposit" as an example, a keyword set $\{F_1, F_2, F_3, \ldots, F_n\}$ may be obtained after the preset keywords are acquired, where $F_1$ is "financing", $F_2$ is "loan", $F_3$ is "deposit", and $n = 1, 2, 3, \ldots$.

In this embodiment, the server may recognize the text content of each text data and thereby calculate the frequency of occurrence of each preset keyword in each text data. In particular, the server may read the target data set $\{T_1, T_{S_1}, T_2, T_{S_2}, T_3, T_{S_3}, \ldots, T_m, T_{S_n}\}$ containing the text data and calculate, for each text data, the frequency of occurrence of each preset keyword, obtaining a frequency matrix $\{f_1, f_2, f_3, \ldots, f_n\}$. Each text corresponds to one frequency matrix, in which $f_n$ is the frequency of occurrence of keyword $F_n$, $n = 1, 2, 3, \ldots$.
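The frequency calculation can be illustrated with a short sketch. It assumes that "frequency of occurrence" means the number of non-overlapping occurrences of the keyword as a substring, which is what Python's `str.count` returns; the specification does not define the counting rule more precisely, and the function name `keyword_frequencies` is hypothetical.

```python
from typing import Dict, List


def keyword_frequencies(text: str, keywords: List[str]) -> Dict[str, int]:
    """Return the frequency matrix of one text as a keyword -> count mapping."""
    # str.count returns the number of non-overlapping occurrences.
    return {keyword: text.count(keyword) for keyword in keywords}


freqs = keyword_frequencies(
    "I want a loan; the loan rate is high",
    ["financing", "loan", "deposit"],
)
print(freqs)  # {'financing': 0, 'loan': 2, 'deposit': 0}
```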

S130: and determining the category of each text data in the target data set according to the frequency.

In the embodiments of this specification, the category of each text data in the target data set may be determined according to the frequency of occurrence of each preset keyword in that text data. Specifically, the category of the text data may be determined as pending, single-category, or multi-category. Pending means that the text data does not belong to any category corresponding to a preset keyword; single-category means that the text data belongs to the category corresponding to exactly one preset keyword; multi-category means that the text data belongs simultaneously to the categories corresponding to at least two preset keywords.

In an embodiment of the present specification, if the frequency of occurrence of every preset keyword in the text data is less than a preset frequency, the text data does not belong to any category corresponding to a preset keyword; if the frequency of occurrence of one preset keyword in the text data is greater than or equal to the preset frequency and the frequency of occurrence of every other preset keyword is less than the preset frequency, the text data is single-category and belongs to the category corresponding to the keyword whose frequency of occurrence is greater than or equal to the preset frequency; and if the frequency of occurrence of at least two preset keywords in the text data is greater than or equal to the preset frequency, the text data is multi-category.

In some embodiments, the preset frequency may be zero. Specifically, if the frequency of occurrence of every preset keyword in the text data is zero, the text data clearly does not belong to any category corresponding to a preset keyword; if the frequency of occurrence of one preset keyword in the text data is greater than zero and the frequency of occurrence of every other preset keyword is equal to zero, the text data is single-category and belongs to the category corresponding to the keyword whose frequency of occurrence is greater than zero; and if the frequency of occurrence of at least two preset keywords in the text data is greater than zero, the text data is multi-category. In the embodiments of this specification, in order to determine the category of the text data more accurately, the preset frequency may also be any value greater than zero, for example 1, 3, or 10, which is not limited in this specification.
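The decision rule can be sketched as below, assuming the preset keywords double as category names. Because the text uses "greater than zero" when the preset frequency is zero and "greater than or equal to" otherwise, the sketch treats a keyword as present only if its frequency is both positive and at least the preset frequency; that reconciliation, and the names `classify`, `PENDING` and `MULTI`, are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

PENDING = "pending"
MULTI = "multi-category"


def classify(freqs: Dict[str, int], preset_frequency: int = 0) -> Tuple[str, List[str]]:
    """Return (category, matched keywords) for one text's frequency matrix."""
    # A keyword counts as present if it occurs at all and reaches the threshold.
    hits = [kw for kw, f in freqs.items() if f > 0 and f >= preset_frequency]
    if not hits:
        return PENDING, []   # no keyword reaches the preset frequency
    if len(hits) == 1:
        return hits[0], hits  # single category, named after its keyword
    return MULTI, hits        # at least two keywords are present


print(classify({"financing": 0, "loan": 0, "deposit": 3}))
# ('deposit', ['deposit'])
print(classify({"financing": 2, "loan": 1, "deposit": 0}, preset_frequency=2))
# ('financing', ['financing'])
```

Returning the matched keywords alongside the category mirrors the recording of the keywords whose frequency is non-zero.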

The following describes, for the case where the preset frequency is zero, how the category of each text data in the target data set is determined according to the frequency. In some embodiments, for each text data, if the frequency of occurrence of every preset keyword is zero, the text data does not belong to any of the categories corresponding to the preset keywords, and the category of the text data may be determined as the pending category. Specifically, the server may determine whether the frequency matrix of the text data is equal to zero; if it is, that is, $\{f_1 = 0, f_2 = 0, f_3 = 0, \ldots, f_n = 0\}$, the category of the text data may be determined as the pending category.

In some embodiments, for each text data, if the frequencies of occurrence of the preset keywords are not all zero, that is, one or more of the preset keywords occur in the text data, it may be further determined whether the text data belongs to a single category. If exactly one of the preset keywords occurs in the text data and no other preset keyword occurs, the text data is judged to be single-category; otherwise it is not single-category, and text data that is not single-category can be determined to be multi-category.

In some embodiments, the server may determine whether the text data belongs to a single category from the frequency matrix of the text data. In particular, if there is one and only one non-zero value in the frequency matrix, for example $\{f_1 = 0, f_2 = 0, f_3 = r, \ldots, f_n = 0\}$ where $r \in \{1, 2, 3, \ldots\}$, the text data can be determined to belong to a single category. If the non-zero value in the frequency matrix is not unique, for example $\{f_1 = 0, f_2 = 0, f_3 = 0, f_4 = k, f_5 = j, \ldots, f_n = 0\}$ where $k$ and $j$ are both non-zero, it may be determined that the text data does not belong to a single category.

In some embodiments, if the text data is single-category, that is, only one preset keyword has a non-zero frequency and all other preset keywords have zero frequency, the keyword with the non-zero frequency may be recorded, and the category of the text data may be determined as the category corresponding to that keyword. For example, for the frequency matrix $\{f_1 = 0, f_2 = 0, f_3 = r, \ldots, f_n = 0\}$ where $r \in \{1, 2, 3, \ldots\}$, the category of the text data is determined as the category corresponding to keyword $F_3$, which corresponds to $f_3$; if the preset keyword is used as the category name of the text data, the category of the text data is determined as the deposit category.

In some embodiments, if the category of the text data is not unique, that is, at least two preset keywords have a non-zero frequency of occurrence, the keywords with non-zero frequency may be recorded and the category of the text data may be determined as multi-category. For example, for the frequency matrix $\{f_1 = 0, f_2 = 0, f_3 = 0, f_4 = k, f_5 = j, \ldots, f_n = 0\}$ where $k$ and $j$ are both non-zero, the number of non-zero frequencies can be counted to obtain the number of different preset keywords that occur in the text data, the keyword $F_4$ corresponding to frequency $f_4$ and the keyword $F_5$ corresponding to frequency $f_5$ are recorded, and the category of the text data is determined as multi-category.

Embodiments of the present description may obtain a target dataset; the target dataset comprises a plurality of text data; acquiring a plurality of preset keywords, and calculating the occurrence frequency of each preset keyword in each text data; and determining the category of each text data in the target data set according to the frequency. The data classification method provided by the embodiment of the specification can classify data according to the frequency of occurrence of a plurality of preset keywords in the data, improves the accuracy of data classification, further can automatically classify a large amount of data, and simultaneously improves the efficiency of data classification.

An embodiment of this specification provides a scenario example; FIG. 2 is a schematic diagram of this scenario example.

In this scenario example, a user may input a preset keyword, and the server may receive the preset keyword, where the preset keyword is used as a category name of the text data.

Specifically, in this scenario example, the preset keywords may include "financing", "loan" and "deposit", respectively. The server can determine the category names of the text data as financing, loan and deposit according to preset keywords, and of course, the category names of the text data also include pending and multiple categories.

In this scenario example, the server may create a folder corresponding to a category name of the text data, for example, designate the folder corresponding to the financing category as financing, designate the folder corresponding to the loan category as loan, designate the folder corresponding to the deposit category as deposit, designate the folder corresponding to the pending category as pending, and designate the folder corresponding to the multiple categories as multiple categories.

In the present scenario example, the server may obtain the preset keyword set, that is, obtain a set of text data category names $\{F_0, F_1, F_2, F_3, \ldots, F_{n+1}\}$, where $F_0$ is "pending", $F_1$ is "financing", $F_2$ is "loan", $F_3$ is "deposit", …, and $F_{n+1}$ is "multi-category". The server may also create, under a preset storage path, the folders "pending", "financing", "loan", "deposit", …, "multi-category" corresponding to the category names of the text data.

In this scenario example, the frequency of occurrence of each category name in each text may be calculated, resulting in a frequency matrix. Specifically, the server may read the text data $\{T_1, T_{S_1}, T_2, T_{S_2}, T_3, T_{S_3}, \ldots, T_m, T_{S_n}\}$ and calculate, for each text, the frequency with which each user-defined category name in $\{F_1, F_2, F_3, \ldots, F_n\}$ occurs, obtaining the frequency matrix $\{f_1, f_2, f_3, \ldots, f_n\}$. Each text corresponds to one frequency matrix.

In this scenario example, the preset frequency may be zero, and after the frequency matrix is obtained, it may be determined whether the frequency matrix is zero. Specifically, the server may screen for text data whose frequencies are not all zero. If the frequency matrix corresponding to the text data is zero, that is, $\{f_1 = 0, f_2 = 0, f_3 = 0, \ldots, f_n = 0\}$, the text data can be put under the "pending" folder; if the frequency matrix corresponding to the text data is not zero, the next step is carried out.

In this scenario example, if the frequency matrix corresponding to the text data is not zero, it is determined whether the text data belongs to a single category. Specifically, if the text belongs to a single category, there is only one non-zero value in the frequency matrix corresponding to the text data, for example $\{f_1 = 0, f_2 = 0, f_3 = r, \ldots, f_n = 0\}$ where $r \in \{1, 2, 3, \ldots\}$; the text data then belongs to category $F_3$, is put under the corresponding "deposit" folder, and the corresponding frequency value $f_3 = r$ is recorded. If the text does not belong to a single category, that is, the non-zero value in the frequency matrix is not unique, for example $\{f_1 = 0, f_2 = 0, f_3 = 0, f_4 = k, f_5 = j, \ldots, f_n = 0\}$ where $k$ and $j$ are both non-zero, the number $\lambda$ of categories to which the text data belongs is calculated ($\lambda = 2$ in this scenario example), the names of all of those categories are recorded (here the category names corresponding to $F_4$ and $F_5$), and the text data is put under the "multi-category" folder.

In this scenario example, the server may also determine whether all of the text data $\{T_1, T_{S_1}, T_2, T_{S_2}, T_3, T_{S_3}, \ldots, T_m, T_{S_n}\}$ has been traversed. If all of the text data has been traversed, the process ends; if not, the classification process continues with the remaining text data.
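The whole scenario, with a preset frequency of zero and keywords used as category names, might be sketched as follows. The function name `sort_into_folders`, the file-name keys and the choice to copy each text into its folder as a file are illustrative assumptions; the scenario only states that the text data is put under the corresponding folder.

```python
from pathlib import Path
from typing import Dict, List

PENDING = "pending"
MULTI = "multi-category"


def sort_into_folders(texts: Dict[str, str], keywords: List[str], root: Path) -> Dict[str, str]:
    """Classify every text and write it into the folder named after its category.

    `texts` maps a file name to its content. Returns file name -> category.
    """
    # Create one folder per category name under the preset storage path.
    for name in [PENDING, *keywords, MULTI]:
        (root / name).mkdir(parents=True, exist_ok=True)

    placed: Dict[str, str] = {}
    for file_name, text in texts.items():  # traverse all text data
        hits = [kw for kw in keywords if text.count(kw) > 0]
        if not hits:
            category = PENDING        # frequency matrix is all zero
        elif len(hits) == 1:
            category = hits[0]        # exactly one non-zero frequency
        else:
            category = MULTI          # non-zero value is not unique
        (root / category / file_name).write_text(text, encoding="utf-8")
        placed[file_name] = category
    return placed
```

A call such as `sort_into_folders({"a.txt": "deposit rates"}, ["financing", "loan", "deposit"], Path("classified"))` would write `a.txt` under `classified/deposit`.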

Fig. 3 is a flowchart of a testing method according to an embodiment of the present disclosure. As shown in fig. 3, the test method may include the following steps.

S310: acquiring a first test data set; the first test data set comprises at least one category of text data;

Big data analysis refers to the analysis of data on a huge scale. Big data can be summarized by the five Vs: Volume (large data volume), Velocity (high speed), Variety (many types), Value, and Veracity (authenticity). Big data analysis can be performed by an analysis system, and before the big data analysis system is put into use it can be functionally tested, for example by testing its clustering and classification functions, business model establishment function, high-frequency word extraction function, hot-word frequency ratio function, and the like. Verifying whether these functions of the analysis system work normally requires a large amount of supporting test data, and the larger the data volume, the more thorough the test of the big data analysis system.

In the embodiments of this specification, testing different functions of the analysis system requires different types of test data. For example, testing the clustering and classification functions of the analysis system requires classified data as test data, while testing the high-frequency word extraction function requires data containing certain high-frequency words as test data.

In some embodiments, the first test data set may be employed as the test data for testing the functions of the analysis system. The first test data set may include at least one category of text data.

In some embodiments, the first test data set may include classified text data such that an expected result of the test may be calculated based on the classification of the text data. For example, to test the function of extracting high-frequency words of the analysis system, text data containing different keywords may be divided into different categories, and the text data may be used as a first test data set, and an expected result of the test may be determined according to the categories of the text data.

In some embodiments, the first test data set may be obtained according to the following steps.

S311: acquiring a target data set; the target data set includes a plurality of text data.

In some embodiments, the target data set may include text data, such as text data in XML, HTML, or other formats. The target data set may also include audio data, for example in AIFF or MP3 format.

In some embodiments, the target data set may be obtained from a data source, for example by downloading it from a designated database. Specifically, in a banking system, voice data and/or text data may be obtained by using a database in the intelligent customer service platform as the data source. The voice data may be recordings of calls between a customer service agent and a customer; the text data may be text-type data such as records of agent interaction with customers, collected customer opinions or suggestions, and the like.


S312: the frequency of occurrence of a plurality of preset keywords in each text data is calculated.

In some embodiments, taking preset keywords that include "financing", "loan", and "deposit" as an example, a keyword set $\{F_1, F_2, F_3, \ldots, F_n\}$ may be obtained after the preset keywords are acquired, where $F_1$ is "financing", $F_2$ is "loan", $F_3$ is "deposit", and $n = 1, 2, 3, \ldots$.

S313: and determining the category of each text data in the target data set according to the frequency.

The following describes, for the case where the preset frequency is zero, how the category of each text data in the target data set is determined according to the frequency.

In some embodiments, for each text data, if the frequency of occurrence of every preset keyword is zero, the text data does not belong to any of the categories corresponding to the preset keywords, and the category of the text data may be determined as the pending category. Specifically, the server may determine whether the frequency matrix of the text data is equal to zero; if it is, that is, $\{f_1 = 0, f_2 = 0, f_3 = 0, \ldots, f_n = 0\}$, the category of the text data may be determined as the pending category.

In some embodiments, the server may determine whether the text data belongs to a single category from the frequency matrix of the text data. In particular, if there is one and only one non-zero value in the frequency matrix, for example $\{f_1 = 0, f_2 = 0, f_3 = r, \ldots, f_n = 0\}$ where $r \in \{1, 2, 3, \ldots\}$, the text data can be determined to belong to a single category. If the non-zero value in the frequency matrix is not unique, for example $\{f_1 = 0, f_2 = 0, f_3 = 0, f_4 = k, f_5 = j, \ldots, f_n = 0\}$ where $k$ and $j$ are both non-zero, it may be determined that the text data does not belong to a single category.

S314: and acquiring text data of at least one category in the classified target data set as the first test data set.

In this illustrative embodiment, at least one category of text data in the classified target data set may be used as the first test data set according to the function requirement of the test analysis system.
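Selecting the first test data set from the classified data can be sketched as follows. The mapping from category name to texts and the function name `select_test_set` are assumptions made for illustration; which categories are requested depends on the function of the analysis system under test.

```python
from typing import Dict, Iterable, List


def select_test_set(classified: Dict[str, List[str]], wanted: Iterable[str]) -> List[str]:
    """Collect the text data of the requested categories as the first test data set."""
    selected: List[str] = []
    for category in wanted:
        # Categories with no classified text simply contribute nothing.
        selected.extend(classified.get(category, []))
    return selected


classified = {
    "financing": ["financing product inquiry"],
    "loan": ["loan rate question", "loan repayment"],
    "pending": ["greeting only"],
}
first_test_set = select_test_set(classified, ["financing", "loan"])
print(len(first_test_set))  # 3
```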

S320: and inputting the first test data set into a system to be tested to obtain a first test result.

In an embodiment of the present specification, the first test result is the output of the system to be tested after the first test data set is input into it.

S330: and under the condition that the first test result does not completely accord with a first expected result, if the part of the first test result which does not accord with the first expected result accords with a second expected result, determining that the test is passed.

In embodiments of the present description, the first expected result may be determined from the first test data set. Specifically, taking a test of the system's high-frequency word extraction function as an example, the first test data set may include text data in the financing, loan, deposit, and transfer categories. The text data of each category includes the corresponding high-frequency word; for example, the text data of the financing category includes the high-frequency word "financing", and the text data of the loan category includes the high-frequency word "loan". If the system functions well, then after the first test data set is input into the system, the system outputs the extracted high-frequency words "financing", "loan", "deposit", and "transfer". Thus, when the first test data set is used, the first expected result can be determined to be that the system extracts the high-frequency words "financing", "loan", "deposit", and "transfer". Of course, if the first test data set includes text data of other categories, the corresponding first expected result may likewise be determined from the categories of the text data.

In embodiments of the present description, the first test result and the first expected result may be compared to determine whether the test passed.

In some embodiments, the test may be determined to pass if the first test result completely matches the first expected result. For example, if the first test result is that the system extracts the high-frequency words "financing", "loan", "deposit", and "transfer", and the first expected result is the same set of high-frequency words, then the first test result completely matches the first expected result, the system's function of extracting high-frequency words is determined to be good, and the test is determined to pass.

In some embodiments, the test may be determined to fail if the first test result does not completely match the first expected result. For example, if the first test result is that the system does not extract any high-frequency word, or the high-frequency words extracted by the system are completely different from the first expected result, it may be determined that the system's function of extracting high-frequency words has a problem and that the test fails.

In some embodiments, in the case where the first test result does not completely match the first expected result, the test is determined to pass if the portion of the first test result that does not match the first expected result matches a second expected result. The second expected result may be determined based on the portion of the first test result that does not match the first expected result. Specifically, suppose the first test result is that the system extracts the high-frequency word "debit card" in addition to "financing", "loan", "deposit", and "transfer", while the first expected result is that the system extracts the high-frequency words "financing", "loan", "deposit", and "transfer". The first test result then does not completely match the first expected result. In this case, it cannot be concluded that the system's function of extracting high-frequency words has a problem: the text data in the first test data set may indeed contain the high-frequency word "debit card", which is absent from the first expected result only because it was not considered when the text data in the first test data set were categorized. If that is so and the system's function of extracting high-frequency words is good, the text data of the categories in the first test data set contain "debit card" as a high-frequency word. Thus, the second expected result may be determined to be that the keyword "debit card" is a high-frequency word in the first test data set, and whether the system's function of extracting high-frequency words is good can be determined by checking whether the keyword "debit card" is a high-frequency word in the first test data set.
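The three outcomes described above (complete match, mismatch explained by a second expected result, and unexplained mismatch) can be sketched as a small decision function. This is only an illustrative sketch: the name `judge_test`, the use of sets of words as results, and the callback `meets_second_expected` are assumptions made for the example and do not appear in the specification. The sketch also assumes that a result which is missing an expected word fails outright, a case the specification does not spell out.

```python
from typing import Callable, Iterable, Set


def judge_test(
    first_result: Iterable[str],
    first_expected: Iterable[str],
    meets_second_expected: Callable[[Set[str]], bool],
) -> bool:
    """Return True when the test passes under the two-stage rule."""
    result, expected = set(first_result), set(first_expected)
    if result == expected:
        # The first test result completely matches the first expected result.
        return True
    if expected - result:
        # Assumption: an expected word that was not extracted is a failure.
        return False
    # Part of the first test result that does not match the first expected result.
    unmatched = result - expected
    # The test passes only if that part matches the second expected result.
    return meets_second_expected(unmatched)
```

For instance, calling `judge_test` with a result containing the extra word "debit card" passes only when the callback confirms that "debit card" really is a high-frequency word in the test data set.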

In some embodiments, determining whether a portion of the first test result that does not correspond to the first expected result corresponds to a second expected result may be based on the following.

S331: obtaining at least one keyword according to the part of the first test result that does not match the first expected result.

In some embodiments, take as an example a test of the system's function of extracting high-frequency words in which the first test data set includes text data of the categories financing, loan, deposit, and transfer. If the first test result is that the system extracts the high-frequency words "financing", "loan", "deposit", "transfer", and "debit card", and the first expected result is that the system extracts the high-frequency words "financing", "loan", "deposit", and "transfer", the part of the first test result that does not match the first expected result can be determined to be the high-frequency word "debit card". Further, the keyword may be determined to be "debit card" based on the high-frequency word "debit card". Of course, testing the function of extracting high-frequency words with a first test data set of the categories financing, loan, deposit, and transfer is only one specific example; in the embodiments of the present specification, other functions of the system may be tested, and test data sets containing text data of other categories may be used.

S332: calculating the frequency of occurrence of each keyword in each text data in the first test data set.

In some embodiments, the keywords may be used as category names of the text data, or each preset keyword may correspond to a category of the text data. Specifically, for example, if the preset keywords include "financing", "loan", "deposit", and "transfer", the text data in the target data set may be divided into a financing category, a loan category, a deposit category, and a transfer category.

In some embodiments, taking the keywords "financing", "loan", and "deposit" as an example, after the plurality of preset keywords are obtained, a keyword set $\{F_1, F_2, F_3, \ldots, F_n\}$ may be obtained, where $F_1$ = "financing", $F_2$ = "loan", $F_3$ = "deposit", and $n = 1, 2, 3$.
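As a rough illustration of step S332, the frequency of each keyword in each text can be collected into one row of counts per text. The function name `keyword_frequencies`, the sample texts, and the choice of plain substring counting with `str.count` are assumptions for this sketch; the specification does not say how occurrences are counted.

```python
from typing import List, Sequence


def keyword_frequencies(texts: Sequence[str], keywords: Sequence[str]) -> List[List[int]]:
    """Return one row per text; row[i] is how often keywords[i] occurs in that text."""
    # str.count counts non-overlapping occurrences of the keyword as a substring.
    return [[text.count(keyword) for keyword in keywords] for text in texts]


keywords = ["financing", "loan", "deposit"]  # F_1, F_2, F_3
texts = [  # hypothetical sample text data
    "I want to ask about a loan and the loan interest rate",
    "please check my deposit",
    "hello",
]
matrix = keyword_frequencies(texts, keywords)
print(matrix)  # [[0, 2, 0], [0, 0, 1], [0, 0, 0]]
```

Each row plays the role of the per-text frequencies for the keyword set, so the column for a given keyword can later be inspected across all texts.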

S333: judging, according to the frequency, whether the part of the first test result that does not match the first expected result matches a second expected result.

In some embodiments, judging according to the frequency whether the portion of the first test result that does not match the first expected result matches a second expected result may include: determining, according to the frequency, the number of text data in which each keyword appears and the total frequency with which each keyword appears in the first test data set; and judging whether the number of text data in which each keyword appears and the total frequency with which it appears in the first test data set match the second expected result. If so, the test may be determined to pass; otherwise, the test fails.

In some embodiments, take as an example a test of the system's function of extracting high-frequency words in which the first test data set includes text data of the categories financing, loan, deposit, and transfer. The keyword is determined to be "debit card" because the portion of the first test result that does not match the first expected result is the high-frequency word "debit card". The number of text data in which the keyword "debit card" occurs, and the total frequency with which the keyword "debit card" occurs in the first test data set, may be determined from the frequency matrices. In particular, if $f_1$ in the frequency matrix $\{f_1, f_2, f_3, \ldots, f_n\}$ corresponds to the keyword "debit card", the number of frequency matrices in which $f_1$ is not zero can be counted and taken as the number of text data in which the keyword "debit card" appears; the values of $f_1$ in the frequency matrices can also be added to obtain the total frequency of occurrence of the keyword "debit card" in the first test data set. If the number of text data in which the keyword "debit card" appears is greater than a preset number and/or the total frequency of the keyword "debit card" in the first test data set is greater than a preset frequency, the keyword "debit card" is determined to be a high-frequency word in the first test data set; the part of the first test result that does not match the first expected result then matches the second expected result, the test passes, and the system's function of extracting high-frequency words is good. Otherwise, the test fails, and the system's function of extracting high-frequency words has a problem.
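The check described above can be sketched as follows, given the frequency of one keyword in each text of the first test data set. The function name `is_high_frequency` and the `require_both` flag are hypothetical; the flag exists only because the specification leaves the combination of the two conditions open ("and/or"), and the concrete thresholds are left to the caller.

```python
from typing import Sequence


def is_high_frequency(
    keyword_counts: Sequence[int],
    preset_number: int,
    preset_frequency: int,
    require_both: bool = False,
) -> bool:
    """keyword_counts[i] is the frequency of the keyword in the i-th text."""
    # Number of text data in which the keyword appears (non-zero frequency).
    texts_with_keyword = sum(1 for count in keyword_counts if count != 0)
    # Total frequency of the keyword across the whole test data set.
    total_frequency = sum(keyword_counts)
    enough_texts = texts_with_keyword > preset_number
    enough_total = total_frequency > preset_frequency
    if require_both:
        return enough_texts and enough_total
    return enough_texts or enough_total
```

With counts `[2, 0, 1, 3]`, the keyword appears in 3 texts with a total frequency of 6, so a preset number of 2 is exceeded while a preset frequency of 10 is not; the result therefore depends on whether both conditions are required.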

Embodiments of the present description may obtain a first set of test data; the first test data set comprises at least one category of text data; inputting the first test data set into a system to be tested to obtain a first test result; and under the condition that the first test result does not completely accord with a first expected result, if the part of the first test result which does not accord with the first expected result accords with a second expected result, determining that the test is passed. The test method provided by the embodiment of the specification adopts a method of classifying data according to the frequency of occurrence of a plurality of preset keywords in the data to obtain a test data set, and adopts a test method combining two tests, so that the test efficiency can be improved, and the test accuracy can be improved.

The present embodiment also provides a scenario example; fig. 4 is a schematic diagram of this scenario example.

In this scenario example, taking a test of the system's function of extracting high-frequency words as an example, a target data set is obtained from a data source and classified to obtain classified text data, text data of at least one category is selected from the classified text data as the test data set, and the test data set is used to test the system's function of extracting high-frequency words. Specifically, the following steps may be included.

S1: a target data set is acquired.

In this scenario example, the target data may include text data and/or audio data.

S2: desensitizing the target data set.

In the present scenario example, the target data set obtained from the banking system typically involves customers' private data or commercially sensitive data. To protect such sensitive data, the data in the target data set may be modified before use; for example, data desensitization is performed on personal information such as identity card numbers, mobile phone numbers, card numbers, and customer numbers in the target data set.
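As one possible illustration of the desensitization in S2, the sketch below masks the middle of identity card numbers and mobile phone numbers with regular expressions. The patterns (18-character identity card numbers, 11-digit mobile numbers starting with 1), the masking layout, and the name `desensitize` are assumptions for the example; the specification does not prescribe a desensitization method, and card numbers and customer numbers are not covered here.

```python
import re

# Assumed formats: 18-character ID numbers (last character may be X)
# and 11-digit mobile numbers starting with 1, not embedded in longer digit runs.
ID_CARD = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")
MOBILE = re.compile(r"(?<!\d)1\d{10}(?!\d)")


def _mask(match: "re.Match[str]") -> str:
    """Keep the first 3 and last 4 characters and star out the rest."""
    value = match.group(0)
    return value[:3] + "*" * (len(value) - 7) + value[-4:]


def desensitize(text: str) -> str:
    """Mask identity card numbers first, then mobile phone numbers."""
    text = ID_CARD.sub(_mask, text)
    return MOBILE.sub(_mask, text)


print(desensitize("call 13812345678 now"))  # call 138****5678 now
```

Masking rather than deleting keeps the text length and surrounding words intact, so keyword frequencies computed later are unaffected.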

S3: judging whether the data in the target data set is text data.

If so, S5 is performed, otherwise S4 is performed.

S4: the audio data is converted into corresponding text data.

In the present scenario example, the content expressed in the audio data may be output as text by speech recognition to obtain the corresponding text data, after which S5 is performed.
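The branch in S3 and S4 can be sketched as a small normalization step. The speech recognizer is deliberately left as a caller-supplied function, because the specification does not name a recognition engine; `to_text_data`, `transcribe`, and the convention that audio arrives as `bytes` are all assumptions for this example.

```python
from typing import Callable, Iterable, List, Union


def to_text_data(
    records: Iterable[Union[str, bytes]],
    transcribe: Callable[[bytes], str],
) -> List[str]:
    """Pass text records through unchanged and convert audio records to text."""
    texts = []
    for record in records:
        if isinstance(record, str):
            # S3: the record is already text data, so go straight to S5.
            texts.append(record)
        else:
            # S4: convert audio data to text with the supplied recognizer.
            texts.append(transcribe(record))
    return texts
```

Any real speech recognition service could be plugged in as `transcribe`; the classification in S5 then only ever sees text.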

S5: the text data is classified.

In the present scenario example, a plurality of preset keywords "financing", "loan", "deposit", "transfer", "remittance" may be acquired, and the frequency of occurrence of each preset keyword in each text data may be calculated.

In this scenario example, a preset keyword may be used as a category name of the text data. According to the frequency, the category of each text data in the target data set is determined to be one of: undetermined, financing, loan, deposit, transfer, remittance, or multiple categories.
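The classification in S5 can be sketched as below, following the rule stated in claims 4 to 6: no keyword reaching the preset frequency gives the undetermined category, exactly one gives that keyword's category, and two or more give the multiple-categories label. The function name `classify`, the default preset frequency of 1, the label strings, and substring counting via `str.count` are assumptions made for the example.

```python
from typing import Sequence

UNDETERMINED = "undetermined"
MULTIPLE = "multiple categories"


def classify(text: str, keywords: Sequence[str], preset_frequency: int = 1) -> str:
    """Assign a category from how many keywords reach the preset frequency."""
    # Keywords whose frequency in the text is at least the preset frequency.
    hits = [kw for kw in keywords if text.count(kw) >= preset_frequency]
    if not hits:
        return UNDETERMINED
    if len(hits) == 1:
        # The keyword itself is used as the category name.
        return hits[0]
    return MULTIPLE


keywords = ["financing", "loan", "deposit", "transfer", "remittance"]
print(classify("I want a loan", keywords))        # loan
print(classify("transfer my deposit", keywords))  # multiple categories
print(classify("hello", keywords))                # undetermined
```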

S6: selecting text data of at least one category in the classified target data set as a test data set.

In the present scenario example, text data of categories financing, loan, deposit, transfer may be used as the test data set.

S7: testing the system's function of extracting high-frequency words.

Specifically, the test data set may be input into the system to obtain an output result.

S8: judging whether the output result completely matches the first expected result.

In the present scenario example, the first expected result is that the system extracts the high-frequency words "financing", "loan", "deposit", and "transfer". If the output result is that the system extracts the high-frequency words "financing", "loan", "deposit", and "transfer", the output result completely matches the first expected result, and the test passes. If the output result is that the system extracts the high-frequency words "financing", "loan", "deposit", "transfer", and "debit card", the output result does not completely match the first expected result, and S9 may be performed.

S9: determining a category name based on the portion of the output result that does not match the first expected result.

In the present scenario example, if the part of the output result that does not match the first expected result is the extracted high-frequency word "debit card", "debit card" can be determined to be a keyword and used as a category name of the text data.

S10: classifying the test data set according to the determined category name.

In the present scenario example, the frequency of occurrence of the keyword "debit card" in each text data of the test data set may be calculated, and the text data in the test data set may be reclassified according to that frequency; the new categories may be undetermined, financing, loan, deposit, transfer, debit card, or multiple categories.

S11: judging whether the newly classified text data meets a second expected result.

In the present scenario example, the second expected result may be derived from the portion of the output result that does not match the first expected result: the second expected result is that the keyword "debit card" is a high-frequency word in the test data set.

In the present scenario example, the number of text data newly classified into the debit card category and the total frequency of occurrence of the keyword "debit card" in the multi-category text data may be calculated, and whether the second expected result is met is determined from the calculation result. Specifically, if the number of text data of the debit card category is greater than the preset number and/or the total frequency of occurrence of the keyword "debit card" in the multi-category text data is greater than the preset frequency, it may be determined that the keyword "debit card" is a high-frequency word in the test data set; the portion of the output result that does not match the first expected result then matches the second expected result, the test passes, and the system's function of extracting high-frequency words is good. Otherwise, the test fails, and the system's function of extracting high-frequency words has a problem.

Embodiments of the present specification further provide a computer-readable storage medium of a data classification method, where the computer-readable storage medium stores computer program instructions, and when the computer program instructions are executed, the computer program instructions implement: acquiring a target data set; the target dataset comprises a plurality of text data; calculating the occurrence frequency of a plurality of preset keywords in each text data; and determining the category of each text data in the target data set according to the frequency.

In the present embodiment, the storage medium includes, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Cache (Cache), a Hard disk (HDD), or a Memory Card (Memory Card). The memory may be used to store computer program instructions. In this embodiment, the functions and effects specifically realized by the program instructions stored in the computer-readable storage medium can be explained by comparing with other embodiments, and are not described herein again.

Referring to fig. 5, on a software level, an embodiment of the present specification further provides a data classification apparatus, which may specifically include the following structural modules.

An obtaining module 510, configured to obtain a target data set; the target dataset comprises a plurality of text data;

a calculating module 520, configured to calculate a frequency of occurrence of a plurality of preset keywords in each text data;

a classification module 530, configured to determine a category of each text data in the target data set according to the frequency.

In some embodiments, the classification module 530 may include: the first classification submodule is used for determining the type of the text data as an undetermined type under the condition that the occurrence frequency of each preset keyword is a preset frequency; the second classification submodule is used for determining the category of the text data as the category corresponding to the keyword of which the occurrence frequency is not the preset frequency under the condition that the occurrence frequency of only one preset keyword is not the preset frequency and the occurrence frequencies of other preset keywords are the preset frequencies; and the third classification submodule is used for determining that the classification of the text data is a multi-classification under the condition that the occurrence frequency of at least two preset keywords is not the preset frequency.

Embodiments of the present specification also provide a computer-readable storage medium of a testing method, where the computer-readable storage medium stores computer program instructions, and when the computer program instructions are executed, the computer program instructions implement: acquiring a first test data set; the first test data set comprises at least one category of text data; inputting the first test data set into a system to be tested to obtain a first test result; and under the condition that the first test result does not completely accord with a first expected result, if the part of the first test result which does not accord with the first expected result accords with a second expected result, determining that the test is passed.

Referring to fig. 6, on a software level, the embodiment of the present specification further provides a testing apparatus, which may specifically include the following structural modules.

An obtaining module 610, configured to obtain a first test data set; the first test data set comprises at least one category of text data;

the first testing module 620 is configured to input the first testing data set into a system to be tested, and obtain a first testing result;

the second testing module 630 is configured to determine, in the case where the first test result does not completely match the first expected result, that the test is passed if the portion of the first test result that does not match the first expected result matches a second expected result.

In some embodiments, the apparatus may further comprise: and the first determination module is used for determining that the test is passed under the condition that the first test result completely accords with the first expected result.

In some embodiments, the apparatus may further comprise: a second determination module to determine that the test failed if the first test result does not completely meet the first expected result.

In some embodiments, the apparatus may further comprise: and the third determining module is used for determining that the test is not passed under the condition that the part which is not in accordance with the first expected result in the first test result does not conform to the second expected result.

It should be noted that the embodiments in the present specification are described in a progressive manner; the same or similar parts of the embodiments may be referred to each other, and each embodiment focuses on its differences from the other embodiments. In particular, since the apparatus embodiments are substantially similar to the method embodiments, their description is relatively brief, and reference may be made to the corresponding descriptions of the method embodiments for relevant points.

After reading this specification, persons skilled in the art will appreciate that any combination of some or all of the embodiments set forth herein, without inventive faculty, is within the scope of the disclosure and protection of this specification.

In the 1990s, an improvement to a technology could be clearly distinguished as either an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, a transistor, or a switch) or an improvement in software (an improvement to a method flow). However, as technology has advanced, many of today's improvements to method flows can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Thus, it cannot be said that an improvement to a method flow cannot be implemented with hardware physical modules. For example, a Programmable Logic Device (PLD), such as a Field Programmable Gate Array (FPGA), is an integrated circuit whose logic functions are determined by the user programming the device. A designer "integrates" a digital system onto a PLD by programming it, without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Furthermore, instead of manually making integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must likewise be written in a specific programming language, called a Hardware Description Language (HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM, and RHDL (Ruby Hardware Description Language), of which VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are currently the most commonly used. It will also be apparent to those skilled in the art that a hardware circuit implementing a logical method flow can be readily obtained merely by slightly logic-programming the method flow into an integrated circuit using the hardware description languages described above.

The systems, devices, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or by a product with certain functions. One typical implementation device is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smartphone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

From the above description of the embodiments, it is clear to those skilled in the art that the present specification can be implemented by software plus a necessary general hardware platform. Based on such understanding, the technical solutions of the present specification may be essentially or partially implemented in the form of software products, which may be stored in a storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and include instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments of the present specification.

The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

The description is operational with numerous general purpose or special purpose computing system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet-type devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

This description may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

While the specification has been described with examples, those skilled in the art will appreciate that there are numerous variations and permutations of the specification that do not depart from the spirit of the specification, and it is intended that the appended claims include such variations and modifications that do not depart from the spirit of the specification.

Claims

1. A method of data classification, the method comprising:

acquiring a target data set; the target dataset comprises a plurality of text data;

calculating the occurrence frequency of a plurality of preset keywords in each text data;

and determining the category of each text data in the target data set according to the frequency.

2. The method of claim 1, wherein the target data set further comprises audio data; correspondingly, the method also comprises the following steps: the audio data is converted into text data.

3. The method according to claim 1, wherein the category name of the text data includes the preset keyword.

4. The method of claim 1, wherein the determining a category for each text data in the target dataset according to the frequency comprises:

and under the condition that the occurrence frequency of each preset keyword is less than the preset frequency, determining the category of the text data to be an undetermined category.

5. The method of claim 1, wherein the determining a category for each text data in the target dataset according to the frequency comprises:

and under the condition that the frequency of occurrence of only one preset keyword is greater than or equal to the preset frequency and the frequency of occurrence of other preset keywords is less than the preset frequency, determining the category of the text data as the category corresponding to the keyword of which the frequency of occurrence is greater than or equal to the preset frequency.

6. The method of claim 1, wherein the determining a category for each text data in the target dataset according to the frequency comprises:

and determining the category of the text data to be a plurality of categories under the condition that the occurrence frequency of at least two preset keywords is greater than or equal to the preset frequency.

7. A method of testing, the method comprising:

acquiring a first test data set; the first test data set comprises at least one category of text data;

inputting the first test data set into a system to be tested to obtain a first test result;

and under the condition that the first test result does not completely accord with a first expected result, if the part of the first test result which does not accord with the first expected result accords with a second expected result, determining that the test is passed.

8. The method of claim 7, wherein a test is determined to pass if the first test result completely matches a first expected result.

9. The method of claim 7, wherein the test is determined to fail if the first test result does not completely meet the first expected result.

10. The method of claim 7, wherein a test is determined to fail if the portion of the first test result that does not match the first expected result does not match a second expected result.

11. The method of claim 7, wherein said obtaining a first test data set comprises:

determining the category of each text data in the target data set according to the frequency;

and acquiring text data of at least one category in the classified target data set as the first test data set.

12. The method of claim 7, wherein determining whether the portion of the first test result that does not correspond to the first expected result corresponds to a second expected result is based on:

obtaining at least one keyword according to the part of the first test result which is not in accordance with the first expected result;

calculating the occurrence frequency of each keyword in each text data in the first test data set;

and judging whether the part of the first test result which is not in accordance with the first expected result conforms to a second expected result or not according to the frequency.

13. The method of claim 12, wherein said determining whether the portion of the first test result that does not correspond to the first expected result matches a second expected result based on the frequency comprises:

determining the number of the text data of each keyword according to the frequency and the total frequency of the keywords in the first test data set;

and judging whether the quantity of the text data of each keyword and the total frequency of the keywords appearing in the first test data set accord with a second expected result or not.

14. An apparatus for classifying data, the apparatus comprising:

an acquisition module for acquiring a target data set; the target dataset comprises a plurality of text data;

the calculation module is used for calculating the occurrence frequency of a plurality of preset keywords in each text data;

and the classification module is used for determining the category of each text data in the target data set according to the frequency.

15. The apparatus of claim 14, the classification module comprising:

and the first classification submodule is used for determining the type of the text data as the undetermined type under the condition that the occurrence frequency of each preset keyword is the preset frequency.

16. The apparatus of claim 14, the classification module comprising:

and the second classification submodule is used for determining that the category of the text data is the category corresponding to the keyword of which the occurrence frequency is not the preset frequency under the condition that the occurrence frequency of only one preset keyword is not the preset frequency and the occurrence frequencies of other preset keywords are the preset frequencies.

17. The apparatus of claim 14, the classification module comprising:

and the third classification submodule is used for determining that the classification of the text data is a multi-classification under the condition that the occurrence frequency of at least two preset keywords is not the preset frequency.

18. A test apparatus, the apparatus comprising:

an acquisition module for acquiring a first test data set; the first test data set comprises at least one category of text data;

the first testing module is used for inputting the first testing data set into a system to be tested and acquiring a first testing result;

and the second testing module is used for passing the test if the part of the first testing result which is not in accordance with the first expected result accords with a second expected result under the condition that the first testing result does not completely accord with the first expected result.

19. The apparatus of claim 18, further comprising:

and the first determination module is used for determining that the test is passed under the condition that the first test result completely accords with the first expected result.

20. The apparatus of claim 18, further comprising:

a second determination module to determine that the test failed if the first test result does not completely meet the first expected result.

21. The apparatus of claim 18, further comprising:

and the third determining module is used for determining that the test is not passed under the condition that the part which is not in accordance with the first expected result in the first test result does not conform to the second expected result.

22. A computer readable storage medium having computer program instructions stored thereon that when executed implement: acquiring a target data set; the target dataset comprises a plurality of text data; calculating the occurrence frequency of a plurality of preset keywords in each text data; and determining the category of each text data in the target data set according to the frequency.

23. A computer readable storage medium having computer program instructions stored thereon that when executed implement: acquiring a first test data set; the first test data set comprises at least one category of text data; inputting the first test data set into a system to be tested to obtain a first test result; and under the condition that the first test result does not completely accord with a first expected result, if the part of the first test result which does not accord with the first expected result accords with a second expected result, determining that the test is passed.