CN110941719B

CN110941719B - Data classification method, testing method, device and storage medium

Info

Publication number: CN110941719B
Application number: CN201911214205.9A
Authority: CN
Inventors: 杨玉; 刘华英; 刘燕; 李凤亭; 梁雨霏; 刘晓刚
Original assignee: Bank of China Ltd
Current assignee: Bank of China Ltd
Priority date: 2019-12-02
Filing date: 2019-12-02
Publication date: 2023-12-19
Anticipated expiration: 2039-12-02
Also published as: CN110941719A

Abstract

The embodiments of the specification provide a data classification method, a testing method, a device and a storage medium. The method comprises the following steps: acquiring a target data set, the target data set including a plurality of text data; calculating the occurrence frequency of a plurality of preset keywords in each text data; and determining the category of each text data in the target data set according to the frequency. The data classification method provided by the embodiments of the specification classifies data according to the occurrence frequency of a plurality of preset keywords in the data, which improves the accuracy of data classification; a large amount of data can also be classified automatically, which improves the efficiency of data classification.

Description

Data classification method, testing method, device and storage medium

Technical Field

Embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a data classification method, a test method, a device, and a storage medium.

Background

An intelligent customer service system collects massive voice data and text data from a customer service center (the voice of the customer), sorts and refines valuable information through big data analysis, and pushes that information to business departments so that products and services can be continuously improved. This helps the customer service center change from a service department into a decision support department, and from an after-sales service link into a participant in the whole service process. The big data analysis is performed by an analysis system. Before the big data analysis system is put into use at the Bank of China, the test line of the software center needs to perform functional tests on the analysis system to verify whether its functions work normally, for example the clustering and classification functions, the business model building function, the high-frequency word extraction function, and the hot word frequency ratio function. These tests need to be supported by a large amount of data: the larger the data volume, the more thoroughly the big data analysis system is tested.

In existing testing methods for big data analysis systems, the required test data, namely sample data, is obtained by manually sampling an acquired data source, and the relevant functions are then verified using the sample data. Because the test data is obtained by manual sampling and each piece of text data has to be classified and screened, the process is time-consuming, inefficient and costly when the data volume is large. In addition, because the accuracy of data classification in the test data is not high, the accuracy and efficiency of the test result are affected.

Disclosure of Invention

An object of the embodiments of the present disclosure is to provide a data classification method, a test method, a device, and a storage medium, so as to improve accuracy and efficiency of data classification and accuracy and efficiency of system testing.

In order to solve the above-mentioned problems, embodiments of the present disclosure provide a data classification method, a test method, an apparatus, and a storage medium.

A method of data classification, the method comprising: acquiring a target data set; the target dataset includes a plurality of text data; calculating the occurrence frequency of a plurality of preset keywords in each text data; and determining the category of each text data in the target data set according to the frequency.

A method of testing, the method comprising: acquiring a first test data set; the first test dataset includes at least one category of text data; inputting the first test data set into a system to be tested, and obtaining a first test result; and if the first test result does not completely accord with the first expected result, determining that the test is passed if the part of the first test result which does not accord with the first expected result accords with the second expected result.

A computer readable storage medium having stored thereon computer program instructions that when executed implement: acquiring a target data set; the target dataset includes a plurality of text data; acquiring a plurality of preset keywords, and calculating the occurrence frequency of each preset keyword in each text data; and determining the category of each text data in the target data set according to the frequency.

A computer readable storage medium having stored thereon computer program instructions that when executed implement: acquiring a first test data set; the first test dataset includes at least one category of text data; inputting the first test data set into a system to be tested, and obtaining a first test result; and if the first test result does not completely accord with the first expected result, determining that the test is passed if the part of the first test result which does not accord with the first expected result accords with the second expected result.

As can be seen from the technical solutions provided by the embodiments of the present specification, the embodiments of the present specification may acquire a target data set; the target dataset includes a plurality of text data; acquiring a plurality of preset keywords, and calculating the occurrence frequency of each preset keyword in each text data; and determining the category of each text data in the target data set according to the frequency. The data classification method provided by the embodiment of the specification can classify the data according to the occurrence frequency of a plurality of preset keywords in the data, improves the accuracy of data classification, further can automatically classify a large amount of data, and improves the efficiency of data classification.

The embodiment of the specification can acquire a first test data set; the first test dataset includes at least one category of text data; inputting the first test data set into a system to be tested, and obtaining a first test result; and if the first test result does not completely accord with the first expected result, determining that the test is passed if the part of the first test result which does not accord with the first expected result accords with the second expected result. The test method for the system provided by the embodiment of the specification adopts a test method combining two tests, so that the test efficiency can be improved, and the test accuracy can be improved.

Drawings

In order to more clearly illustrate the embodiments of the present description or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments described in the present description, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a data classification method according to an embodiment of the present disclosure;

FIG. 2 is a schematic diagram of an example of a scenario in an embodiment of the present disclosure;

FIG. 3 is a flow chart of a testing method according to an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of an example of a scenario in an embodiment of the present disclosure;

FIG. 5 is a functional block diagram of a data classification device according to an embodiment of the present disclosure;

FIG. 6 is a functional block diagram of a testing device according to an embodiment of the present disclosure.

Detailed Description

The technical solutions of the embodiments of the present specification will be clearly and completely described below with reference to the drawings in the embodiments of the present specification, and it is apparent that the described embodiments are only some embodiments of the present specification, not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are intended to be within the scope of the present disclosure.

Existing text classification methods generally use artificial intelligence to build a classification model and classify data through that model. The classification model is generally built on algorithms such as a support vector machine (SVM), a naive Bayes classifier, K-nearest neighbors (KNN), a decision tree, or a random forest. However, classifying data with a model built in this way usually involves uncertainty, that is, a certain misclassification rate, so the accuracy of data classification is insufficient. A more accurate text classification method is therefore needed.

In this embodiment, the data classification method may be executed by an electronic device with a logic operation function. The electronic device may be a server or a client, and the client may be a desktop computer, a tablet computer, a notebook computer, a smart phone, a workstation, or the like. Of course, the client is not limited to a physical electronic device; it may also be software running on the electronic device, for example program software formed by program development that runs on the above-mentioned electronic device.

Fig. 1 is a flowchart of a data classification method according to an embodiment of the present disclosure. As shown in fig. 1, the data classification method may include the following steps.

S110: acquiring a target data set; the target dataset includes a plurality of text data.

In some embodiments, the target data set may include a plurality of text data, such as text data in XML, HTML, or other formats. The target data set may also include audio data, such as AIFF, MP3 audio data.

In some embodiments, the target data set may be obtained from a data source, for example by downloading the target data set from a specified database. For example, in a banking system, a database in an intelligent customer service platform may be used as the data source from which audio data and/or text data is obtained. The audio data may be voice recordings of calls between an agent and a customer; the text data may be text interactions between an agent and a customer, collected customer opinions or suggestions, and the like.

In some embodiments, the target data set obtained from the banking system generally involves customers' security data or commercially sensitive data. To protect sensitive private data, the data in the target data set may be transformed before use; for example, personal information in the target data set such as identity card numbers, mobile phone numbers, card numbers and customer numbers may be subjected to data desensitization.
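As a minimal sketch of such desensitization, the following Python function masks number-like personal information with regular expressions. The patterns, the masking rule (keeping the last four characters) and the name `desensitize` are illustrative assumptions and are not specified by this disclosure; real number formats depend on the data source.

```python
import re

# Hypothetical patterns, assumed for illustration only.
ID_CARD = re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)")  # 18-character identity card number
CARD = re.compile(r"(?<!\d)\d{16,19}(?!\d)")        # 16 to 19 digit card number
MOBILE = re.compile(r"(?<!\d)1\d{10}(?!\d)")        # 11-digit mobile phone number


def desensitize(text: str) -> str:
    """Mask sensitive numbers, keeping only their last four characters."""

    def mask(match: re.Match) -> str:
        value = match.group(0)
        return "*" * (len(value) - 4) + value[-4:]

    # Apply the identity card pattern first so that the card-number
    # pattern cannot consume part of an identity card number.
    for pattern in (ID_CARD, CARD, MOBILE):
        text = pattern.sub(mask, text)
    return text
```

For example, `desensitize("call 13812345678 now")` replaces the first seven digits of the phone number with asterisks and leaves the rest of the text unchanged.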

In some embodiments, if the target dataset includes audio data, the audio data may also be converted to text data. Specifically, the content expressed in the audio data may be output in the form of text through a voice recognition technology.

In some embodiments, after the target data set is acquired, it may be determined whether the data in the target data set is text data or audio data. If the data in the target data set is text data, a set of text data $\{T_1, T_2, T_3, \ldots, T_m\}$ may be obtained, where $T_m$ ($m = 1, 2, 3, \ldots$) represents text data. If the data in the target data set includes a set of audio data $\{S_1, S_2, S_3, \ldots, S_n\}$, the audio data may be converted into corresponding text data to obtain a set of converted text data $\{T_{S1}, T_{S2}, T_{S3}, \ldots, T_{Sn}\}$, where $S_n$ ($n = 1, 2, 3, \ldots$) represents audio data and $T_{Sn}$ represents the text data corresponding to audio data $S_n$. A target data set containing only text data, $\{T_1, T_{S1}, T_2, T_{S2}, T_3, T_{S3}, \ldots, T_m, T_{Sn}\}$, can then be obtained.
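The merging of original and converted text data can be sketched as follows. The speech recognition step is passed in as a callable `transcribe`, which is a hypothetical placeholder for whatever voice recognition technology is used; this disclosure does not name one. Because the target data set is a set, the order of its elements is not significant, so the sketch simply appends the transcripts after the original texts.

```python
from typing import Callable, List


def build_text_dataset(
    texts: List[str],
    audio_clips: List[bytes],
    transcribe: Callable[[bytes], str],
) -> List[str]:
    """Return a data set containing only text data.

    texts       -- the original text data T_1 .. T_m
    audio_clips -- the audio data S_1 .. S_n
    transcribe  -- hypothetical speech-to-text function (an assumption)
    """
    # Convert every audio clip S_i into its text data T_Si.
    transcripts = [transcribe(clip) for clip in audio_clips]
    return list(texts) + transcripts
```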

S120: the frequency of occurrence of a plurality of preset keywords in each text data is calculated.

In some embodiments, the preset keywords may be used to characterize the category of text data in the target dataset. For example, the category name of the text data may include the preset keyword, or the category name of the text data may correspond to the preset keyword. Specifically, if the category names of the text data in the target data set are determined as a financial category, a loan category, a deposit category, and a transfer category, the plurality of preset keywords may include "financial", "loan", "deposit", "transfer", that is, in this case, the keywords may be regarded as the category names of the text data. If the category names of the text data in the target data set are determined as an a category, a b category, a c category and a d category, the a category may be determined as a category corresponding to a preset keyword "financial", the b category is determined as a category corresponding to a preset keyword "loan", the c category is determined as a category corresponding to a preset keyword "deposit", and the d category is determined as a category corresponding to a preset keyword "transfer".

In the embodiment of the present disclosure, the server may acquire a plurality of preset keywords in any manner. For example, the user may input a plurality of preset keywords, and the server may receive the keywords; for another example, other electronic devices except the server may send a plurality of preset keywords to the server, and the server may receive the keywords.

In some embodiments, taking the case where the plurality of preset keywords include "financing", "loan" and "deposit" as an example, after the plurality of preset keywords are obtained, a keyword set $\{F_1, F_2, F_3, \ldots, F_n\}$ may be obtained, where $F_1$ = "financing", $F_2$ = "loan", $F_3$ = "deposit", and $n = 1, 2, 3, \ldots$.

In the embodiment of the present disclosure, the server may identify the text content in text data, and may calculate the occurrence frequency of each preset keyword in each text data by identifying each text data. In particular, the server may read the target data set $\{T_1, T_{S1}, T_2, T_{S2}, T_3, T_{S3}, \ldots, T_m, T_{Sn}\}$ containing the text data and calculate the occurrence frequency of each preset keyword in each text data to obtain a frequency matrix $\{f_1, f_2, f_3, \ldots, f_n\}$. Each text corresponds to one frequency matrix; $f_n$ is the occurrence frequency of keyword $F_n$, $n = 1, 2, 3, \ldots$.
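A minimal sketch of this frequency calculation is given below. It assumes that "occurrence frequency" means the number of non-overlapping substring occurrences of a keyword in the text, which is what Python's `str.count` returns; the function names are illustrative and not part of this disclosure.

```python
from typing import List


def keyword_frequencies(text: str, keywords: List[str]) -> List[int]:
    """Return [f_1, ..., f_n], where f_i counts keywords[i] in text."""
    return [text.count(keyword) for keyword in keywords]


def frequency_matrices(texts: List[str], keywords: List[str]) -> List[List[int]]:
    """Return one frequency matrix per text data, in the order of texts."""
    return [keyword_frequencies(text, keywords) for text in texts]
```

With the keyword set "financing", "loan", "deposit", the text "loan and loan, then deposit" yields the frequency matrix `[0, 2, 1]`.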

S130: and determining the category of each text data in the target data set according to the frequency.

In the embodiment of the present specification, the category of each text data in the target data set may be determined according to the occurrence frequency of each preset keyword in each text data. Specifically, the category of the text data may be determined as a pending category, a single category, or a multi-category. The pending category indicates that the text data does not belong to any of the categories corresponding to the preset keywords; the single category indicates that the text data belongs to the text category corresponding to one preset keyword; the multi-category indicates that the text data belongs to the text categories corresponding to at least two preset keywords at the same time.

In the embodiment of the present disclosure, if the occurrence frequency of every preset keyword in the text data is less than the preset frequency, the text data does not belong to any of the categories corresponding to the preset keywords, and its category is the pending category; if the occurrence frequency of one preset keyword in the text data is greater than or equal to the preset frequency and the occurrence frequencies of all other preset keywords are less than the preset frequency, the category of the text data is a single category, and the text data belongs to the category corresponding to the keyword whose frequency is greater than or equal to the preset frequency; if the occurrence frequencies of at least two preset keywords in the text data are greater than or equal to the preset frequency, the category of the text data is multi-category.

In some embodiments, the preset frequency may be zero. Specifically, if the occurrence frequency of each preset keyword in the text data is zero, it is obvious that the text data does not belong to any one of the categories corresponding to the preset keywords; if the frequency of occurrence of a certain preset keyword in the text data is greater than zero and the frequency of occurrence of other preset keywords is equal to zero, the category of the text data is a single category, and the text data belongs to the category corresponding to the keyword with the frequency of occurrence greater than zero; if the occurrence frequency of at least two preset keywords in the text data is greater than zero, the category of the text data is multi-category. In the embodiment of the present specification, in order to determine the category of the text data more accurately, the preset frequency may also be any value greater than zero, for example, may be 1, 3, 10, etc., which is not limited in the present specification.
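The decision rule can be sketched as follows. The text compares frequencies with "greater than or equal to" for a general preset frequency but with "greater than zero" when the preset frequency is zero; to cover both cases, this sketch assumes that a keyword counts as present only if its frequency is at least `max(preset_frequency, 1)`. That reconciliation, like the names used here, is an assumption of the sketch.

```python
from typing import List

PENDING = "pending"
MULTI = "multi-category"


def classify(
    frequencies: List[int],
    keywords: List[str],
    preset_frequency: int = 0,
) -> str:
    """Return the category name for one frequency matrix."""
    # Assumption: a keyword must occur at least once, and at least
    # preset_frequency times, to count as present in the text data.
    threshold = max(preset_frequency, 1)
    hits = [kw for kw, f in zip(keywords, frequencies) if f >= threshold]
    if not hits:
        return PENDING   # no preset keyword is present
    if len(hits) == 1:
        return hits[0]   # single category, named after its keyword
    return MULTI         # at least two preset keywords are present
```

Raising the preset frequency makes the rule stricter: the frequency matrix `[1, 3, 0]` is multi-category with a preset frequency of zero, but belongs to the single "loan" category with a preset frequency of 3.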

The following describes, for the case where the preset frequency is zero, how the category of each text data in the target data set is determined according to the frequency in the embodiment of the present specification. In some embodiments, for each text data, if the occurrence frequency of every preset keyword is zero, the text data does not belong to any of the categories corresponding to the preset keywords, and the category of the text data may be determined as the pending category. Specifically, the server may determine whether the frequency matrix of the text data is equal to zero; if so, that is, $\{f_1 = 0, f_2 = 0, f_3 = 0, \ldots, f_n = 0\}$, the category of the text data may be determined as the pending category.

In some embodiments, for each text data, if the occurrence frequencies of the preset keywords are not all zero, that is, one or more preset keywords appear in the text data, it may be further determined whether the category to which the text data belongs is single. If only one of the plurality of preset keywords appears in the text data and none of the others appears, the category of the text data is determined to be single; otherwise, the category of the text data is not single and is determined to be multi-category.

In some embodiments, the server may determine whether the category to which the text data belongs is single through the frequency matrix of the text data. In particular, if there is one and only one non-zero value in the frequency matrix, for example $\{f_1 = 0, f_2 = 0, f_3 = r, \ldots, f_n = 0\}$ with $r \in \{1, 2, 3, \ldots\}$, it can be determined that the category to which the text data belongs is single. If the non-zero value in the frequency matrix is not unique, for example $\{f_1 = 0, f_2 = 0, f_3 = 0, f_4 = k, f_5 = j, \ldots, f_n = 0\}$ with $k$ and $j$ not 0, it can be determined that the category of the text data is not single.

In some embodiments, if the category of the text data is single, that is, only one preset keyword has a non-zero occurrence frequency and all other preset keywords have an occurrence frequency of zero, the keyword with the non-zero frequency may be recorded, and the category of the text data is determined as the category corresponding to that keyword. For example, for the frequency matrix $\{f_1 = 0, f_2 = 0, f_3 = r, \ldots, f_n = 0\}$ with $r \in \{1, 2, 3, \ldots\}$, the category of the text data is determined as the category corresponding to keyword $F_3$, to which $f_3$ corresponds; if the preset keywords are used as the category names of the text data, the category of the text data is determined as the deposit category.

In some embodiments, if the category of the text data is not single, that is, the occurrence frequencies of at least two preset keywords are not zero, the keywords with non-zero occurrence frequencies may be recorded, and the category of the text data may be determined as multi-category. For example, for the frequency matrix $\{f_1 = 0, f_2 = 0, f_3 = 0, f_4 = k, f_5 = j, \ldots, f_n = 0\}$ with $k$ and $j$ not 0, the number of non-zero frequencies can be counted to obtain the number of different preset keywords in the text data, keyword $F_4$ corresponding to frequency $f_4$ and keyword $F_5$ corresponding to frequency $f_5$ are recorded, and the category of the text data is determined as multi-category.

The embodiment of the specification can acquire a target data set; the target dataset includes a plurality of text data; acquiring a plurality of preset keywords, and calculating the occurrence frequency of each preset keyword in each text data; and determining the category of each text data in the target data set according to the frequency. The data classification method provided by the embodiment of the specification can classify the data according to the occurrence frequency of a plurality of preset keywords in the data, improves the accuracy of data classification, further can automatically classify a large amount of data, and improves the efficiency of data classification.

The present embodiment provides a scenario example. Fig. 2 is a schematic diagram of the scenario example provided in the present embodiment.

In this scenario example, the user may input a preset keyword, and the server may receive the preset keyword as a category name of the text data.

Specifically, in this scenario example, the preset keywords may include "financial", "loan" and "deposit". The server may determine the category names of the text data as financial, loan and deposit according to the preset keywords; of course, the category names of the text data also include pending and multi-category.

In this scenario example, the server may create a folder corresponding to a category name of the text data, for example, designate a folder corresponding to a financial category as financial, designate a folder corresponding to a loan category as loan, designate a folder corresponding to a deposit category as deposit, designate a folder corresponding to a pending category as pending, and designate a folder corresponding to a multi-category as multi-category.

In this scenario example, the server may obtain a preset keyword set, that is, a set of text data category names $\{F_0, F_1, F_2, F_3, \ldots, F_{n+1}\}$, where $F_0$ = "pending", $F_1$ = "financing", $F_2$ = "loan", $F_3$ = "deposit", ..., $F_{n+1}$ = "multi-category". The server may also create the folders "pending", "financing", "loan", "deposit", ..., "multi-category", corresponding to the category names of the text data, under a preset storage path.
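Creating the category folders can be sketched with Python's standard `pathlib` module. The function name and the choice to pass the storage path as a parameter are illustrative assumptions.

```python
from pathlib import Path
from typing import List


def create_category_folders(storage_path: Path, keywords: List[str]) -> List[Path]:
    """Create one folder per category name under storage_path.

    The category names are F_0 = "pending", the preset keywords
    F_1 .. F_n, and F_{n+1} = "multi-category".
    """
    names = ["pending", *keywords, "multi-category"]
    folders = [storage_path / name for name in names]
    for folder in folders:
        # exist_ok=True makes the call safe to repeat on later runs.
        folder.mkdir(parents=True, exist_ok=True)
    return folders
```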

In this scenario example, the occurrence frequency of each category name in each text may be calculated to obtain a frequency matrix. In particular, the server may read the text data $\{T_1, T_{S1}, T_2, T_{S2}, T_3, T_{S3}, \ldots, T_m, T_{Sn}\}$ and calculate the frequency with which each custom category name $\{F_1, F_2, F_3, \ldots, F_n\}$ appears in each text to obtain a frequency matrix $\{f_1, f_2, f_3, \ldots, f_n\}$. Each text corresponds to one frequency matrix.

In this scenario example, the preset frequency may be zero, and after the frequency matrix is obtained, it may be determined whether the frequency matrix is zero. Specifically, the server may select the text data whose frequencies are not all zero. If the frequency matrix corresponding to the text data is zero, that is, $\{f_1 = 0, f_2 = 0, f_3 = 0, \ldots, f_n = 0\}$, the text data can be placed under the "pending" folder; if the frequency matrix corresponding to the text data is not zero, the next step is carried out.

In this scenario example, if the frequency matrix corresponding to the text data is not zero, it is determined whether the category to which the text data belongs is single. Specifically, if the text data belongs to a single category, that is, there is only one non-zero value in its frequency matrix, for example $\{f_1 = 0, f_2 = 0, f_3 = r, \ldots, f_n = 0\}$ with $r \in \{1, 2, 3, \ldots\}$, the text data belongs to category $F_3$ = "deposit"; the text data is put under the corresponding "deposit" folder, and the corresponding frequency value $f_3 = r$ is recorded. If the text data does not belong to a single category, that is, the non-zero value in the frequency matrix is not unique, for example $\{f_1 = 0, f_2 = 0, f_3 = 0, f_4 = k, f_5 = j, \ldots, f_n = 0\}$ with $k$ and $j$ not 0, the number $\lambda$ of categories to which the text data belongs is calculated ($\lambda = 2$ in this scenario example), the names of all those categories are recorded (here the category names corresponding to $F_4$ and $F_5$), and the text data is put under the "multi-category" folder.

In this scenario example, the server may also determine whether all text data $\{T_1, T_{S1}, T_2, T_{S2}, T_3, T_{S3}, \ldots, T_m, T_{Sn}\}$ have been traversed. If all the text data have been traversed, the process ends; if not, the classification process continues with the remaining text data.
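The whole scenario, with a preset frequency of zero, can be sketched as one traversal. To stay self-contained, this sketch keeps the "folders" in memory as a dictionary rather than on disk, and stores the recorded keywords and frequency values next to each text; these representation choices and the function name are assumptions of the sketch.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, Dict[str, int]]  # (text data, {keyword: non-zero frequency})


def sort_into_folders(texts: List[str], keywords: List[str]) -> Dict[str, List[Record]]:
    """Traverse all text data and place each one under its category folder."""
    folders: Dict[str, List[Record]] = defaultdict(list)
    for text in texts:
        # Frequency matrix of this text, keeping only non-zero entries.
        matched = {kw: text.count(kw) for kw in keywords if text.count(kw) > 0}
        lam = len(matched)  # number of categories the text data belongs to
        if lam == 0:
            name = "pending"
        elif lam == 1:
            name = next(iter(matched))  # the single matching keyword
        else:
            name = "multi-category"
        folders[name].append((text, matched))
    return dict(folders)
```

Each stored record carries the information the scenario asks to be recorded: the frequency value for a single-category text, and the names of all matching categories for a multi-category text.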

Fig. 3 is a flowchart of a testing method according to an embodiment of the present disclosure. As shown in fig. 3, the test method may include the following steps.

S310: acquiring a first test data set; the first test dataset includes at least one category of text data;

Big data analysis refers to the analysis of data of huge scale. Big data can be summarized by the "5 Vs": Volume (large amount of data), Velocity (high speed), Variety (many types), Value, and Veracity (authenticity). Big data analysis can be performed by an analysis system. Before the big data analysis system is put into use, functional tests can be performed on it, covering for example its clustering and classification functions, business model building function, high-frequency word extraction function, and hot word frequency ratio function. These tests verify whether the functions of the analysis system work normally and need to be supported by a large amount of test data; the larger the data volume, the more thoroughly the big data analysis system is tested.

In the embodiment of the present disclosure, testing different functions of the analysis system requires different types of test data. For example, testing the clustering and classification functions of the analysis system requires classified data as test data, and testing the high-frequency word extraction function of the analysis system requires data containing certain high-frequency words as test data.

In some embodiments, the first test data set may be employed as test data for testing the functions of the analysis system. The first test data set may comprise text data of at least one category.

In some embodiments, the first test dataset may include classified text data such that the expected results of the test may be calculated based on the classification of the text data. For example, to test the high frequency word extraction function of the analysis system, text data containing different keywords may be classified into different categories, and the text data may be used as a first test data set, and the expected result of the test may be determined according to the category of the text data.

In some embodiments, the first test data set may be obtained according to the following steps.

S311: acquiring a target data set; the target dataset includes a plurality of text data.

In some embodiments, the target data set may include text data, such as text data in XML, HTML, or other formats. The target data set may also include audio data, such as AIFF, MP3 audio data.

In some embodiments, the target data set may be obtained from a data source, for example by downloading the target data set from a specified database. Specifically, in a banking system, a database in the intelligent customer service platform may be used as the data source from which voice data and/or text data is obtained. The voice data may be audio data of call recordings between an agent and a customer; the text data may be text interactions between an agent and a customer, collected customer opinions or suggestions, and the like.

In some embodiments, after the target data set is acquired, it may be determined whether the data in the target data set is text data or audio data. If the data in the target data set is text data, a set of text data $\{T_1, T_2, T_3, \ldots, T_m\}$ may be obtained, where $T_m$ ($m = 1, 2, 3, \ldots$) represents text data. If the data in the target data set includes a set of audio data $\{S_1, S_2, S_3, \ldots, S_n\}$, the audio data may be converted into corresponding text data to obtain a set of converted text data $\{T_{S1}, T_{S2}, T_{S3}, \ldots, T_{Sn}\}$, where $S_n$ ($n = 1, 2, 3, \ldots$) represents audio data and $T_{Sn}$ represents the text data corresponding to audio data $S_n$. A target data set containing only text data, $\{T_1, T_{S1}, T_2, T_{S2}, T_3, T_{S3}, \ldots, T_m, T_{Sn}\}$, can then be obtained.

S312: the frequency of occurrence of a plurality of preset keywords in each text data is calculated.

S313: and determining the category of each text data in the target data set according to the frequency.

In the following, in the case where the preset frequency is zero, how the category of each text data in the target data set is determined according to the frequency in the embodiment of the present specification will be described.

In some embodiments, for each text data, if the occurrence frequency of every preset keyword is zero, the text data does not belong to any of the categories corresponding to the preset keywords, and the category of the text data may be determined as the pending category. Specifically, the server may determine whether the frequency matrix of the text data is equal to zero; if so, that is, $\{f_1 = 0, f_2 = 0, f_3 = 0, \ldots, f_n = 0\}$, the category of the text data may be determined as the pending category.

In some embodiments, the server may determine whether the category to which the text data belongs is single through the frequency matrix of the text data. In particular, if there is one and only one non-zero value in the frequency matrix, for example $\{f_1 = 0, f_2 = 0, f_3 = r, \ldots, f_n = 0\}$ with $r \in \{1, 2, 3, \ldots\}$, it can be determined that the category to which the text data belongs is single. If the non-zero value in the frequency matrix is not unique, for example $\{f_1 = 0, f_2 = 0, f_3 = 0, f_4 = k, f_5 = j, \ldots, f_n = 0\}$ with $k$ and $j$ not 0, it can be determined that the category of the text data is not single.

In some embodiments, if the category of the text data is single, that is, only one preset keyword has a non-zero occurrence frequency, the keyword with the non-zero frequency may be recorded, and the category of the text data is determined as the category corresponding to that keyword. For example, for the frequency matrix $\{f_1 = 0, f_2 = 0, f_3 = r, \ldots, f_n = 0\}$ with $r \in \{1, 2, 3, \ldots\}$, the category of the text data is determined as the category corresponding to keyword $F_3$, to which $f_3$ corresponds; if the preset keywords are used as the category names of the text data, the category of the text data is determined as the deposit category.

S314: and acquiring text data of at least one category in the classified target data set as the first test data set.

In the embodiments of this specification, text data of at least one category from the classified target data set may be used as the first test data set, as required for testing the functions of the analysis system.

S320: Inputting the first test data set into a system to be tested, and obtaining a first test result.

In the embodiments of this specification, the first test result is the output of the system to be tested after the first test data set is input into it.

S330: In the case where the first test result does not completely meet the first expected result, determining that the test passes if the part of the first test result that does not meet the first expected result meets the second expected result.

In the embodiments of this specification, the first expected result may be determined from the first test data set. Specifically, take testing the system's function of extracting high-frequency words as an example, where the first test data set includes text data in the financing, loan, deposit and transfer categories. The text data of each category contains the corresponding high-frequency word; for example, text data of the financing category contains the high-frequency word "financing", and text data of the loan category contains the high-frequency word "loan". If the system functions well, then after the first test data set is input into the system, the system outputs the extracted high-frequency words "financing", "loan", "deposit" and "transfer". Thus, the first expected result for the first test data set may be determined to be that the system extracts the high-frequency words "financing", "loan", "deposit" and "transfer". Of course, if the first test data set includes text data of other categories, the corresponding first expected result may likewise be determined from the categories of the text data.

In embodiments of the present description, the first test result and the first expected result may be compared to determine whether the test passes.

In some embodiments, where the first test result fully meets the first expected result, the test may be determined to pass. For example, if the first test result is that the system extracts the high-frequency words "financing", "loan", "deposit" and "transfer", and the first expected result is that the system extracts the high-frequency words "financing", "loan", "deposit" and "transfer", the first test result fully meets the first expected result, so the system's function of extracting high-frequency words can be determined to work well and the test is determined to pass.

In some embodiments, in the case where the first test result does not meet the first expected result at all, the test may be determined to fail. For example, if the first test result is that the system does not extract any high-frequency word, or if the high-frequency words extracted by the system are completely different from the first expected result, it may be determined that the system's function of extracting high-frequency words is faulty, and the test is determined to fail.

In some embodiments, in the case where the first test result does not completely meet the first expected result, the test is determined to pass if the part of the first test result that does not meet the first expected result meets a second expected result. The second expected result may be determined from the part of the first test result that does not meet the first expected result. For example, suppose the first test result is that the system extracts the high-frequency word "debit card" in addition to "financing", "loan", "deposit" and "transfer", while the first expected result is that the system extracts the high-frequency words "financing", "loan", "deposit" and "transfer". The first test result then does not completely meet the first expected result. In this case, it cannot yet be concluded that the system's function of extracting high-frequency words is faulty: the text data in the first test data set may really contain the high-frequency word "debit card", which is absent from the first expected result only because it was not taken into account when the first expected result was derived from the categories of the text data. If the system's function of extracting high-frequency words works well, the text data of the categories in the first test data set does contain the high-frequency word "debit card". Thus, the second expected result may be determined to be that the keyword "debit card" is a high-frequency word in the first test data set, and whether the system's function of extracting high-frequency words works well may be determined by judging whether the keyword "debit card" is a high-frequency word in the first test data set.
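The three outcomes described above (fully meets, does not meet at all, partially meets) can be sketched as a small decision function. This is only an illustration under stated assumptions: test results are modeled as sets of extracted words, the name `judge_test` and the callback `meets_second_expected` are hypothetical, and treating a result that is missing expected words as a failure is an assumption of this sketch, since the specification only discusses extra words.

```python
from typing import Callable, Set


def judge_test(
    first_result: Set[str],
    first_expected: Set[str],
    meets_second_expected: Callable[[Set[str]], bool],
) -> bool:
    """Return True if the test passes under the cases described above."""
    if first_result == first_expected:
        return True                      # fully meets the first expected result
    if not (first_result & first_expected):
        return False                     # does not meet it at all
    if first_expected - first_result:
        return False                     # assumption: missing expected words fail
    # Partial match: only the unexpected part is checked against the
    # second expected result.
    unexpected = first_result - first_expected
    return meets_second_expected(unexpected)


expected = {"financing", "loan", "deposit", "transfer"}
result = expected | {"debit card"}
# Stand-in check: the real check would count keyword frequencies in the data set.
print(judge_test(result, expected, lambda extra: extra == {"debit card"}))
```

Passing the second check in as a callable keeps the comparison logic separate from the frequency counting that decides whether the unexpected words really are high-frequency words.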

In some embodiments, it may be determined whether the portion of the first test result that does not meet the first expected result meets a second expected result according to the following steps.

S331: Obtaining at least one keyword from the part of the first test result that does not meet the first expected result.

In some embodiments, take testing the system's function of extracting high-frequency words as an example, where the first test data set includes text data in the financing, loan, deposit and transfer categories. If the first test result is that the system extracts the high-frequency words "financing", "loan", "deposit", "transfer" and "debit card", and the first expected result is that the system extracts the high-frequency words "financing", "loan", "deposit" and "transfer", the part of the first test result that does not meet the first expected result is the high-frequency word "debit card". The keyword may then be determined to be "debit card". Of course, testing the function of extracting high-frequency words with a first test data set containing text data of the financing, loan, deposit and transfer categories is only one specific example; in the embodiments of this specification, other functions of the system may be tested, and test data sets containing text data of other categories may be used.

S332: Calculating the frequency of occurrence of each keyword in each text data in the first test data set.

In some embodiments, the keywords may be used as category names of the text data, or each preset keyword may correspond to a category of the text data. Specifically, taking the case where the plurality of preset keywords include "financing", "loan", "deposit" and "transfer" as an example, text data in the target data set may be classified into a financing category, a loan category, a deposit category and a transfer category.

In some embodiments, taking keywords including "financing", "loan" and "deposit" as an example, after a plurality of preset keywords are obtained, a keyword set $\{F_1, F_2, F_3, \dots, F_n\}$ may be established, wherein $F_1$ = "financing", $F_2$ = "loan", $F_3$ = "deposit", and $n = 1, 2, 3$.

S333: Judging, according to the frequency, whether the part of the first test result that does not meet the first expected result meets the second expected result.

In some embodiments, judging according to the frequency whether the part of the first test result that does not meet the first expected result meets the second expected result may include: determining, according to the frequency, the number of text data in which each keyword appears and the total frequency with which each keyword appears in the first test data set; and judging whether the total frequency with which each keyword appears in the first test data set meets the second expected result. If so, the test may be determined to pass; otherwise, the test does not pass.

In some embodiments, take testing the system's function of extracting high-frequency words as an example, where the first test data set includes text data in the financing, loan, deposit and transfer categories. Since the part of the first test result that does not meet the first expected result is the high-frequency word "debit card", the keyword is determined to be "debit card". The number of text data in which the keyword "debit card" appears, and the total frequency with which the keyword "debit card" appears in the first test data set, may be determined from the frequency matrices. In particular, if $f_1$ in the frequency matrix $\{f_1, f_2, f_3, \dots, f_n\}$ corresponds to the keyword "debit card", the number of frequency matrices in which $f_1$ is non-zero may be counted and taken as the number of text data in which the keyword "debit card" appears; the values of $f_1$ in all frequency matrices may also be added together to obtain the total frequency with which the keyword "debit card" appears in the first test data set. If the number of text data in which the keyword "debit card" appears is greater than a preset number and/or the total frequency with which the keyword "debit card" appears in the first test data set is greater than a preset frequency, it is determined that the keyword "debit card" is a high-frequency word in the first test data set, that the part of the first test result that does not meet the first expected result meets the second expected result, that the test passes, and that the system's function of extracting high-frequency words works well; otherwise, the test does not pass, and the system's function of extracting high-frequency words is faulty.
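The counting described above can be sketched as follows. This is a minimal sketch, assuming each text data is a plain string and the frequency of a keyword is its number of non-overlapping occurrences (`str.count`). The function name `is_high_frequency`, the parameter names and the threshold values are hypothetical, and the "and/or" condition is implemented here as "or".

```python
from typing import List


def is_high_frequency(
    keyword: str,
    texts: List[str],
    preset_number: int,
    preset_frequency: int,
) -> bool:
    """Judge whether `keyword` counts as a high-frequency word in `texts`."""
    # One frequency value per text data: occurrences of the keyword in it.
    frequencies = [text.count(keyword) for text in texts]
    # Number of text data in which the keyword appears (non-zero frequency).
    text_count = sum(1 for f in frequencies if f != 0)
    # Total frequency of the keyword across the whole test data set.
    total_frequency = sum(frequencies)
    # Assumption: "and/or" is read as "or" in this sketch.
    return text_count > preset_number or total_frequency > preset_frequency


texts = [
    "my debit card was lost",
    "apply for a debit card and activate the debit card",
    "transfer to a deposit account",
]
print(is_high_frequency("debit card", texts, preset_number=1, preset_frequency=5))
```

In this usage example the keyword appears in two texts with a total frequency of three, so the first threshold is exceeded while the second is not.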

The embodiment of the specification can acquire a first test data set; the first test dataset includes at least one category of text data; inputting the first test data set into a system to be tested, and obtaining a first test result; and if the first test result does not completely accord with the first expected result, determining that the test is passed if the part of the first test result which does not accord with the first expected result accords with the second expected result. The test method provided by the embodiment of the specification adopts a method for classifying the data according to the occurrence frequency of a plurality of preset keywords in the data to obtain the test data set, and adopts a test method combining two tests, so that the test efficiency and the test accuracy can be improved.

The embodiments of this specification further provide a scenario example, shown in fig. 4, which is a schematic diagram of the scenario example.

In this scenario example, testing the system's function of extracting high-frequency words is taken as an example: a target data set is acquired from a data source and classified to obtain classified text data, text data of at least one category is selected from the classified text data as a test data set, and the test data set is used to test the system's function of extracting high-frequency words. Specifically, the following steps may be included.

S1: a target dataset is acquired.

In this scenario example, the target data set may include text data and/or audio data.

S2: Desensitizing the target data set.

In this scenario example, the target data set obtained from the banking system typically involves customers' private data or commercially sensitive data. To protect this sensitive data, the data in the target data set may be modified before use; for example, personal information in the target data set such as identity card numbers, mobile phone numbers, card numbers and customer numbers may be desensitized.
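The specification does not say how desensitization is carried out. One possible approach, shown here purely as an assumption, is to mask long digit runs with a regular expression. The rule used (11 to 19 characters, all digits except an optional trailing X, keeping only the last four characters) and the names `SENSITIVE_NUMBER` and `desensitize` are hypothetical; a real system would need rules matched to its own data formats.

```python
import re

# Hypothetical rule: 11 to 19 characters, all digits except an optional
# trailing X (as an identity card number may have), count as sensitive.
SENSITIVE_NUMBER = re.compile(r"(?<!\d)\d{10,18}[\dXx](?!\d)")


def _mask(match: "re.Match[str]") -> str:
    value = match.group()
    # Keep the last four characters and replace the rest with asterisks.
    return "*" * (len(value) - 4) + value[-4:]


def desensitize(text: str) -> str:
    """Mask phone, card and ID-like numbers in one piece of text data."""
    return SENSITIVE_NUMBER.sub(_mask, text)


print(desensitize("Customer 13812345678 asked about card 6222020200112345678"))
```

Shorter numbers such as years or amounts are left untouched by this rule, which keeps the text usable for keyword counting after masking.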

S3: Judging whether the data in the target data set is text data.

If yes, S5 is performed, otherwise S4 is performed.

S4: the audio data is converted into corresponding text data.

In this scenario example, the content expressed in the audio data may be output in the form of text by means of speech recognition technology to obtain the corresponding text data, after which S5 is executed.
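The branch between S3, S4 and S5 can be sketched as a simple dispatch. This is an illustration only: it assumes text data is represented as `str` and audio data as `bytes`, and the speech recognizer is passed in as a hypothetical `transcribe` callable, since the specification does not name a particular speech recognition engine.

```python
from typing import Callable, List, Union

# Assumption: str represents text data and bytes represents audio data.
Record = Union[str, bytes]


def to_text_data(
    records: List[Record],
    transcribe: Callable[[bytes], str],
) -> List[str]:
    """Return text data for every record, converting audio where needed."""
    texts = []
    for record in records:
        if isinstance(record, str):
            texts.append(record)              # S3: already text, go on to S5
        else:
            texts.append(transcribe(record))  # S4: convert audio to text
    return texts


# Stand-in recognizer for demonstration; a real one would decode speech.
fake_transcribe = lambda audio: audio.decode("utf-8")
print(to_text_data(["loan inquiry", b"deposit inquiry"], fake_transcribe))
```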

S5: the text data is classified.

In this scenario example, a plurality of preset keywords "financing", "loan", "deposit", "transfer" and "remittance" may be acquired, and the frequency of occurrence of each preset keyword in each text data may be calculated.

In this scenario example, a preset keyword may be used as the category name of text data. According to the frequency, the category of each text data in the target data set may be determined to be one of pending, financing, loan, deposit, transfer, remittance and multi-category.
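Step S5 can be sketched end to end as follows: compute a frequency matrix per text, then group the texts by category. This is a minimal sketch, assuming English stand-ins for the preset keywords, simple substring counting with `str.count`, and a preset frequency of zero; the function names `frequency_matrix` and `group_by_category` are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

PRESET_KEYWORDS = ["financing", "loan", "deposit", "transfer", "remittance"]


def frequency_matrix(text: str, keywords: List[str]) -> List[int]:
    """Frequency of occurrence of each preset keyword in one text data."""
    return [text.count(keyword) for keyword in keywords]


def group_by_category(texts: List[str], keywords: List[str]) -> Dict[str, List[str]]:
    """Group text data into pending, per-keyword and multi-category buckets."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for text in texts:
        freqs = frequency_matrix(text, keywords)
        hits = [kw for kw, f in zip(keywords, freqs) if f > 0]
        if not hits:
            category = "pending"
        elif len(hits) == 1:
            category = hits[0]        # the preset keyword is the category name
        else:
            category = "multi-category"
        groups[category].append(text)
    return dict(groups)


texts = [
    "I want to repay my loan early",
    "move my deposit and make a transfer",
    "what are your opening hours",
]
for category, members in group_by_category(texts, PRESET_KEYWORDS).items():
    print(category, members)
```

Selecting the test data set then amounts to picking the lists stored under the chosen category names.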

S6: Selecting text data of at least one category in the classified target data set as a test data set.

In this scenario example, text data in the financing, loan, deposit and transfer categories may be used as the test data set.

S7: Testing the system's function of extracting high-frequency words.

Specifically, the test data set may be input into the system to obtain an output result.

S8: Judging whether the output result completely meets the first expected result.

In this scenario example, the first expected result is that the high-frequency words extracted by the system are "financing", "loan", "deposit" and "transfer". If the output result is that the system extracts the high-frequency words "financing", "loan", "deposit" and "transfer", the output result completely meets the first expected result, and the test passes; if the output result is that the system extracts the high-frequency words "financing", "loan", "deposit", "transfer" and "debit card", the output result does not completely meet the first expected result, and S9 may be performed.

S9: Determining a category name from the part of the output result that does not meet the first expected result.

In this scenario example, if the part of the output result that does not meet the first expected result is the high-frequency word "debit card" extracted by the system, then "debit card" may be determined as a keyword, and "debit card" may be used as a category name of the text data.

S10: Classifying the test data set according to the determined category name.

In this scenario example, the frequency of occurrence of the keyword "debit card" in each text data of the test data set may be calculated, and the text data in the test data set may be reclassified according to the frequency; the new categories may be pending, financing, loan, deposit, transfer, debit card and multi-category.

S11: it is determined whether the newly classified text data meets a second expected result.

In this scenario example, the second expected result may be derived from the part of the output result that does not meet the first expected result: the second expected result is that the keyword "debit card" is a high-frequency word in the test data set.

In this scenario example, the total frequency of occurrence of the keyword "debit card" in the text data of the debit card category and in the text data of the multi-category may be calculated, and whether the second expected result is met may be judged from the calculation result. Specifically, if the number of text data in the debit card category is greater than a preset number and/or the total frequency of occurrence of the keyword "debit card" in the text data of the multi-category is greater than a preset frequency, it is determined that the keyword "debit card" is a high-frequency word in the test data set, that the part of the output result that does not meet the first expected result meets the second expected result, that the test passes, and that the system's function of extracting high-frequency words works well; otherwise, the test does not pass, and the system's function of extracting high-frequency words is faulty.

The present specification embodiment also provides a computer-readable storage medium storing computer program instructions that when executed implement a data classification method: acquiring a target data set; the target dataset includes a plurality of text data; calculating the occurrence frequency of a plurality of preset keywords in each text data; and determining the category of each text data in the target data set according to the frequency.

In the present embodiment, the storage medium includes, but is not limited to, a random access memory (Random Access Memory, RAM), a read-only memory (Read-Only Memory, ROM), a cache (Cache), a hard disk (Hard Disk Drive, HDD), or a memory card (Memory Card). The memory may be used to store computer program instructions. In this embodiment, the functions and effects of the program instructions stored in the computer-readable storage medium may be understood by reference to the other embodiments and are not repeated here.

Referring to fig. 5, on a software level, the embodiment of the present disclosure further provides a data classification apparatus, which may specifically include the following structural modules.

An acquisition module 510 for acquiring a target data set; the target dataset includes a plurality of text data;

a calculating module 520, configured to calculate a frequency of occurrence of a plurality of preset keywords in each text data;

a classification module 530 is configured to determine a category of each text data in the target data set according to the frequency.

In some embodiments, the classification module 530 may include: the first classification sub-module is used for determining the category of the text data as a pending category under the condition that the occurrence frequency of each preset keyword is the preset frequency; the second classification sub-module is used for determining that the category of the text data is the category corresponding to the keyword with the occurrence frequency not being the preset frequency under the condition that the occurrence frequency of only one preset keyword is not the preset frequency and the occurrence frequency of other preset keywords is the preset frequency; and the third classification sub-module is used for determining that the category of the text data is multiple categories under the condition that the occurrence frequency of at least two preset keywords is not the preset frequency.

The present specification embodiments also provide a computer-readable storage medium of a test method, the computer-readable storage medium storing computer program instructions that, when executed, implement: acquiring a first test data set; the first test dataset includes at least one category of text data; inputting the first test data set into a system to be tested, and obtaining a first test result; and if the first test result does not completely accord with the first expected result, determining that the test is passed if the part of the first test result which does not accord with the first expected result accords with the second expected result.

Referring to fig. 6, at a software level, the embodiment of the present disclosure further provides a testing apparatus, which may specifically include the following structural modules.

An acquisition module 610 for acquiring a first test dataset; the first test dataset includes at least one category of text data;

a first testing module 620, configured to input the first test data set into a system to be tested, and obtain a first test result;

and a second test module 630, configured to, in a case where the first test result does not completely meet the first expected result, pass the test if a portion of the first test result that does not meet the first expected result meets a second expected result.

In some embodiments, the apparatus may further comprise: and the first determining module is used for determining that the test passes under the condition that the first test result completely accords with the first expected result.

In some embodiments, the apparatus may further comprise: and the second determining module is used for determining that the test does not pass under the condition that the first test result completely does not accord with the first expected result.

In some embodiments, the apparatus may further comprise: and a third determining module, configured to determine that the test does not pass if a portion of the first test result that does not meet the first expected result does not meet the second expected result.

It should be noted that, in the present specification, each embodiment is described in a progressive manner, and the same or similar parts of each embodiment are referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for the device embodiments and the apparatus embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments in part.

Those skilled in the art, after reading this specification, will recognize without undue burden that any and all of the embodiments set forth herein can be combined, and that such combinations are within the scope of the disclosure and protection of the present specification.

In the 1990s, an improvement to a technology could be clearly distinguished as an improvement in hardware (e.g., an improvement to a circuit structure such as a diode, transistor or switch) or an improvement in software (an improvement to a method flow). However, with the development of technology, many improvements to method flows today can be regarded as direct improvements to hardware circuit structures. Designers almost always obtain the corresponding hardware circuit structure by programming the improved method flow into a hardware circuit. Therefore, it cannot be said that an improvement to a method flow cannot be implemented by a hardware entity module. For example, a programmable logic device (Programmable Logic Device, PLD), such as a field programmable gate array (Field Programmable Gate Array, FPGA), is an integrated circuit whose logic function is determined by the user's programming of the device. A designer programs a digital system to "integrate" it onto a single PLD, without requiring a chip manufacturer to design and fabricate an application-specific integrated circuit chip. Moreover, instead of manually fabricating integrated circuit chips, such programming is nowadays mostly implemented with "logic compiler" software, which is similar to the software compiler used in program development; the source code to be compiled must also be written in a specific programming language, called a hardware description language (Hardware Description Language, HDL). There is not just one HDL but many, such as ABEL (Advanced Boolean Expression Language), AHDL (Altera Hardware Description Language), Confluence, CUPL (Cornell University Programming Language), HDCal, JHDL (Java Hardware Description Language), Lava, Lola, MyHDL, PALASM and RHDL (Ruby Hardware Description Language); VHDL (Very-High-Speed Integrated Circuit Hardware Description Language) and Verilog are the most commonly used at present.
It will also be apparent to those skilled in the art that a hardware circuit implementing the logical method flow can be readily obtained simply by lightly logic-programming the method flow in one of the hardware description languages described above and programming it into an integrated circuit.

The system, apparatus, module or unit set forth in the above embodiments may be implemented in particular by a computer chip or entity, or by a product having a certain function. One typical implementation is a computer. In particular, the computer may be, for example, a personal computer, a laptop computer, a cellular telephone, a camera phone, a smart phone, a personal digital assistant, a media player, a navigation device, an email device, a game console, a tablet computer, a wearable device, or a combination of any of these devices.

From the above description of embodiments, it will be apparent to those skilled in the art that the present description may be implemented in software plus a necessary general purpose hardware platform. Based on this understanding, the technical solution of the present specification may be embodied in essence or a part contributing to the prior art in the form of a software product, which may be stored in a storage medium, such as a ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the embodiments or some parts of the embodiments of the present specification.

In this specification, the embodiments are described in a progressive manner; for parts that are the same or similar across embodiments, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, since the system embodiments are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the corresponding parts of the description of the method embodiments.

The specification is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like.

This specification may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The specification may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.

Although the present specification has been described by way of example, it will be appreciated by those skilled in the art that there are many variations and modifications to the specification without departing from the spirit of the specification, and it is intended that the appended claims encompass such variations and modifications as do not depart from the spirit of the specification.

Claims

1. A method of classifying data, the method comprising:

acquiring a target data set; the target dataset includes a plurality of text data;

calculating the occurrence frequency of a plurality of preset keywords in each text data;

determining the category of each text data in the target data set according to the frequency;

the method further comprises the steps of: under the condition that the target data set at least comprises one type of text data, inputting the target data set into a system to be tested, and obtaining a first test result; if the first test result does not completely accord with the first expected result, determining that the test is passed if the part of the first test result which does not accord with the first expected result accords with the second expected result;

the method further comprises the steps of: judging whether the part, which is not matched with the first expected result, of the first test result meets a second expected result according to the following steps:

obtaining at least one keyword according to the part, which is not in accordance with the first expected result, of the first test result;

calculating the occurrence frequency of each keyword in each text data in the first test data set;

judging whether a part, which is not matched with the first expected result, of the first test result meets a second expected result according to the frequency so as to determine whether the function of extracting the high-frequency word of the system to be tested is good;

the determining whether the portion of the first test result, which does not conform to the first expected result, conforms to a second expected result according to the frequency includes: determining the number of text data of each keyword according to the frequency, and the total frequency of each keyword in the first test data set; and judging whether the total frequency of the occurrence of each keyword in the first test data set meets a second expected result or not.

2. The method of claim 1, wherein the target data set further comprises audio data; correspondingly, the method further comprises the steps of: the audio data is converted into text data.

3. The method of claim 1, wherein the category name of the text data includes the preset keyword.

4. The method of claim 1, wherein said determining the category of each text data in the target dataset based on the frequency comprises:

and under the condition that the occurrence frequency of each preset keyword is smaller than the preset frequency, determining the category of the text data as a pending category.

5. The method of claim 1, wherein said determining the category of each text data in the target dataset based on the frequency comprises:

and determining the category of the text data as the category corresponding to the keywords with the occurrence frequency greater than or equal to the preset frequency under the condition that the occurrence frequency of only one preset keyword is greater than or equal to the preset frequency and the occurrence frequency of other preset keywords is less than the preset frequency.

6. The method of claim 1, wherein said determining the category of each text data in the target dataset based on the frequency comprises:

and determining the category of the text data to be multi-category under the condition that the occurrence frequency of at least two preset keywords is greater than or equal to the preset frequency.

7. A method of testing, the method comprising:

acquiring a first test data set; the first test dataset includes at least one category of text data;

inputting the first test data set into a system to be tested, and obtaining a first test result;

if the first test result does not completely accord with the first expected result, determining that the test is passed if the part of the first test result which does not accord with the first expected result accords with the second expected result;

8. The method of claim 7, wherein the test is determined to pass if the first test result fully meets a first expected result.

9. The method of claim 7, wherein in the event that the first test result does not meet a first expected result at all, determining that the test fails.

10. The method of claim 7, wherein if the portion of the first test result that does not correspond to the first expected result does not correspond to a second expected result, determining that the test does not pass.

11. The method of claim 7, wherein the acquiring a first test data set comprises:

and acquiring text data of at least one category in the classified target data set as the first test data set.

12. A data classification apparatus, the apparatus comprising:

the acquisition module is used for acquiring a target data set; the target dataset includes a plurality of text data;

the calculating module is used for calculating the occurrence frequency of a plurality of preset keywords in each text data;

the classification module is used for determining the category of each text data in the target data set according to the frequency;

the device is also for: under the condition that the target data set at least comprises one type of text data, inputting the target data set into a system to be tested, and obtaining a first test result; if the first test result does not completely accord with the first expected result, determining that the test is passed if the part of the first test result which does not accord with the first expected result accords with the second expected result;

the device is further used for judging whether the part, which is not matched with the first expected result, of the first test result meets a second expected result according to the following steps:

The device is also for: determining the number of text data of each keyword according to the frequency, and the total frequency of each keyword in the first test data set; and judging whether the total frequency of the occurrence of each keyword in the first test data set meets a second expected result or not.
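The statistics named in the last step above (the number of text data containing each keyword, and each keyword's total frequency across the test data set) can be sketched as follows. The function names, the substring-based counting, and the reading of "meets the second expected result" as equality against expected totals are assumptions for illustration; the claim does not fix how the comparison is made.

```python
def keyword_statistics(texts, keywords):
    """Return (doc_counts, totals) for each keyword over a list of texts.

    doc_counts[kw] is the number of texts containing kw at least once;
    totals[kw] is the total number of occurrences of kw across all texts.
    """
    doc_counts = {kw: 0 for kw in keywords}
    totals = {kw: 0 for kw in keywords}
    for text in texts:
        for kw in keywords:
            n = text.count(kw)  # non-overlapping substring occurrences
            if n > 0:
                doc_counts[kw] += 1
                totals[kw] += n
    return doc_counts, totals


def totals_meet_expectation(totals, expected_totals):
    """Assumed check: every expected keyword total is reproduced exactly."""
    return all(totals.get(kw) == n for kw, n in expected_totals.items())


# Hypothetical test data set, for illustration only.
texts = ["loan loan card", "card", "hello"]
doc_counts, totals = keyword_statistics(texts, ["loan", "card"])
print(doc_counts, totals)
print(totals_meet_expectation(totals, {"loan": 2, "card": 2}))
```

Because the totals are computed from the classified data itself, they could serve as an independent reference value when the high-frequency word output of the system to be tested deviates from the first expected result.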

13. The apparatus of claim 12, wherein the classification module comprises:

a first classification sub-module configured to determine that the category of the text data is a pending category in the case where the occurrence frequency of each preset keyword is the preset frequency.

14. The apparatus of claim 12, wherein the classification module comprises:

a second classification sub-module configured to, in the case where the occurrence frequency of only one preset keyword is not the preset frequency and the occurrence frequencies of the other preset keywords are the preset frequency, determine that the category of the text data is the category corresponding to the keyword whose occurrence frequency is not the preset frequency.

15. The apparatus of claim 12, wherein the classification module comprises:

a third classification sub-module configured to determine that the category of the text data is multi-category in the case where the occurrence frequencies of at least two preset keywords are not the preset frequency.

16. A test apparatus, the apparatus comprising:

an acquisition module configured to acquire a first test data set, the first test data set including text data of at least one category;

a first test module configured to input the first test data set into a system to be tested and obtain a first test result;

a second test module configured to, in the case where the first test result does not fully match a first expected result, determine that the test passes if the portion of the first test result that does not match the first expected result matches a second expected result.

17. The apparatus of claim 16, wherein the apparatus further comprises:

a first determining module configured to determine that the test passes in the case where the first test result fully matches the first expected result.

18. The apparatus of claim 16, wherein the apparatus further comprises:

a second determining module configured to determine that the test fails in the case where the first test result does not match the first expected result at all.

19. The apparatus of claim 16, wherein the apparatus further comprises:

a third determining module configured to determine that the test fails if the portion of the first test result that does not match the first expected result does not match the second expected result.

20. A computer-readable storage medium having stored thereon computer program instructions that, when executed, implement: acquiring a target data set, the target data set including a plurality of text data; calculating the occurrence frequency of a plurality of preset keywords in each text data; and determining the category of each text data in the target data set according to the frequency; the computer program instructions, when executed, further implement: upon determining that the target data set includes text data of at least one category, inputting the target data set into a system to be tested and obtaining a first test result; and, in the case where the first test result does not fully match a first expected result, determining that the test passes if the portion of the first test result that does not match the first expected result matches a second expected result;

the computer program instructions, when executed, further implement: judging, according to the following steps, whether the portion of the first test result that does not match the first expected result meets the second expected result:

determining, according to the frequency, the number of text data containing each keyword and the total frequency of occurrence of each keyword in the first test data set; and judging whether the total frequency of occurrence of each keyword in the first test data set meets the second expected result.

21. A computer-readable storage medium having stored thereon computer program instructions that, when executed, implement: acquiring a first test data set, the first test data set including text data of at least one category; inputting the first test data set into a system to be tested and obtaining a first test result; and, in the case where the first test result does not fully match a first expected result, determining that the test passes if the portion of the first test result that does not match the first expected result matches a second expected result;

judging, according to the frequency, whether the portion of the first test result that does not match the first expected result meets the second expected result, so as to determine whether the high-frequency word extraction function of the system to be tested works correctly; the computer program instructions, when executed, further implement: determining, according to the frequency, the number of text data containing each keyword and the total frequency of occurrence of each keyword in the first test data set; and judging whether the total frequency of occurrence of each keyword in the first test data set meets the second expected result.