WO2019179010A1 - Data set acquisition method, classification method, apparatus, device, and storage medium - Google Patents
- Publication number
- WO2019179010A1 (PCT/CN2018/100779)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- quality
- text data
- message
- data
- quality check
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Definitions
- The present application relates to the field of data processing technologies, and in particular to a data set acquisition method, a classification method, an apparatus, a computer device, and a storage medium.
- A large amount of dialogue text may be generated between agents and customers, and this dialogue text is saved on the agent sales platform.
- The current approach is to randomly extract a certain amount of text content and analyze it manually, for example to find the non-compliant places in the dialogue text (also called violations, that is, places where there is an error), and then correct the non-compliance or have agents attend training.
- Random extraction followed by manual analysis is clearly inefficient.
- The volume of dialogue text is very large, so detecting as many violations as possible requires extracting more text content.
- The amount of manual quality inspection then grows as well, and manual quality inspection is very inefficient.
- The embodiments of the present application provide a data set acquisition method, a classification method, an apparatus, a computer device, and a storage medium, which can extract a data set with high accuracy and improve the accuracy of data classification.
- an embodiment of the present application provides a data set obtaining method, where the method includes:
- the embodiment of the present application further provides a method for classifying data sets, the method comprising:
- A classification model is trained using the data set extracted by the data set acquisition method of the first aspect above; the trained classification model classifies message-level dialogue text data that has not yet been quality-inspected to obtain quality inspection points, which are marked to obtain a quality inspection result; the quality inspection result is updated according to the user's modification request for the quality inspection points in the dialogue text data; the classification model is updated according to the updated data; and the updated classification model classifies message-level dialogue text data that has not yet been quality-inspected to obtain quality inspection points, which are marked to obtain the quality inspection result.
- an embodiment of the present application provides a data set obtaining apparatus, where the apparatus includes a unit for performing the data set acquiring method according to the above first aspect.
- the embodiment of the present application further provides an apparatus for utilizing data set classification, the apparatus comprising means for performing the method for utilizing data set classification according to the above first aspect.
- An embodiment of the present application provides a computer device that includes a memory and a processor connected to the memory. The memory is configured to store a computer program, and the processor is configured to run the computer program stored in the memory to perform the data set acquisition method of the first aspect described above or the method of using the data set classification described in the first aspect above.
- An embodiment of the present application provides a computer readable storage medium that stores a computer program. The computer program includes program instructions that, when executed by a processor, implement the data set acquisition method of the first aspect described above or the method of using the data set classification described in the first aspect above.
- In the embodiments of the present application, the full-text search engine marks the quality inspection result and the user then updates the result marked by the full-text search engine, which yields a more accurate data source. The data set is extracted from this data source according to the preset format, so the extracted data set is more accurate.
- The data set is used to train the classification model, and the trained classification model classifies message-level dialogue text data that has not yet been quality-inspected to obtain the quality inspection result. Combined with the user's update of the result produced by the classification model, a more accurate quality inspection result is obtained. The updated data is then used to update the classification model, and the updated classification model is used to classify quality inspection points, which improves the classification accuracy of the classification model.
- FIG. 1 is a schematic flowchart of a method for acquiring a data set according to an embodiment of the present application.
- FIG. 2 is a schematic diagram of a sub-flow of a method for acquiring a data set according to an embodiment of the present application.
- FIG. 3 is a schematic diagram of another sub-flow of a data set obtaining method according to an embodiment of the present application.
- FIG. 4 is a schematic flowchart of a method for classifying data sets according to an embodiment of the present application.
- FIG. 5 is a schematic block diagram of a data set obtaining apparatus according to an embodiment of the present application.
- FIG. 6 is a schematic block diagram of a marking unit provided by the embodiment of the present application.
- FIG. 7 is a schematic block diagram of an extracting unit provided by an embodiment of the present application.
- FIG. 8 is a schematic block diagram of an apparatus for classifying data sets according to an embodiment of the present application.
- FIG. 9 is a schematic block diagram of a computer device provided by an embodiment of the present application.
- Although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. The terms are only used to distinguish the elements from each other.
- For example, a first acquisition unit may be referred to as a second acquisition unit without departing from the scope of the present application, and similarly, the second acquisition unit may be referred to as a first acquisition unit.
- The first acquisition unit and the second acquisition unit are both acquisition units, but they are not the same acquisition unit.
- FIG. 1 is a schematic flowchart diagram of a method for acquiring a data set according to an embodiment of the present application. The method includes the following steps S101-S106.
- The message-level dialog text data is obtained from the agent sales platform; the dialog text data stores the dialogue text between the agent and the client.
- The dialog text data is at the message level, meaning that it is saved in units of the messages sent between the agent and the client.
- The dialog text data is composed of multiple pieces of message text data, and each piece of message text data includes a message number, sender, recipient, specific message content, time the message was sent, and so on.
- The preprocessing methods include replacement, filtering, and the like.
- Replacement includes replacing the English in the corresponding message text data in the dialog text data with Chinese, and so on; filtering includes removing the numbers, punctuation marks, emoticons, and garbled characters from the corresponding message text data in the dialog text data.
- The message text data in the dialog text data is preprocessed so that only the plain text of the specific message content is kept, which facilitates subsequent processing.
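The replacement and filtering steps could look roughly like the following sketch. It is only an illustration under stated assumptions: the function names and the replacement table are hypothetical, and the choice to keep only Unicode letters and spaces is one possible reading of "plain text", not something the application specifies.

```python
import re
import unicodedata

# Hypothetical English -> Chinese replacement table (an assumption for illustration).
REPLACEMENTS = {"ok": "好的"}


def replace_terms(text, table=REPLACEMENTS):
    """Replace whole English words using the table, case-insensitively."""
    for src, dst in table.items():
        # Lookarounds on ASCII letters so that English embedded in Chinese still matches.
        pattern = rf"(?<![A-Za-z]){re.escape(src)}(?![A-Za-z])"
        text = re.sub(pattern, dst, text, flags=re.IGNORECASE)
    return text


def filter_message(text):
    """Keep letters (including CJK) and spaces; drop digits, punctuation, symbols."""
    kept = []
    for ch in text:
        if unicodedata.category(ch).startswith("L"):
            kept.append(ch)   # any letter, e.g. Latin or CJK ideographs
        elif ch.isspace():
            kept.append(" ")  # normalise whitespace to a plain space
        # digits, punctuation, emoji and replacement characters are skipped
    return re.sub(r" +", " ", "".join(kept)).strip()


def preprocess(text):
    """Replacement first, then filtering."""
    return filter_message(replace_terms(text))
```

For instance, `preprocess("基金123, ok?")` would keep only the letters of the message after replacing the English word.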
- A quality inspection point can be understood as a place of non-compliance or violation, that is, a place where there is a mistake.
- Each quality inspection point has a quality inspection point identifier, such as A47, which represents the 47th quality inspection point in category A.
- The rule corresponding to a quality inspection point includes keywords and logical operations.
- An example of a quality inspection point and its corresponding rule: A47, fund and dividend.
- Here the keywords are "fund" and "dividend", and the logical operation is "and".
- The A47 quality inspection point indicates that if both "fund" and "dividend" appear in a message, the message is considered a violation. In other words, funds do not involve dividends, so an agent talking about a fund is expected not to mention dividends.
- The full-text search engine refers to the ElasticSearch (ES for short) search engine.
- ES takes the keywords and combines the `must`, `should`, and `must_not` clauses provided by the ES API to implement the rule corresponding to each quality inspection point. Using that rule as the query condition, it searches the preprocessed dialog text data, finds the matching quality inspection points, and marks them to obtain the ES quality inspection point results, which are used as the quality inspection results. The mark is the quality inspection point identifier, and the ES quality inspection point result is the quality inspection point obtained by the ES query match.
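As a rough illustration, a rule such as A47 could be turned into an Elasticsearch bool query body as sketched below. The helper name `build_rule_query`, the `RULES` table, and the field name `content` are assumptions made for this example; only the `bool`, `must`, `should`, `must_not`, `minimum_should_match`, and `match_phrase` keys are standard Elasticsearch query DSL. The sketch only builds the request body and does not contact a cluster.

```python
def build_rule_query(must=(), should=(), must_not=(), field="content"):
    """Build an Elasticsearch bool query body from keyword lists.

    `field` is a hypothetical name for the indexed message-content field.
    """
    def clauses(words):
        return [{"match_phrase": {field: w}} for w in words]

    bool_body = {}
    if must:                       # every keyword has to appear ("and")
        bool_body["must"] = clauses(must)
    if should:                     # at least one keyword has to appear ("or")
        bool_body["should"] = clauses(should)
        bool_body["minimum_should_match"] = 1
    if must_not:                   # none of the keywords may appear ("not")
        bool_body["must_not"] = clauses(must_not)
    return {"query": {"bool": bool_body}}


# Hypothetical rule table: quality inspection point identifier -> keyword logic.
RULES = {"A47": {"must": ["fund", "dividend"]}}

QUERIES = {point: build_rule_query(**rule) for point, rule in RULES.items()}
```

Each message returned for `QUERIES["A47"]` would then be marked with the identifier A47.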
- the step S103 includes the following steps S201-S203.
- S201 Perform word segmentation on the pre-processed dialog text data.
- The specific message content of each piece of message text data in the pre-processed dialog text data is divided into multiple words by the word segmenter in the full-text search engine. For example, the message "I came to Beijing Tsinghua University" is segmented into "I / came to / Beijing / Tsinghua University".
- S202 Create an inverted index on the segmented data. Specifically, count the number of occurrences and the positions of each segmented word in the dialog text data, and build the inverted index from those counts and positions. For example, count the occurrences and positions of the word "dividend" in the dialog text data, where a position identifies which dialog text data table and which message text data (which can be represented by a message number) the word appears in, and the like.
- The inverted index is a storage form that implements the "word-document matrix". With an inverted index, the "document list" containing a given word can be obtained quickly. For the dialog text data, the inverted index makes it possible to quickly obtain, from a segmented word, the message text data containing that word, that is, which message text data the word appears in.
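The idea of an inverted index can be sketched in a few lines. This is a minimal in-memory illustration, not how ElasticSearch stores its index; the function names and the input shape (message number mapped to a list of already segmented words) are assumptions.

```python
from collections import defaultdict


def build_inverted_index(messages):
    """Map each word to {message_number: [positions of the word in that message]}.

    `messages` maps a message number to its list of segmented words.
    """
    index = defaultdict(dict)
    for msg_no, tokens in messages.items():
        for pos, word in enumerate(tokens):
            index[word].setdefault(msg_no, []).append(pos)
    return index


def lookup_all(index, words):
    """Return the message numbers that contain every one of `words`."""
    sets = [set(index.get(w, {})) for w in words]
    return set.intersection(*sets) if sets else set()


messages = {
    1: ["this", "fund", "pays", "dividend"],
    2: ["fund", "is", "stable"],
    3: ["dividend", "dividend"],
}
index = build_inverted_index(messages)
hits = lookup_all(index, ["fund", "dividend"])  # messages matching an "and" rule
```

The number of occurrences of a word in a message is simply the length of its position list, so both the count and the positions mentioned in S202 are available from the same structure.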
- the quality inspection point mark is performed.
- Marking in the dialog text data means placing the corresponding quality inspection point identifier after each piece of message text data in the dialog text data.
- Session-level session text data is data stored in units of a conversation (session) between the agent and the client, that is, the session text data stores multiple pieces of conversation data between the agent and the client, and each piece of conversation data can include a conversation number and the conversation content.
- The marked dialog text data including the quality inspection points is integrated into session-level session text data including the quality inspection points. The integration process includes: finding the sender and recipient of each piece of message text data in the marked dialog text data and treating the sender and recipient as a set; grouping the message text data in the dialog text data by that set; and sorting the data of each group by message sending time and displaying it in a predetermined format, to form the session-level session text data including the quality inspection points.
- the predetermined format may be: sender; recipient; conversation content; ES quality checkpoint result.
- The multiple pieces of message text data in the conversation content may be displayed in the format: time of the message [space] specific message content. For example: 2017-01-01 12:01:02 Teacher Li, is it?
- Each message text data corresponds to an ES quality checkpoint result.
- the date can also be included in the predetermined format, which is the date of the quality inspection.
- In other words, the session-level session text data including the quality inspection points consists of the marked message-level message text data, grouped by sender and recipient information and sorted in time order.
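The grouping and sorting described above can be sketched as follows. The dictionary keys (`sender`, `recipient`, `time`, `content`, `es_result`) and the function name are assumptions for illustration, and timestamps are assumed to be strings in `YYYY-MM-DD HH:MM:SS` form so that string order equals time order.

```python
from collections import defaultdict


def integrate_sessions(messages):
    """Group marked messages by the {sender, recipient} set and sort each group by time."""
    groups = defaultdict(list)
    for m in messages:
        # A frozenset ignores direction, so A->B and B->A fall into the same session.
        groups[frozenset((m["sender"], m["recipient"]))].append(m)

    sessions = []
    for participants, msgs in groups.items():
        msgs.sort(key=lambda m: m["time"])
        sessions.append({
            "participants": sorted(participants),
            # "time of the message [space] specific message content"
            "content": [f'{m["time"]} {m["content"]}' for m in msgs],
            # one ES quality inspection point result per message, in the same order
            "es_results": [m["es_result"] for m in msgs],
        })
    return sessions
```

Keeping `content` and `es_results` as parallel lists preserves the one-result-per-message correspondence that the predetermined format requires.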
- The marked message-level dialog text data including the quality inspection points and the session-level session text data including the quality inspection points are stored in a database, such as an Oracle database, in the form of data tables.
- The marked message-level dialog text data including the quality inspection points and the session-level session text data including the quality inspection points may be saved as separate data tables or as a single data table.
- The integrated session text data including the quality inspection points is shown in Table 1. Note that Table 1 is only an example. Table 1 may include multiple pieces of session text data, each of which includes a date (the quality inspection date), a sender, a recipient, conversation content, and an ES quality inspection point result. The conversation content includes multiple pieces of message text data, and each piece corresponds to one ES quality inspection point result. For example, the specific message content of one piece of message text data is: "In the end, is there any good product?"
- The ES quality inspection point result for the specific message content of this piece of message text data is empty, indicating that the content is qualified/compliant.
- Where the ES quality inspection point result for the specific message content of a piece of message text data is A45, that content is in violation, specifically of the content of the A45 quality inspection point.
- The user's authority is obtained, for example according to the user's account and password; it is determined whether the current user's authority is a preset authority, where a user with the preset authority may update the quality inspection result; if so, the quality inspection result is updated according to the modification request made by that user for the quality inspection points in the session text data.
- the user who meets the preset authority can view the specific content of the session text message, the sender, the recipient, the date, the ES quality checkpoint result, and the options that the user can modify.
- When the user opens a page containing the specific content of the session text message, the sender, the recipient, the date, the ES quality inspection point result, and the options that the user can modify, this can be understood as receiving the user's modification request for the quality inspection points in the session text data.
- The options that the user can modify include the manual quality inspection point result, the quality inspection violation remark, and compliance. These items are empty before the user modifies them. The user can edit them according to the actual situation to update the quality inspection results.
- The manual quality inspection point result is expressed with quality inspection point identifiers. The quality inspection violation remark contains the specific text content corresponding to each quality inspection point (violation point) and the reason for the violation; the reason is written in parentheses after the specific text content corresponding to that quality inspection point. Users with the preset authority can also change the current quality inspection point to qualified/compliant (no error).
- The manual quality inspection point result is taken as the updated quality inspection point result, and the final quality inspection point result is the updated one. Even if the manual result agrees with the ES quality inspection point result, the ES result must still be filled in as the manual result. If the manual quality inspection point result is empty, the quality inspection point result of the message is qualified/compliant.
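The rule for the final result is small enough to state as code. In this sketch the function name is hypothetical, the manual result is assumed to be a comma-separated string of identifiers, and compliance is represented by the identifier `good`, which the description mentions later as one possible compliance identifier.

```python
COMPLIANT = "good"  # compliance identifier; the description allows other identifiers too


def final_result(es_result, manual_result):
    """Return the final quality inspection points for one message.

    The manual result always wins; the ES result is kept as a parameter only to
    make explicit that it is not consulted once a reviewer has processed the message.
    """
    points = [p.strip() for p in manual_result.split(",") if p.strip()]
    return points if points else [COMPLIANT]
```

For the example message discussed in this section, an ES result of A45 and a manual result of "A42, A45" would give the final points A42 and A45.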
- Each session text data table additionally has the options for the manual quality inspection point result, the quality inspection violation remark, and compliance. Note that these options are visible to users with the preset authority after they open the corresponding table.
- For example, the specific message content of a piece of message text data is: "Yes, this product is called xxx, the product interest rate is 5 to 7 percent, and after six months, the loan can be 10 times the total amount of the loan."
- The ES quality inspection point result for this specific message content is: A45.
- The manual quality inspection point result is: A42, A45.
- The corresponding quality inspection violation remark is: "Yes, this product is called xxx, the interest rate of this product is 5 to 7 percent (the product interest rate information is wrong, it is 4 to 8 percent); after six months, the loan can be 10 times the total amount of the loan (the loan amount is not limited)", which gives the violation remarks for A42 and A45 respectively.
- The quality inspection content and the data content fully related to a quality inspection point can be viewed in a pre-stored file.
- The view command can be generated by clicking the view button.
- The pre-stored file stores the content of all quality inspection points and the text messages containing the correct data related to those quality inspection points.
- According to the ES quality inspection point result, the content of the corresponding quality inspection point and the text messages fully related to it are looked up in the pre-stored file. This improves the efficiency and accuracy of manual quality inspection and makes quality inspection violations easier to annotate.
- The preset format includes: the specific message content corresponding to the message text data, the quality inspection point result of that specific message content, and the quality inspection violation remark.
- Here, the quality inspection point result of the specific message content corresponding to each piece of message text data is the updated quality inspection point result.
- step S106 includes the following steps S301-S304.
- each violation point needs to be separated to facilitate subsequent analysis of each quality inspection point, such as using the extracted data set for classification.
- The first quality inspection point in the manual quality inspection point result is used as the quality inspection point result of the specific message content of one piece of message text data; the content between the previous parenthesis pair and the next parenthesis pair is used as the specific message content of another piece of message text data, the content inside that next parenthesis pair is used as the quality inspection violation remark of that other piece, and the next quality inspection point in the manual quality inspection point result is used as the quality inspection point result of the specific message content of that other piece.
- In this way, the specific message content and the quality inspection violation remarks of message text data with multiple quality inspection points are separated according to those quality inspection points and matched to them one by one, forming data that has a single quality inspection point as its quality inspection point result and conforms to the preset format.
- If the quality inspection point result is empty, the quality inspection point result of the specific message content corresponding to the message text data is marked with a compliance identifier.
- Compliance can be expressed by the identifier good, or by another identifier.
- The extracted data includes the data obtained by separating multiple quality inspection points, the data whose quality inspection point result is compliant, and the data that has a single quality inspection point result.
- Table 3 shows a display example of the extracted data set. It should be noted that Table 3 is just an example.
- the data set includes a plurality of pieces of data, each of which includes a specific message content corresponding to the message text data, a quality check point result corresponding to the specific message content (quality check mark), and a quality check violation note.
- The first piece of data: the specific message content is "Yes, this product is called xxx, the interest rate of this product is 5 to 7 percent"; the quality inspection point corresponding to the message content is A42; the quality inspection violation remark is "the product interest rate information is wrong, it is 4 to 8 percent".
- The second piece of data: the specific message content is "after six months, the loan can be 10 times the total amount of the loan"; the quality inspection point corresponding to the message content is A45; the quality inspection violation remark is "the loan amount is not limited".
- The third piece of data: the specific message content is "there is a good product, whether you want to see"; the quality inspection point corresponding to the message content is good, identifying the message as compliant.
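The separation in steps S301-S304 can be sketched with a regular expression over the violation remark. This is an illustration under assumptions: the function and key names are hypothetical, the manual result is a comma-separated string, the parentheses are ASCII and not nested, and there is exactly one parenthesis pair per quality inspection point.

```python
import re

COMPLIANT = "good"
# Text before a parenthesis pair, then the reason inside the pair.
PAIR = re.compile(r"([^()]*)\(([^()]*)\)")


def extract_rows(content, manual_result, remark):
    """Turn one reviewed message into data set rows with one result each."""
    points = [p.strip() for p in manual_result.split(",") if p.strip()]
    if not points:
        # Empty result: the whole message is marked with the compliance identifier.
        return [{"content": content, "result": COMPLIANT, "remark": ""}]

    pieces = PAIR.findall(remark)
    if len(pieces) != len(points):
        raise ValueError("quality inspection points and parenthesis pairs do not match")

    rows = []
    for point, (text, reason) in zip(points, pieces):
        rows.append({
            "content": text.strip(" ;,"),  # drop separators left between pieces
            "result": point,
            "remark": reason.strip(),
        })
    return rows
```

Raising an error on a count mismatch is a design choice of this sketch; it surfaces remarks that a reviewer wrote without the expected parentheses instead of silently mispairing them.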
- FIG. 4 is a schematic flowchart diagram of a method for classifying data sets according to an embodiment of the present application. As shown in FIG. 4, the method includes S401-S410. The steps S401-S406 correspond to the steps of the embodiment shown in FIG. 1, and details are not described herein again. Only steps S407-S410 are described below.
- The classification model can be any multi-class classification model, such as a long short-term memory (LSTM) neural network model, a random forest classification model, and the like.
- The process of training the classification model includes: acquiring the data set; segmenting the text information in the data set with a word segmentation tool; processing the segmented data with a preset word vector model to obtain the corresponding word vectors; and training the neural network model on the word vectors and the corresponding quality inspection points in the data set.
- The word segmentation tool can be a word segmenter used in its precise mode, which divides the text information in the data set into multiple words. For example, the message "I came to Beijing Tsinghua University" is segmented into "I / came to / Beijing / Tsinghua University".
- The word vector (word embedding) model refers to the word2vec word vector model of gensim. Word2vec is essentially a shallow neural network. It can be trained efficiently on vocabularies of millions of words and data sets of hundreds of millions of items; the training result, the word vectors, measures word-to-word similarity well.
- the preset word vector model can be obtained through pre-training.
- the neural network model is trained based on the word vector and the corresponding quality checkpoint in the data set.
- The trained classification model is used to classify message-level dialogue text data that has not yet been quality-inspected, to obtain quality inspection points and mark them, yielding the quality inspection point results.
- Message-level conversation text data that has not yet been quality-inspected is classified by the updated classification model to obtain quality inspection points, which are marked to obtain a quality inspection result.
- In this embodiment, the extracted data set is used to train the classification model; the trained classification model classifies the dialog text data to obtain quality inspection points; the quality inspection result is updated according to the user's modification request for the quality inspection points in the conversation text data; the classification model is updated according to the updated data; and the updated classification model classifies the dialog text data to obtain quality inspection points.
- The embodiment thereby obtains a relatively accurate quality inspection result, then uses the updated data to update the classification model and uses the updated model to classify quality inspection points, so that the updated model classifies more accurately, which improves the classification accuracy of the classification model.
- This embodiment incorporates human intelligence to form a human-in-the-loop hybrid intelligence paradigm, which raises the level of machine intelligence.
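The train, classify, review, and update cycle can be sketched as below. To keep the example self-contained, the classifier is a deliberately simple token-count stand-in and is not the word2vec plus neural network pipeline described in this embodiment; the class name, function name, and the whitespace tokenisation are all assumptions.

```python
from collections import Counter, defaultdict

COMPLIANT = "good"


class TokenVoteClassifier:
    """Toy stand-in model: scores a label by how often it was seen with each token."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # label -> token frequencies

    def fit(self, rows):
        """Retrain from scratch on (text, label) pairs."""
        self.counts = defaultdict(Counter)
        for text, label in rows:
            self.counts[label].update(text.lower().split())
        return self

    def predict(self, text):
        tokens = text.lower().split()
        scores = {label: sum(c[t] for t in tokens) for label, c in self.counts.items()}
        best = max(scores, key=scores.get, default=None)
        # With no overlapping token at all, fall back to "compliant".
        return best if best is not None and scores[best] > 0 else COMPLIANT


def review_cycle(model, dataset, uninspected, corrections):
    """One human-in-the-loop round.

    `corrections` maps a message text to the label chosen by the reviewer;
    messages without an entry keep the model's label.
    """
    predicted = [(text, model.predict(text)) for text in uninspected]
    reviewed = [(text, corrections.get(text, label)) for text, label in predicted]
    dataset = dataset + reviewed   # the updated data
    model.fit(dataset)             # update the classification model
    return predicted, dataset
```

Each call to `review_cycle` corresponds to one pass through classification, user update, and model update; repeating it on new uninspected messages grows the data set with reviewer-confirmed labels.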
- FIG. 5 is a schematic block diagram of a data set obtaining apparatus according to an embodiment of the present application.
- the apparatus 50 includes an obtaining unit 501, a pre-processing unit 502, a marking unit 503, an integrating unit 504, a quality inspection updating unit 505, and an extracting unit 506.
- the obtaining unit 501 is configured to acquire dialog text data of a message level.
- the pre-processing unit 502 is configured to preprocess the message text data of the message level.
- The marking unit 503 is configured to use a full-text search engine to query the pre-processed dialog text data, according to the preset quality inspection points and their corresponding rules, for the quality inspection points that match those rules, and to mark them to obtain the quality inspection results.
- the marking unit 503 includes a word segmentation unit 601, an index unit 602, and a matching tag unit 603.
- the word segmentation unit 601 is configured to segment the pre-processed dialog text data.
- the indexing unit 602 is configured to establish an inverted index on the data after the word segmentation.
- The matching tag unit 603 is configured to use the established inverted index and the full-text search engine to query the pre-processed dialog text data, according to the preset quality inspection points and their corresponding rules, for the quality inspection points that match those rules, and to mark them.
- the integration unit 504 is configured to integrate the marked dialog text data including the quality checkpoint into session text data including a session level of the quality checkpoint.
- To integrate the marked dialog text data including the quality inspection points into session-level session text data including the quality inspection points, the integration unit 504 includes a set search unit, a grouping unit, and a sorting display unit.
- The set search unit is configured to find the sender and recipient of each piece of message text data in the marked dialog text data including the quality inspection points, and to treat the sender and recipient as a set.
- The grouping unit is configured to group the message text data in the dialog text data according to the set.
- The sorting display unit is configured to sort the data of each group by message sending time and display it in a predetermined format, to form the session-level session text data including the quality inspection points.
- the quality check update unit 505 is configured to update the quality check result according to the user's modification request for the quality check point in the session text data.
- the extracting unit 506 extracts the data set from the updated data according to a preset format.
- The extracting unit 506 includes a data judging unit 701, a separating unit 702, an adding marking unit 703, and a data set extracting unit 704.
- The data judging unit 701 is configured to determine, for each piece of message text data, whether the specific message content corresponding to the message text data has multiple quality inspection point results or whether the quality inspection point result is empty.
- The separating unit 702 is configured to, if there are multiple quality inspection point results, separate the specific message content of the message text data and the corresponding quality inspection violation remarks according to the multiple quality inspection points and match each part to its quality inspection point, forming data that has a single quality inspection point as its quality inspection point result and conforms to the preset format.
- The adding marking unit 703 is configured to, if the quality inspection point result is empty, mark the quality inspection point result of the specific message content corresponding to the message text data with a compliance identifier.
- The data set extracting unit 704 is configured to extract the specific message content corresponding to the message text data, the quality inspection point result corresponding to that specific message content, and the quality inspection violation remark as the data set.
- The extracted data includes the data obtained by separating multiple quality inspection points, the data whose quality inspection point result is compliant, and the data that has a single quality inspection point result.
- the extracted data set can be seen in the example of FIG.
- FIG. 8 is a schematic block diagram of an apparatus for classifying data sets according to an embodiment of the present application.
- The apparatus 80 includes an obtaining unit 801, a pre-processing unit 802, a marking unit 803, an integrating unit 804, a quality inspection updating unit 805, an extracting unit 806, a classification unit 807, and a model updating unit 808.
- the difference between this embodiment and the embodiment shown in FIG. 5 is that the classification unit 807 and the model update unit 808 are added.
- Others, such as the obtaining unit 801, the pre-processing unit 802, the marking unit 803, the integrating unit 804, the quality checking and updating unit 805, and the extracting unit 806, may refer to the description of the embodiment of FIG. 5, and details are not described herein again.
- the classification unit 807 and the model update unit 808 will be described below.
- the classification unit 807 is configured to train the classification model by using the extracted data set, and classify the unconfirmed message level conversation text data by using the trained classification model to obtain a quality inspection point and mark the same to obtain a quality inspection point result. .
- the quality check update unit 805 is further configured to update the quality check result according to the user's modification request for the quality check point in the session text data.
- the model updating unit 808 is configured to update the classification model according to the updated data.
- the classification unit 807 is further configured to use the updated classification model to classify and mark the unconfirmed message level conversation text data to obtain a quality check point.
- an apparatus for utilizing data set classification further includes a unit corresponding to the corresponding method embodiment.
- the above apparatus may be embodied in the form of a computer program that can be run on a computer device as shown in FIG.
- FIG. 9 is a schematic block diagram of a computer device according to an embodiment of the present application.
- The computer device 90 may be a portable device such as a mobile phone or a tablet, or a non-portable device such as a desktop computer.
- The device 90 includes a processor 902, a memory, and a network interface 903 connected by a system bus 901, where the memory may include a non-volatile storage medium 904 and an internal memory 905.
- The non-volatile storage medium 904 can store an operating system 9041 and a computer program 9042.
- When the computer program 9042 is executed, it can cause the processor 902 to perform the data set acquisition method.
- The processor 902 provides computing and control capabilities and supports the operation of the entire device 90.
- The internal memory 905 provides an environment for running the computer program stored in the non-volatile storage medium; when executed by the processor 902, the computer program causes the processor 902 to perform the data set acquisition method.
- The network interface 903 is used for network communication, such as acquiring data.
- Those skilled in the art will understand that the structure shown in FIG. 9 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the device 90 to which the solution is applied.
- A specific device 90 may include more or fewer components than shown in the figure, combine some components, or have a different arrangement of components.
- The processor 902 is configured to run the computer program stored in the memory to implement any of the foregoing embodiments of the data set acquisition method.
- In another embodiment, when the computer program 9042 is executed, it can cause the processor 902 to perform the method for classification using a data set.
- The processor 902 provides computing and control capabilities and supports the operation of the entire device 90.
- The internal memory 905 provides an environment for running the computer program stored in the non-volatile storage medium; when executed by the processor 902, the computer program can cause the processor 902 to perform the method for classification using a data set.
- The network interface 903 is used for network communication.
- The processor 902 is configured to run the computer program stored in the memory to implement any of the foregoing embodiments of the method for classification using a data set.
- The processor 902 or 102 may be a central processing unit (CPU); it may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, or a discrete gate or transistor logic device.
- The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
- Another embodiment of the present application provides a computer-readable storage medium storing a computer program, the computer program including program instructions that, when executed by a processor, implement any of the foregoing embodiments of the data set acquisition method.
- Another embodiment of the present application further provides a computer-readable storage medium storing a computer program, the computer program including program instructions that, when executed by a processor, implement any of the foregoing embodiments of the method for classification using a data set.
- The computer-readable storage medium may be an internal storage unit of the terminal described in any of the foregoing embodiments, such as a hard disk or a memory of the terminal.
- The computer-readable storage medium may also be an external storage device of the terminal, such as a plug-in hard disk, a Smart Media Card (SMC), or a Secure Digital (SD) card fitted to the terminal.
- Further, the computer-readable storage medium may include both an internal storage unit of the terminal and an external storage device.
- In the embodiments provided in the present application, it should be understood that the disclosed terminal and method may be implemented in other ways.
- For example, the terminal embodiment described above is only illustrative.
- For example, the division into units is only a division by logical function, and an actual implementation may divide them differently.
- A person skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the terminal and the units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
- The foregoing is only a specific implementation of the present application, but the scope of protection of the present application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present application, and these modifications or substitutions shall fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be determined by the scope of protection of the claims.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
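Embodiments of the present application provide a data set acquisition method, a method for classification using a data set, an apparatus, a computer device, and a storage medium. The data set acquisition method includes: acquiring message-level dialogue text data and preprocessing it; according to preset quality check points and the rules corresponding to the quality check points, using a full-text search engine to query the preprocessed dialogue text data for quality check points that match the rules and marking them to obtain a quality check result; integrating the marked dialogue text data including quality check points into session-level session text data including quality check points; updating the quality check result according to a user's modification request for the quality check points in the session text data; and extracting a data set from the updated data according to a preset format.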
Description
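This application claims priority to Chinese patent application No. 201810241227.3, filed with the Chinese Patent Office on March 22, 2018 and entitled "Data set acquisition method, classification method, apparatus, device, and storage medium", the entire contents of which are incorporated herein by reference.
The present application relates to the field of data processing technologies, and in particular to a data set acquisition method, a method for classification using a data set, an apparatus, a computer device, and a storage medium.
During agent sales, a large volume of dialogue text may be generated with customers, and this text is stored on the agent sales platform. The current approach is to randomly sample a certain number of texts and then analyze them manually, for example to find the non-compliant places in the dialogue text (also called violations, that is, places where an error exists), in order to correct them or to train the agents. Random sampling followed by manual analysis is clearly inefficient. On the one hand, if the volume of dialogue text is very large, detecting as many non-compliant places as possible requires sampling more text, which increases the amount of manual quality checking, and manual quality checking is very slow. On the other hand, because the machine samples only part of the text at random, a large amount of text is skipped, and that text may contain many non-compliant places. Processing the text with artificial intelligence algorithms or models to improve efficiency requires a large amount of accurate data as support, for example to train a model with good generalization ability. A large amount of accurate data is therefore the key to processing text content with artificial intelligence algorithms or models.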
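Summary of the Invention
Embodiments of the present application provide a data set acquisition method, a method for classification using a data set, an apparatus, a computer device, and a storage medium, which can extract a data set with relatively high accuracy and can improve the accuracy of data classification.
In a first aspect, an embodiment of the present application provides a data set acquisition method, the method including:
acquiring message-level dialogue text data; preprocessing the message-level dialogue text data; according to preset quality check points and the rules corresponding to the quality check points, using a full-text search engine to query the preprocessed dialogue text data for quality check points that match the rules and marking them to obtain a quality check result; integrating the marked dialogue text data including quality check points into session-level session text data including quality check points; updating the quality check result according to a user's modification request for the quality check points in the session text data; and extracting a data set from the updated data according to a preset format.
An embodiment of the present application further provides a method for classification using a data set, the method including:
training a classification model with the data set extracted by the data set acquisition method of the first aspect, and using the trained classification model to classify message-level dialogue text data that has not been quality-checked, obtaining quality check points and marking them to obtain a quality check result; updating the quality check result according to a user's modification request for the quality check points in the dialogue text data; updating the classification model according to the updated data; and using the updated classification model to classify message-level dialogue text data that has not been quality-checked, obtaining quality check points and marking them to obtain a quality check result.
In a second aspect, an embodiment of the present application provides a data set acquisition apparatus, which includes units for performing the data set acquisition method of the first aspect.
An embodiment of the present application further provides an apparatus for classification using a data set, which includes units for performing the method for classification using a data set of the first aspect.
In a third aspect, an embodiment of the present application provides a computer device, which includes a memory and a processor connected to the memory; the memory is configured to store a computer program, and the processor is configured to run the computer program stored in the memory so as to perform the data set acquisition method of the first aspect or the method for classification using a data set of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium storing a computer program, the computer program including program instructions that, when executed by a processor, implement the data set acquisition method of the first aspect or the method for classification using a data set of the first aspect.
By using a full-text search engine to mark the quality check results and combining this with the user's updates to those results, the embodiments of the present application obtain a relatively accurate data source, from which a data set is then extracted according to a preset format; the extracted data set is therefore relatively accurate. The data set is used to train a classification model, and the trained model classifies message-level dialogue text data that has not been quality-checked to obtain quality check results; combined with the user's updates to the results produced by the model, relatively accurate quality check results are obtained. The updated data is then used to update the classification model, and the updated model is used for quality check point classification, which improves the classification accuracy of the model.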
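FIG. 1 is a schematic flowchart of a data set acquisition method according to an embodiment of the present application;
FIG. 2 is a schematic flowchart of a sub-process of a data set acquisition method according to an embodiment of the present application;
FIG. 3 is a schematic flowchart of another sub-process of a data set acquisition method according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of a method for classification using a data set according to an embodiment of the present application;
FIG. 5 is a schematic block diagram of a data set acquisition apparatus according to an embodiment of the present application;
FIG. 6 is a schematic block diagram of a marking unit according to an embodiment of the present application;
FIG. 7 is a schematic block diagram of an extracting unit according to an embodiment of the present application;
FIG. 8 is a schematic block diagram of an apparatus for classification using a data set according to an embodiment of the present application;
FIG. 9 is a schematic block diagram of a computer device according to an embodiment of the present application.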
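The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present application without creative effort fall within the scope of protection of the present application.
In the present application, it should be understood that although the terms first, second, and so on may be used herein to describe various elements, these elements should not be limited by these terms, which serve only to distinguish the elements from one another. For example, without departing from the scope of the present application, a first obtaining unit could be called a second obtaining unit, and similarly a second obtaining unit could be called a first obtaining unit. The first and second obtaining units are both obtaining units, but they are not the same obtaining unit.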
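FIG. 1 is a schematic flowchart of a data set acquisition method according to an embodiment of the present application. The method includes the following steps S101-S106.
S101: acquire message-level dialogue text data.
The message-level dialogue text data is obtained from the agent sales platform and holds the dialogue text between agents and customers. The dialogue text data is message-level, meaning that it is stored with each message sent between an agent and a customer as one unit. It consists of many pieces of message text data, each of which includes a message number, a sender, a receiver, the specific message content, the time the message was sent, and so on.
S102: preprocess the message-level dialogue text data.
Preprocessing includes replacement, filtering, and the like. Replacement includes replacing English in the message text data with Chinese; filtering includes removing digits, punctuation marks, emoticons, and garbled characters from the message text data. The message text data is preprocessed so that only the plain-text part of the specific message content is retained, which simplifies subsequent processing.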
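The replace-and-filter preprocessing of S102 can be sketched as follows. This is a minimal illustration under stated assumptions, not the applicant's implementation: the replacement table `REPLACEMENTS`, the function name `preprocess`, and the choice to keep only CJK ideographs and ASCII letters are all hypothetical, since the source does not specify them.

```python
import re

# Hypothetical English-to-Chinese replacement table; the source gives no list.
REPLACEMENTS = {"ok": "好的", "vip": "贵宾"}

# Anything that is not a CJK ideograph or an ASCII letter is filtered out:
# digits, punctuation, emoji and other symbols.
NON_TEXT = re.compile(r"[^\u4e00-\u9fffA-Za-z]+")


def preprocess(content: str) -> str:
    """Return the plain-text part of one message's content."""
    text = content
    for english, chinese in REPLACEMENTS.items():
        # Replace the English word only when it is not part of a longer word.
        pattern = r"(?<![A-Za-z])" + re.escape(english) + r"(?![A-Za-z])"
        text = re.sub(pattern, chinese, text, flags=re.IGNORECASE)
    return NON_TEXT.sub("", text)
```

English words that are not in the table are left in place here; whether they should be dropped instead is a design choice the source leaves open.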
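S103: according to preset quality check points and the rules corresponding to the quality check points, use a full-text search engine to query the preprocessed dialogue text data for quality check points that match the corresponding rules and mark them, so as to obtain a quality check result.
A quality check point can be understood as a place that is non-compliant or in violation, that is, a place where an error exists. Each quality check point has an identifier, such as A47, which denotes the 47th quality check point in class A. The rule corresponding to a quality check point consists of keywords and logical operations. An example of a quality check point and its rule is: A47, 基金 and 分红, where the keywords are 基金 (fund) and 分红 (dividend) and the logical operation is and. Quality check point A47 means that if "fund" and "dividend" both appear in one message, the message is considered to be in violation. This can be understood as follows: funds as a product do not involve dividends, or, put differently, dividends would not be mentioned when discussing a fund, so a message in which both words appear is in violation, that is, it contains an error. The full-text search engine is the ElasticSearch (abbreviated ES) search engine. ES uses the keywords, and the rule corresponding to a quality check point is implemented by combining and wrapping the must, should, must not and similar clauses provided by the ES API. The preprocessed dialogue text data is then searched according to the corresponding rule (the query condition), and the matching quality check points are found and marked to obtain the ES quality check point result, which is used as the quality check result. Marking uses the quality check point identifier, and the ES quality check point result denotes the quality check point result obtained by ES query matching.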
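As a hedged illustration of how a keyword rule such as "A47: 基金 and 分红" might be expressed with the must / should / must_not clauses of an Elasticsearch bool query, the sketch below only builds the request body as a plain dictionary. The rule table `RULES`, the helper `build_rule_query`, and the field name `content` are assumptions made for this example; the source does not describe the index mapping or the exact query type used.

```python
from typing import Any, Dict, Sequence

# Hypothetical rule table: quality check point identifier -> keyword lists.
RULES = {"A47": {"must": ["基金", "分红"]}}


def build_rule_query(
    must: Sequence[str] = (),
    should: Sequence[str] = (),
    must_not: Sequence[str] = (),
    field: str = "content",  # assumed name of the message-content field
) -> Dict[str, Any]:
    """Build a bool-query body: must = and, should = or, must_not = not."""

    def clauses(words: Sequence[str]):
        return [{"match_phrase": {field: word}} for word in words]

    bool_body: Dict[str, Any] = {}
    if must:
        bool_body["must"] = clauses(must)
    if should:
        bool_body["should"] = clauses(should)
        # Require at least one "or" keyword even when must clauses exist.
        bool_body["minimum_should_match"] = 1
    if must_not:
        bool_body["must_not"] = clauses(must_not)
    return {"query": {"bool": bool_body}}
```

Every message returned for `build_rule_query(**RULES["A47"])` would then be marked with the identifier A47. How keywords are matched in practice depends on the analyzer configured for the field, which is outside this sketch.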
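In an embodiment, as shown in FIG. 2, step S103 includes the following steps S201-S203.
S201: segment the preprocessed dialogue text data into words. The tokenizer of the full-text search engine splits the specific message content of each piece of message text data in the preprocessed dialogue text data into multiple words. For example, the message "我来到北京清华大学" ("I came to Tsinghua University in Beijing") is segmented into "我 来到 北京 清华大学".
S202: build an inverted index over the segmented data. Specifically, count the number of occurrences and the positions of each word in the dialogue text data, and build the inverted index for the words from those counts and positions. For example, count the occurrences and positions of the word "分红" (dividend) in the dialogue text data, where a position identifies the dialogue text data table and the message text data (which can be denoted by its message number) in which the word occurs. An inverted index is a storage form that implements the "word-document matrix"; with it, the "document list" containing a given word can be retrieved quickly from the word. In the dialogue text data, the inverted index makes it possible to quickly retrieve, from a word, the message text data that contains it, that is, which pieces of message text data the word appears in.
S203: according to the preset quality check points and the rules corresponding to the quality check points, use the inverted index and the full-text search engine to query the preprocessed dialogue text data for quality check points that match the corresponding rules and mark them. Once a matching quality check point has been found by searching with its rule, the quality check point is marked. Marking in the dialogue text data can be understood as appending the corresponding quality check point mark to each piece of message text data. Building the inverted index speeds up query matching, so that even with a very large volume of data the quality check points can still be matched and marked quickly.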
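Steps S202 and S203 can be illustrated with a small in-memory inverted index. This is only a sketch of the data structure: in the described method the index is built and queried by the search engine itself, and the names `build_inverted_index` and `match_all` as well as the pre-tokenized input are assumptions for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def build_inverted_index(
    tokenized: Dict[str, List[str]]
) -> Dict[str, Dict[str, List[int]]]:
    """Map each word to {message number: [positions of the word]}."""
    index: Dict[str, Dict[str, List[int]]] = defaultdict(dict)
    for msg_id, tokens in tokenized.items():
        for position, token in enumerate(tokens):
            index[token].setdefault(msg_id, []).append(position)
    return dict(index)


def match_all(
    index: Dict[str, Dict[str, List[int]]], keywords: Iterable[str]
) -> Set[str]:
    """Return the message numbers that contain every keyword (an 'and' rule)."""
    postings = [set(index.get(word, {})) for word in keywords]
    return set.intersection(*postings) if postings else set()
```

With the index in place, evaluating an "and" rule is an intersection of posting lists rather than a scan over every message, which is the speed-up the text refers to.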
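S104: integrate the marked dialogue text data including quality check points into session-level session text data including quality check points.
Because message-level dialogue text data is stored message by message, it is scattered and unordered and lacks context and information about who is talking to whom, which makes it inconvenient for users to read. The dialogue text data therefore needs to be integrated into session-level session text data. Session-level session text data is stored with one dialogue (session) between an agent and a customer as the unit, that is, the session text data holds multiple dialogues between agents and customers; each dialogue can include, for example, a dialogue number and the dialogue content, and each dialogue content corresponds to multiple pieces of message text data.
The integration specifically includes: finding the sender and the receiver in each piece of message text data in the marked dialogue text data, and treating the sender and the receiver as one set; grouping the message text data in the dialogue text data by set; and sorting the data in each group by the time the messages were sent and displaying it in a predetermined format, forming session-level session text data including quality check points. The data is thus divided into multiple groups: the sender and receiver in one group are the two parties of the same dialogue, and different dialogues fall into different groups. The data of each group, which is the dialogue content, is displayed in the predetermined format, which can be, for example: sender; receiver; dialogue content; ES quality check point result. The pieces of message text data within the dialogue content can be displayed in the format "time the message was sent[space]specific message content", for example "2017-01-01 12:01:02 Teacher Li, are you there?". Each piece of message text data has a corresponding ES quality check point result. The predetermined format can also include a date, namely the quality check date. Put simply, the session-level session text data including quality check points is the set of dialogues obtained by arranging the message text data of the marked message-level dialogue text data by time order and by sender and receiver. Both the marked message-level dialogue text data and the session-level session text data including quality check points are stored as data tables in a database, such as an Oracle database. Depending on the data volume, each can be stored in multiple data tables or in a single data table.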
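The grouping and ordering described above can be sketched as below. The `Message` record and the function names are hypothetical; the sketch assumes timestamps in a fixed "YYYY-MM-DD HH:MM:SS" form, so that sorting the strings sorts by time, and it treats the sender/receiver pair as an unordered set so that both directions of a dialogue fall into the same group.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, List


@dataclass
class Message:
    msg_id: str
    sender: str
    receiver: str
    sent_at: str          # assumed format "YYYY-MM-DD HH:MM:SS"
    content: str
    es_result: str = ""   # quality check point identifiers; empty if none


def to_conversations(
    messages: Iterable[Message],
) -> Dict[FrozenSet[str], List[Message]]:
    """Group messages by the {sender, receiver} set and sort each group by time."""
    groups: Dict[FrozenSet[str], List[Message]] = defaultdict(list)
    for message in messages:
        groups[frozenset((message.sender, message.receiver))].append(message)
    for group in groups.values():
        group.sort(key=lambda m: m.sent_at)
    return dict(groups)


def render(conversation: List[Message]) -> List[str]:
    """Format each message as 'send time[space]specific message content'."""
    return [f"{m.sent_at} {m.content}" for m in conversation]
```

Grouping on the participant set alone puts every message between the same two people into one dialogue; splitting such a group into separate sessions, for example by date, would need an extra rule that the source does not describe.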
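The integrated session text data including quality check points is shown in Table 1. Note that Table 1 is only an example. Table 1 can contain multiple pieces of session text data, each of which includes the date (the quality check date), the sender, the receiver, the dialogue content, and the ES quality check point result; the dialogue content includes multiple pieces of message text data, each with its own ES quality check point result. For example, for the specific message content "I'm here. Do you have another good product?", the ES quality check point result is empty, indicating that the message content is qualified/compliant. For the specific message content "Yes, this product is called xxx. Its interest rate is 5 to 7 percent, and after six months you can borrow 10 times the total amount saved per year.", the ES quality check point result is A45, indicating that the message content is in violation, specifically of quality check point A45.
Table 1: Example of integrated session text data including quality check points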
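S105: update the quality check result according to the user's modification request for the quality check points in the session text data.
Specifically, obtain the user's permission, for example from the user's account and password; determine whether the current user's permission is the preset permission, where users with the preset permission can update the quality check result; and, if so, update the quality check result according to that user's modification request for the quality check points in the session text data.
In the embodiments of the present application, a user with the preset permission can view the specific content of the session text messages, the sender, the receiver, the date, the ES quality check point result, and the options the user may modify. When the user opens the page containing these items, this is understood as receiving the user's modification request for the quality check points of the session text data. The modifiable options include the manual quality check point result, the quality check violation remark, and whether the message is compliant. These items are empty before the user modifies them, and the user can edit them according to the actual situation to update the quality check result. The manual quality check point result is expressed with quality check point identifiers. The quality check violation remark contains the specific text corresponding to the quality check point (the violation) and the reason for the violation, which is written in parentheses immediately after that text. A user with the preset permission can also change the current quality check point to qualified/compliant (no error).
Note that the manual quality check point result is taken as the updated quality check point result, and the final quality check point result is the updated one. If the manual result is the same as the ES quality check point result, the same content must still be entered in the manual quality check point result. If the manual quality check point result is empty, the quality check point result of the message is qualified/compliant.
Table 2: Example of quality check results updated by a user with the preset permission
The content as modified by a user with the preset permission is shown in Table 2. Note that Table 2 is only an example. As Table 2 shows, each session text data table gains the options manual quality check point result, quality check violation remark, and whether compliant. These options are visible once a user with the preset permission opens the corresponding table. For the specific message content "Yes, this product is called xxx. Its interest rate is 5 to 7 percent, and after six months you can borrow 10 times the total amount saved per year", the ES quality check point result is A45 and the manual quality check point result is A42,A45, indicating that the message content contains two violations. The corresponding quality check violation remark is: "Yes, this product is called xxx, and its interest rate is 5 to 7 percent (the product interest rate is wrong; it is 4 to 8 percent); after six months you can borrow 10 times the total amount saved per year (the loan amount is not limited)", whose two parts are the violation remarks for A42 and A45 respectively.
Note that when a quality check point is being modified, the quality check content related to that point and a text message whose data is entirely correct can be viewed in a pre-stored file, based on the ES quality check point result and a received view instruction. The view instruction can be generated by clicking a view button. The pre-stored file holds the content of all quality check points together with entirely correct text messages related to each quality check point. When a view instruction is received, the content of the quality check point corresponding to the ES result, and the related entirely correct text message, are looked up in the pre-stored file. This improves the efficiency and accuracy of manual quality checking and makes it easier to write quality check violation remarks.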
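S106: extract a data set from the updated data according to a preset format. The data set is intended for training a model, so it must contain at least the quality check point result and the specific message content of the message text data. The preset format includes: the specific message content corresponding to the message text data, the quality check point result of the specific message content of each piece of message text data, and the quality check violation remark. The quality check point result of the specific message content of each piece of message text data is the updated quality check point result.
Specifically, as shown in FIG. 3, step S106 includes the following steps S301-S304.
S301: for each piece of message text data, determine whether the quality check point result of its specific message content contains multiple quality check points or is empty.
If there are multiple quality check point results, that is, two or more quality check points, the specific message content of that message text data contains two or more violations. Each violation then needs to be separated out to facilitate further analysis of each quality check point, for example classification using the extracted data set.
S302: if there are multiple quality check point results, separate the quality check points, the specific message content of the message text data corresponding to each quality check point, and the corresponding quality check violation remarks from one another and pair them up, forming data that has a single quality check point as its quality check point result and conforms to the preset format.
Specifically, detect the pairs of parentheses in the quality check violation remark corresponding to the message text data. The content before the first pair of parentheses is taken as the specific message content of one piece of message text data, the content inside the first pair as its quality check violation remark, and the first quality check point in the manual quality check point result as its quality check point result. The content between the next pair of parentheses and the previous pair is taken as the specific message content of another piece of message text data, the content inside the next pair as its quality check violation remark, and the next quality check point in the manual quality check point result as its quality check point result. In this way the quality check points, the corresponding specific message content, and the corresponding violation remarks are separated and paired up, forming data that has a single quality check point as its quality check point result and conforms to the preset format.
S303: if the quality check point result is empty, mark the quality check point result of the specific message content corresponding to the message text data with a compliance identifier. Compliance can be denoted by the identifier good or by another identifier.
S304: extract, as the data set, the specific message content corresponding to the message text data, the quality check point result of that specific message content, and the quality check violation remark. The extracted data includes the data obtained by separating multiple quality check points, the data whose quality check point result is compliant, and the data whose quality check point result is a single quality check point.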
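Steps S301 to S304 can be sketched as a single function that turns one reviewed message into one or more data set rows. The function name `extract_rows`, the dictionary keys, and the decision to raise an error when the number of parenthesis pairs does not match the number of quality check points are assumptions; the source describes only the parenthesis-based splitting itself. Both ASCII and full-width parentheses are accepted.

```python
import re
from typing import Dict, List

GOOD = "good"  # compliance identifier used in the source's example

# Text before a pair of parentheses, then the reason inside the pair.
SEGMENT = re.compile(r"([^()()]+)[((]([^()()]*)[))]")


def extract_rows(content: str, manual_result: str, remark: str) -> List[Dict[str, str]]:
    """Return data set rows (content, result, remark) for one message."""
    points = [p.strip() for p in re.split(r"[,,]", manual_result) if p.strip()]
    if not points:
        # S303: an empty result means the message is compliant.
        return [{"content": content, "result": GOOD, "remark": ""}]
    if len(points) == 1:
        # A single quality check point is kept as one row, unchanged.
        return [{"content": content, "result": points[0], "remark": remark}]
    # S302: split the remark at each pair of parentheses.
    segments = SEGMENT.findall(remark)
    if len(segments) != len(points):
        raise ValueError("remark does not contain one parenthesis pair per point")
    return [
        {"content": text.strip(";;,, "), "result": point, "remark": reason.strip()}
        for point, (text, reason) in zip(points, segments)
    ]
```

For a message reviewed as A42,A45 whose remark has two parenthesized reasons, the function yields two rows, one per quality check point, which is the shape of the extracted data set described for Table 3.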
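Table 3 shows an example of the extracted data set. Note that Table 3 is only an example. As Table 3 shows, the data set contains multiple rows, each of which includes the specific message content corresponding to the message text data, the quality check point result (quality check identifier) of that content, and the quality check violation remark. In the first row, the specific message content is "Yes, this product is called xxx, and its interest rate is 5 to 7 percent", the quality check point identifier is A42, and the violation remark is "the product interest rate is wrong; it is 4 to 8 percent". In the second row, the specific message content is "after six months you can borrow 10 times the total amount saved per year", the quality check point identifier is A45, and the violation remark is "the loan amount is not limited". In the third row, the specific message content is "There is a good product; would you like to take a look", and the quality check point identifier is good, indicating that the message is compliant.
Table 3: Example of the extracted data set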
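FIG. 4 is a schematic flowchart of a method for classification using a data set according to an embodiment of the present application. As shown in FIG. 4, the method includes S401-S410. Steps S401-S406 correspond to the steps of the embodiment shown in FIG. 1 and are not repeated here; only steps S407-S410 are described below.
S407: train a classification model with the extracted data set, and use the trained classification model to classify message-level dialogue text data that has not been quality-checked, obtaining quality check points and marking them to produce a quality check point result. The classification model can be any multi-class model, such as a long short-term memory (LSTM) neural network model or a random forest classification model. If the classification model is a neural network model, training it includes: obtaining the data set; segmenting the text in the data set with a word segmentation tool; processing the segmented data with a preset word vector model to obtain the corresponding word vectors; and training the neural network model from the word vectors and the corresponding quality check points in the data set. The segmentation tool can be jieba (结巴分词), used in its precise mode; segmentation splits the text in the data set into words, for example the message "我来到北京清华大学" is segmented into "我 来到 北京 清华大学".
The word vector (word embedding) model is gensim's word2vec model. word2vec is in fact a shallow neural network that can be trained efficiently on dictionaries of millions of words and data sets of hundreds of millions of items; its training output is word vectors, which measure the similarity between words well. The preset word vector model can be obtained by training in advance: obtain a training set and segment its text; set the parameters for training the word2vec model, such as the minimum count min_count=5 (words occurring fewer than 5 times are discarded), the number of hidden-layer units size=128, and the number of iterations iterator=5; then train the word2vec model with the segmented data as the training data set to obtain the preset word vector model.
Training the neural network model from the word vectors and the corresponding quality check points includes: feeding in the word vectors and the corresponding quality check points to train the neural network (if the model is an LSTM model, an LSTM network is trained); feeding the outputs of the network's nodes into an average pooling layer to fuse them; and feeding the pooled data into a softmax function to obtain the classification result, with the goal that the classification results agree with the marked quality check point results as often as possible. Once the classification model is trained, it is used to classify message-level dialogue text data that has not been quality-checked, obtaining quality check points and marking them to produce a quality check point result.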
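The last stage described above, average pooling over the network's per-token outputs followed by softmax, can be written out in plain Python to make the data flow concrete. This is a sketch under stated assumptions rather than the described model: the per-token outputs are taken as given (no LSTM is implemented here), and the linear layer with `weights` and `bias` placed between pooling and softmax is an assumption, since the source only says that the pooled data is fed into a softmax function.

```python
import math
from typing import List, Sequence, Tuple


def mean_pool(step_outputs: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-token output vectors into one vector."""
    count = len(step_outputs)
    dim = len(step_outputs[0])
    return [sum(vec[i] for vec in step_outputs) / count for i in range(dim)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(
    step_outputs: Sequence[Sequence[float]],
    weights: Sequence[Sequence[float]],  # one row per label (assumed linear layer)
    bias: Sequence[float],
    labels: Sequence[str],
) -> Tuple[str, List[float]]:
    """Pool the token outputs, score each label, and pick the most probable one."""
    pooled = mean_pool(step_outputs)
    logits = [
        sum(w * x for w, x in zip(row, pooled)) + b
        for row, b in zip(weights, bias)
    ]
    probabilities = softmax(logits)
    best = max(range(len(labels)), key=probabilities.__getitem__)
    return labels[best], probabilities
```

In a real pipeline the `step_outputs` would come from the trained network applied to the word vectors of one message, and `labels` would be the quality check point identifiers plus the compliance identifier.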
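S408: update the quality check result according to the user's modification request for the quality check points in the session text data.
S409: update the classification model according to the updated data.
S410: use the updated classification model to classify message-level dialogue text data that has not been quality-checked, obtaining quality check points and marking them to obtain a quality check result.
In this embodiment, the extracted data set is used to train a classification model, the model classifies the dialogue text data to obtain quality check points, the quality check result is updated according to the user's modification request for the quality check points in the session text data, the classification model is updated with the updated data, and the updated model is used to classify dialogue text data to obtain quality check points. Because the user's updates to the quality check point results yield relatively accurate quality check results, and these are used to update the classification model before it is applied again, the updated model classifies more accurately, which improves the classification accuracy. The embodiment incorporates human intelligence, forming a human-in-the-loop hybrid intelligence paradigm that raises the level of machine intelligence.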
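The predict, correct, retrain cycle of S407 to S410 can be sketched with a deliberately trivial stand-in model. `KeywordVoteClassifier` is a toy invented for this example and is not the classification model of the application; `review_round` and the shape of the `corrections` mapping (message number to reviewer-assigned label) are likewise assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Row = Tuple[List[str], str]  # (tokens of one message, quality check label)


class KeywordVoteClassifier:
    """Toy stand-in model: each token votes for the labels it was seen with."""

    def __init__(self) -> None:
        self.votes: Dict[str, Counter] = defaultdict(Counter)

    def fit(self, rows: List[Row]) -> "KeywordVoteClassifier":
        self.votes.clear()
        for tokens, label in rows:
            for token in tokens:
                self.votes[token][label] += 1
        return self

    def predict(self, tokens: List[str]) -> str:
        tally: Counter = Counter()
        for token in tokens:
            tally.update(self.votes.get(token, {}))
        # With no evidence at all, fall back to the compliance label.
        return tally.most_common(1)[0][0] if tally else "good"


def review_round(
    model: KeywordVoteClassifier,
    dataset: List[Row],
    unchecked: Dict[str, List[str]],
    corrections: Dict[str, str],
) -> Tuple[Dict[str, str], List[Row]]:
    """One cycle: classify, apply reviewer corrections, retrain on the result."""
    predicted = {msg_id: model.predict(tokens) for msg_id, tokens in unchecked.items()}
    final = {**predicted, **corrections}  # reviewer labels override the model
    dataset = dataset + [(unchecked[msg_id], label) for msg_id, label in final.items()]
    model.fit(dataset)
    return final, dataset
```

Each call grows the training data with reviewed labels, so later rounds classify with a model that has seen the reviewer's corrections. Retraining from scratch on every round is a simplification; the source does not say whether the model is retrained or updated incrementally.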
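FIG. 5 is a schematic block diagram of a data set acquisition apparatus according to an embodiment of the present application. As shown in FIG. 5, the apparatus 50 includes an obtaining unit 501, a preprocessing unit 502, a marking unit 503, an integrating unit 504, a quality check updating unit 505, and an extracting unit 506.
The obtaining unit 501 is configured to acquire message-level dialogue text data.
The preprocessing unit 502 is configured to preprocess the message-level dialogue text data.
The marking unit 503 is configured to, according to preset quality check points and the rules corresponding to the quality check points, use a full-text search engine to query the preprocessed dialogue text data for quality check points that match the corresponding rules and mark them, so as to obtain a quality check result.
In an embodiment, as shown in FIG. 6, the marking unit 503 includes a word segmentation unit 601, an index unit 602, and a matching and marking unit 603.
The word segmentation unit 601 is configured to segment the preprocessed dialogue text data into words.
The index unit 602 is configured to build an inverted index over the segmented data.
The matching and marking unit 603 is configured to, according to the preset quality check points and the rules corresponding to the quality check points, use the inverted index and the full-text search engine to query the preprocessed dialogue text data for quality check points that match the corresponding rules and mark them.
The integrating unit 504 is configured to integrate the marked dialogue text data including quality check points into session-level session text data including quality check points.
To perform this integration, the integrating unit 504 includes a set finding unit, a grouping unit, and a sorting and display unit. The set finding unit is configured to find the sender and the receiver in each piece of message text data in the marked dialogue text data including quality check points, and to treat the sender and the receiver as one set. The grouping unit is configured to group the message text data in the dialogue text data by set. The sorting and display unit is configured to sort the data in each group by the time the messages were sent and display it in a predetermined format, forming session-level session text data including quality check points.
The quality check updating unit 505 is configured to update the quality check result according to the user's modification request for the quality check points in the session text data.
The extracting unit 506 is configured to extract a data set from the updated data according to a preset format.
In an embodiment, as shown in FIG. 7, the extracting unit 506 includes a data judging unit 701, a separating unit 702, a mark adding unit 703, and a data set extraction unit 704.
The data judging unit 701 is configured to determine, for each piece of message text data, whether the quality check point result of its specific message content contains multiple quality check points or is empty.
The separating unit 702 is configured to, if there are multiple quality check point results, separate the quality check points, the specific message content of the message text data corresponding to each quality check point, and the corresponding quality check violation remarks from one another and pair them up, forming data that has a single quality check point as its quality check point result and conforms to the preset format.
The mark adding unit 703 is configured to, if the quality check point result is empty, mark the quality check point result of the specific message content corresponding to the message text data with a compliance identifier.
The data set extraction unit 704 is configured to extract, as the data set, the specific message content corresponding to the message text data, the quality check point result of that specific message content, and the quality check violation remark. The extracted data includes the data obtained by separating multiple quality check points, the data whose quality check point result is compliant, and the data whose quality check point result is a single quality check point. For the extracted data set, see the example of FIG. 3.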
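FIG. 8 is a schematic block diagram of an apparatus for classification using a data set according to an embodiment of the present application. The apparatus 80 includes an obtaining unit 801, a preprocessing unit 802, a marking unit 803, an integrating unit 804, a quality check updating unit 805, an extracting unit 806, a classification unit 807, and a model updating unit 808. This embodiment differs from the embodiment shown in FIG. 5 in that the classification unit 807 and the model updating unit 808 are added. For the other units, namely the obtaining unit 801, the preprocessing unit 802, the marking unit 803, the integrating unit 804, the quality check updating unit 805, and the extracting unit 806, refer to the description of the embodiment of FIG. 5; details are not repeated here. The classification unit 807 and the model updating unit 808 are described below.
The classification unit 807 is configured to train the classification model with the extracted data set, and to use the trained classification model to classify message-level dialogue text data that has not been quality-checked, obtaining quality check points and marking them to produce a quality check point result.
The quality check updating unit 805 is further configured to update the quality check result according to the user's modification request for the quality check points in the session text data.
The model updating unit 808 is configured to update the classification model according to the updated data.
The classification unit 807 is further configured to use the updated classification model to classify message-level dialogue text data that has not been quality-checked, obtaining quality check points and marking them.
In other embodiments, the apparatus for classification using a data set further includes units corresponding to the above method embodiments.
For the specific working processes and beneficial effects of the above apparatus embodiments, refer to the corresponding implementation processes and beneficial effects of the foregoing method embodiments; details are not repeated here.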
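The above apparatus may be implemented in the form of a computer program that can run on a computer device as shown in FIG. 9.
FIG. 9 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 90 may be a portable device such as a mobile phone or a tablet, or a non-portable device such as a desktop computer. The device 90 includes a processor 902, a memory, and a network interface 903 connected by a system bus 901, where the memory may include a non-volatile storage medium 904 and an internal memory 905.
The non-volatile storage medium 904 can store an operating system 9041 and a computer program 9042. When the computer program 9042 is executed, it can cause the processor 902 to perform the data set acquisition method. The processor 902 provides computing and control capabilities and supports the operation of the entire device 90. The internal memory 905 provides an environment for running the computer program stored in the non-volatile storage medium; when executed by the processor 902, the computer program can cause the processor 902 to perform the data set acquisition method. The network interface 903 is used for network communication, such as acquiring data. Those skilled in the art will understand that the structure shown in FIG. 9 is only a block diagram of the part of the structure related to the solution of the present application and does not limit the device 90 to which the solution is applied; a specific device 90 may include more or fewer components than shown in the figure, combine some components, or have a different arrangement of components.
The processor 902 is configured to run the computer program stored in the memory to implement any of the foregoing embodiments of the data set acquisition method.
In another embodiment, when the computer program 9042 is executed, it can cause the processor 902 to perform the method for classification using a data set. The processor 902 provides computing and control capabilities and supports the operation of the entire device 90. The internal memory 905 provides an environment for running the computer program stored in the non-volatile storage medium; when executed by the processor 902, the computer program can cause the processor 902 to perform the method for classification using a data set. The network interface 903 is used for network communication.
The processor 902 is configured to run the computer program stored in the memory to implement any of the foregoing embodiments of the method for classification using a data set.
It should be understood that, in the embodiments of the present application, the processor 902 or 102 may be a central processing unit (CPU); it may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, or a discrete gate or transistor logic device. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor.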
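Another embodiment of the present application provides a computer-readable storage medium storing a computer program, the computer program including program instructions that, when executed by a processor, implement any of the foregoing embodiments of the data set acquisition method.
Another embodiment of the present application further provides a computer-readable storage medium storing a computer program, the computer program including program instructions that, when executed by a processor, implement any of the foregoing embodiments of the method for classification using a data set.
The computer-readable storage medium may be an internal storage unit of the terminal described in any of the foregoing embodiments, such as a hard disk or a memory of the terminal. The computer-readable storage medium may also be an external storage device of the terminal, such as a plug-in hard disk, a Smart Media Card (SMC), or a Secure Digital (SD) card fitted to the terminal. Further, the computer-readable storage medium may include both an internal storage unit of the terminal and an external storage device.
In the embodiments provided in the present application, it should be understood that the disclosed terminal and method may be implemented in other ways. For example, the terminal embodiment described above is only illustrative; the division into units is only a division by logical function, and an actual implementation may divide them differently. A person skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the terminal and the units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here. The foregoing is only a specific implementation of the present application, but the scope of protection of the present application is not limited thereto. Any person skilled in the art can readily conceive of various equivalent modifications or substitutions within the technical scope disclosed in the present application, and these modifications or substitutions shall fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be determined by the scope of protection of the claims.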
Claims (20)
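- A data set acquisition method, characterized in that the method comprises: acquiring message-level dialogue text data; preprocessing the message-level dialogue text data; according to preset quality check points and the rules corresponding to the quality check points, using a full-text search engine to query the preprocessed dialogue text data for quality check points that match the rules and marking them to obtain a quality check result; integrating the marked dialogue text data including quality check points into session-level session text data including quality check points; updating the quality check result according to a user's modification request for the quality check points in the session text data; and extracting a data set from the updated data according to a preset format.
- The method according to claim 1, characterized in that the step of, according to preset quality check points and the rules corresponding to the quality check points, using a full-text search engine to query the preprocessed dialogue text data for quality check points that match the rules and marking them comprises: segmenting the dialogue text data into words; building an inverted index over the segmented data; and, according to the preset quality check points and the rules corresponding to the quality check points, using the inverted index and the full-text search engine to query the preprocessed dialogue text data for quality check points that match the rules and marking them.
- The method according to claim 1, characterized in that the integrating of the marked dialogue text data including quality check points into session-level session text data including quality check points comprises: finding the sender and the receiver in each piece of message text data in the marked dialogue text data including quality check points, and treating the sender and the receiver as one set; grouping the message text data by set; and sorting the data in each group by the time the messages were sent and displaying it in a predetermined format, forming session-level session text data including quality check points.
- The method according to claim 1, characterized in that the updated data includes multiple pieces of message text data, the quality check point result of the specific message content of each piece of message text data, and quality check violation remarks, and the preset format includes: the specific message content corresponding to the message text data, the quality check point result of the specific message content of each piece of message text data, and the quality check violation remark; and the extracting of a data set from the updated data according to a preset format comprises: for each piece of message text data, determining whether the quality check point result of its specific message content contains multiple quality check points; if so, separating the quality check points, the specific message content of the message text data corresponding to each quality check point, and the corresponding quality check violation remarks from one another and pairing them up, forming data that has a single quality check point as its quality check point result and conforms to the preset format; and extracting, as the data set, the specific message content corresponding to the message text data, the quality check point result of that specific message content, and the quality check violation remark.
- A method for classification using a data set, characterized in that the method comprises: training a classification model with the data set extracted by the method according to claim 1, and using the trained classification model to classify message-level dialogue text data that has not been quality-checked, obtaining quality check points and marking them to obtain a quality check result; updating the quality check result according to a user's modification request for the quality check points in the dialogue text data; updating the classification model according to the updated data; and using the updated classification model to classify message-level dialogue text data that has not been quality-checked, obtaining quality check points and marking them to obtain a quality check result.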
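- A data set acquisition apparatus, characterized in that the apparatus comprises: an obtaining unit configured to acquire message-level dialogue text data; a preprocessing unit configured to preprocess the message-level dialogue text data; a marking unit configured to, according to preset quality check points and the rules corresponding to the quality check points, use a full-text search engine to query the preprocessed dialogue text data for quality check points that match the rules and mark them to obtain a quality check result; an integrating unit configured to integrate the marked dialogue text data including quality check points into session-level session text data including quality check points; a quality check updating unit configured to update the quality check result according to a user's modification request for the quality check points in the session text data; and an extracting unit configured to extract a data set from the updated data according to a preset format.
- The apparatus according to claim 6, characterized in that the marking unit comprises: a word segmentation unit configured to segment the dialogue text data into words; an index unit configured to build an inverted index over the segmented data; and a matching and marking unit configured to, according to the preset quality check points and the rules corresponding to the quality check points, use the inverted index and the full-text search engine to query the preprocessed dialogue text data for quality check points that match the rules and mark them.
- The apparatus according to claim 6, characterized in that the integrating unit comprises: a set finding unit configured to find the sender and the receiver in each piece of message text data in the marked dialogue text data including quality check points, and to treat the sender and the receiver as one set; a grouping unit configured to group the message text data by set; and a sorting and display unit configured to sort the data in each group by the time the messages were sent and display it in a predetermined format, forming session-level session text data including quality check points.
- The apparatus according to claim 6, characterized in that the updated data includes multiple pieces of message text data, the quality check point result of the specific message content of each piece of message text data, and quality check violation remarks, and the preset format includes: the specific message content corresponding to the message text data, the quality check point result of the specific message content of each piece of message text data, and the quality check violation remark; and the extracting unit comprises: a data judging unit configured to determine, for each piece of message text data, whether the quality check point result of its specific message content contains multiple quality check points; a separating unit configured to, if so, separate the quality check points, the specific message content of the message text data corresponding to each quality check point, and the corresponding quality check violation remarks from one another and pair them up, forming data that has a single quality check point as its quality check point result and conforms to the preset format; and a data set extraction unit configured to extract, as the data set, the specific message content corresponding to the message text data, the quality check point result of that specific message content, and the quality check violation remark.
- An apparatus for classification using a data set, characterized in that the apparatus comprises: a classification unit configured to train a classification model with the data set extracted by the units of the apparatus according to claim 6, and to use the trained classification model to classify message-level dialogue text data that has not been quality-checked, obtaining quality check points and marking them to obtain a quality check result; a quality check updating unit further configured to update the quality check result according to a user's modification request for the quality check points in the session text data; and a model updating unit configured to update the classification model according to the updated data; the classification unit being further configured to use the updated classification model to classify message-level dialogue text data that has not been quality-checked, obtaining quality check points and marking them to obtain a quality check result.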
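- A computer device, characterized in that the computer device comprises a memory and a processor connected to the memory; the memory is configured to store a computer program; and the processor is configured to run the computer program stored in the memory so as to perform the following steps: acquiring message-level dialogue text data; preprocessing the message-level dialogue text data; according to preset quality check points and the rules corresponding to the quality check points, using a full-text search engine to query the preprocessed dialogue text data for quality check points that match the rules and marking them to obtain a quality check result; integrating the marked dialogue text data including quality check points into session-level session text data including quality check points; updating the quality check result according to a user's modification request for the quality check points in the session text data; and extracting a data set from the updated data according to a preset format.
- The computer device according to claim 11, characterized in that, when performing the step of, according to preset quality check points and the rules corresponding to the quality check points, using a full-text search engine to query the preprocessed dialogue text data for quality check points that match the rules and marking them, the processor specifically performs the following steps: segmenting the dialogue text data into words; building an inverted index over the segmented data; and, according to the preset quality check points and the rules corresponding to the quality check points, using the inverted index and the full-text search engine to query the preprocessed dialogue text data for quality check points that match the rules and marking them.
- The computer device according to claim 11, characterized in that, when performing the integrating of the marked dialogue text data including quality check points into session-level session text data including quality check points, the processor specifically performs the following steps: finding the sender and the receiver in each piece of message text data in the marked dialogue text data including quality check points, and treating the sender and the receiver as one set; grouping the message text data by set; and sorting the data in each group by the time the messages were sent and displaying it in a predetermined format, forming session-level session text data including quality check points.
- The computer device according to claim 11, characterized in that the updated data includes multiple pieces of message text data, the quality check point result of the specific message content of each piece of message text data, and quality check violation remarks, and the preset format includes: the specific message content corresponding to the message text data, the quality check point result of the specific message content of each piece of message text data, and the quality check violation remark; and, when performing the extracting of a data set from the updated data according to a preset format, the processor specifically performs the following steps: for each piece of message text data, determining whether the quality check point result of its specific message content contains multiple quality check points; if so, separating the quality check points, the specific message content of the message text data corresponding to each quality check point, and the corresponding quality check violation remarks from one another and pairing them up, forming data that has a single quality check point as its quality check point result and conforms to the preset format; and extracting, as the data set, the specific message content corresponding to the message text data, the quality check point result of that specific message content, and the quality check violation remark.
- A computer device, characterized in that the computer device comprises a memory and a processor connected to the memory; the memory is configured to store a computer program; and the processor is configured to run the computer program stored in the memory so as to perform the following steps: training a classification model with the data set extracted by the method according to claim 1, and using the trained classification model to classify message-level dialogue text data that has not been quality-checked, obtaining quality check points and marking them to obtain a quality check result; updating the quality check result according to a user's modification request for the quality check points in the dialogue text data; updating the classification model according to the updated data; and using the updated classification model to classify message-level dialogue text data that has not been quality-checked, obtaining quality check points and marking them to obtain a quality check result.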
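- A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program including program instructions that, when executed by a processor, implement the following steps: acquiring message-level dialogue text data; preprocessing the message-level dialogue text data; according to preset quality check points and the rules corresponding to the quality check points, using a full-text search engine to query the preprocessed dialogue text data for quality check points that match the rules and marking them to obtain a quality check result; integrating the marked dialogue text data including quality check points into session-level session text data including quality check points; updating the quality check result according to a user's modification request for the quality check points in the session text data; and extracting a data set from the updated data according to a preset format.
- The computer-readable storage medium according to claim 16, characterized in that, when the processor performs the step of, according to preset quality check points and the rules corresponding to the quality check points, using a full-text search engine to query the preprocessed dialogue text data for quality check points that match the rules and marking them, the following steps are specifically implemented: segmenting the dialogue text data into words; building an inverted index over the segmented data; and, according to the preset quality check points and the rules corresponding to the quality check points, using the inverted index and the full-text search engine to query the preprocessed dialogue text data for quality check points that match the rules and marking them.
- The computer-readable storage medium according to claim 16, characterized in that, when the processor performs the integrating of the marked dialogue text data including quality check points into session-level session text data including quality check points, the following steps are specifically implemented: finding the sender and the receiver in each piece of message text data in the marked dialogue text data including quality check points, and treating the sender and the receiver as one set; grouping the message text data by set; and sorting the data in each group by the time the messages were sent and displaying it in a predetermined format, forming session-level session text data including quality check points.
- The computer-readable storage medium according to claim 16, characterized in that the updated data includes multiple pieces of message text data, the quality check point result of the specific message content of each piece of message text data, and quality check violation remarks, and the preset format includes: the specific message content corresponding to the message text data, the quality check point result of the specific message content of each piece of message text data, and the quality check violation remark; and, when the processor performs the extracting of a data set from the updated data according to a preset format, the following steps are specifically implemented: for each piece of message text data, determining whether the quality check point result of its specific message content contains multiple quality check points; if so, separating the quality check points, the specific message content of the message text data corresponding to each quality check point, and the corresponding quality check violation remarks from one another and pairing them up, forming data that has a single quality check point as its quality check point result and conforms to the preset format; and extracting, as the data set, the specific message content corresponding to the message text data, the quality check point result of that specific message content, and the quality check violation remark.
- A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program, the computer program including program instructions that, when executed by a processor, implement the following steps: training a classification model with the data set extracted by the method according to claim 1, and using the trained classification model to classify message-level dialogue text data that has not been quality-checked, obtaining quality check points and marking them to obtain a quality check result; updating the quality check result according to a user's modification request for the quality check points in the dialogue text data; updating the classification model according to the updated data; and using the updated classification model to classify message-level dialogue text data that has not been quality-checked, obtaining quality check points and marking them to obtain a quality check result.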
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810241227.3 | 2018-03-22 | ||
CN201810241227.3A CN108491388B (zh) | 2018-03-22 | 2018-03-22 | 数据集获取方法、分类方法、装置、设备及存储介质 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019179010A1 true WO2019179010A1 (zh) | 2019-09-26 |
Family
ID=63319304
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2018/100779 WO2019179010A1 (zh) | 2018-03-22 | 2018-08-16 | 数据集获取方法、分类方法、装置、设备及存储介质 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN108491388B (zh) |
WO (1) | WO2019179010A1 (zh) |
Families Citing this family (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109582833B (zh) * | 2018-11-06 | 2023-09-22 | 创新先进技术有限公司 | 异常文本检测方法及装置 |
CN109740760B (zh) * | 2018-12-25 | 2024-04-05 | 平安科技(深圳)有限公司 | 文本质检自动化训练方法、电子装置及计算机设备 |
CN109815487B (zh) * | 2018-12-25 | 2023-04-18 | 平安科技(深圳)有限公司 | 文本质检方法、电子装置、计算机设备及存储介质 |
CN109729383B (zh) * | 2019-01-04 | 2021-11-02 | 深圳壹账通智能科技有限公司 | 双录视频质量检测方法、装置、计算机设备和存储介质 |
CN109831665B (zh) * | 2019-01-16 | 2022-07-08 | 深圳壹账通智能科技有限公司 | 一种视频质检方法、系统及终端设备 |
CN109815717A (zh) * | 2019-01-17 | 2019-05-28 | 平安科技(深圳)有限公司 | 数据权限管理方法、数据访问方法、装置、设备及介质 |
CN111538809B (zh) * | 2020-04-20 | 2021-03-16 | 马上消费金融股份有限公司 | 一种语音服务质量检测方法、模型训练方法及装置 |
CN111988479B (zh) * | 2020-08-20 | 2021-04-20 | 浙江企蜂信息技术有限公司 | 通话信息处理方法、装置、计算机设备及存储介质 |
CN112468658B (zh) * | 2020-11-20 | 2022-10-25 | 平安普惠企业管理有限公司 | 语音质量检测方法、装置、计算机设备及存储介质 |
CN114707833A (zh) * | 2022-03-24 | 2022-07-05 | 深圳追一科技有限公司 | 会话质检方法、装置、计算机设备和存储介质 |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140316765A1 (en) * | 2013-04-23 | 2014-10-23 | International Business Machines Corporation | Preventing frustration in online chat communication |
CN105184315A (zh) * | 2015-08-26 | 2015-12-23 | 北京中电普华信息技术有限公司 | 一种质检处理方法及系统 |
CN105991849A (zh) * | 2015-02-13 | 2016-10-05 | 华为技术有限公司 | 一种坐席服务方法、装置及系统 |
CN106294355A (zh) * | 2015-05-14 | 2017-01-04 | 阿里巴巴集团控股有限公司 | 一种业务对象属性的确定方法及设备 |
CN106776832A (zh) * | 2016-11-25 | 2017-05-31 | 上海智臻智能网络科技股份有限公司 | 用于问答交互日志的处理方法、装置及系统 |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8862461B2 (en) * | 2011-11-30 | 2014-10-14 | Match.Com, Lp | Fraud detection using text analysis |
CN102799579B (zh) * | 2012-07-18 | 2015-01-21 | 西安理工大学 | 具有错误自诊断和自纠错功能的统计机器翻译方法 |
CN106407211B (zh) * | 2015-07-30 | 2019-08-06 | 富士通株式会社 | 对实体词的语义关系进行分类的方法和装置 |
CN105187674B (zh) * | 2015-08-14 | 2020-02-14 | 上海银赛计算机科技有限公司 | 服务录音的合规检查方法及装置 |
CN105141787A (zh) * | 2015-08-14 | 2015-12-09 | 上海银天下科技有限公司 | 服务录音的合规检查方法及装置 |
CN105791543A (zh) * | 2016-02-23 | 2016-07-20 | 北京奇虎科技有限公司 | 一种清理短信的方法、装置、客户端和系统 |
CN105912607A (zh) * | 2016-04-06 | 2016-08-31 | 普强信息技术(北京)有限公司 | 一种基于文法规则的分类方法 |
CN106776806A (zh) * | 2016-11-22 | 2017-05-31 | 广东电网有限责任公司佛山供电局 | 呼叫中心质检语音的评分方法和系统 |
CN107204195A (zh) * | 2017-05-19 | 2017-09-26 | 四川新网银行股份有限公司 | 一种基于情绪分析的智能质检方法 |
CN107491433A (zh) * | 2017-07-24 | 2017-12-19 | 成都知数科技有限公司 | 基于深度学习的电商异常金融商品识别方法 |
CN107547527A (zh) * | 2017-08-18 | 2018-01-05 | 上海二三四五金融科技有限公司 | 一种语音质检金融安全控制系统及控制方法 |
CN110956956A (zh) * | 2019-12-13 | 2020-04-03 | 集奥聚合(北京)人工智能科技有限公司 | 基于策略规则的语音识别方法及装置 |
-
2018
- 2018-03-22 CN CN201810241227.3A patent/CN108491388B/zh active Active
- 2018-08-16 WO PCT/CN2018/100779 patent/WO2019179010A1/zh active Application Filing
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140316765A1 (en) * | 2013-04-23 | 2014-10-23 | International Business Machines Corporation | Preventing frustration in online chat communication |
CN105991849A (zh) * | 2015-02-13 | 2016-10-05 | 华为技术有限公司 | 一种坐席服务方法、装置及系统 |
CN106294355A (zh) * | 2015-05-14 | 2017-01-04 | 阿里巴巴集团控股有限公司 | 一种业务对象属性的确定方法及设备 |
CN105184315A (zh) * | 2015-08-26 | 2015-12-23 | 北京中电普华信息技术有限公司 | 一种质检处理方法及系统 |
CN106776832A (zh) * | 2016-11-25 | 2017-05-31 | 上海智臻智能网络科技股份有限公司 | 用于问答交互日志的处理方法、装置及系统 |
Also Published As
Publication number | Publication date |
---|---|
CN108491388A (zh) | 2018-09-04 |
CN108491388B (zh) | 2021-02-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2019179010A1 (zh) | 数据集获取方法、分类方法、装置、设备及存储介质 | |
WO2019179022A1 (zh) | 文本数据质检方法、装置、设备及计算机可读存储介质 | |
US20230013306A1 (en) | Sensitive Data Classification | |
WO2019200806A1 (zh) | 文本分类模型的生成装置、方法及计算机可读存储介质 | |
CN109800320B (zh) | 一种图像处理方法、设备及计算机可读存储介质 | |
US20220004878A1 (en) | Systems and methods for synthetic document and data generation | |
US10216838B1 (en) | Generating and applying data extraction templates | |
CN108182245A (zh) | 人对象属性分类知识图谱的构建方法及装置 | |
CN110929125A (zh) | 搜索召回方法、装置、设备及其存储介质 | |
US10216837B1 (en) | Selecting pattern matching segments for electronic communication clustering | |
CA3177671A1 (en) | Enquiring method and device based on vertical search, computer equipment and storage medium | |
CN104778283B (zh) | 一种基于微博的用户职业分类方法及系统 | |
CN109508373A (zh) | 企业舆情指数的计算方法、设备及计算机可读存储介质 | |
CN114153962A (zh) | 一种数据匹配方法、装置及电子设备 | |
CN111177367A (zh) | 案件分类方法、分类模型训练方法及相关产品 | |
CN110020430B (zh) | 一种恶意信息识别方法、装置、设备及存储介质 | |
CN107918618A (zh) | 数据处理方法及装置 | |
CN113268615A (zh) | 资源标签生成方法、装置、电子设备及存储介质 | |
CN110765760A (zh) | 一种法律案件分配方法、装置、存储介质和服务器 | |
WO2021154429A1 (en) | Siamese neural networks for flagging training data in text-based machine learning | |
US11321531B2 (en) | Systems and methods of updating computer modeled processes based on real time external data | |
CN114092948A (zh) | 一种票据识别方法、装置、设备以及存储介质 | |
CN107688594B (zh) | 基于社交信息的风险事件的识别系统及方法 | |
CN117114142A (zh) | 基于ai的数据规则表达式生成方法、装置、设备及介质 | |
CN108171589A (zh) | 验证方法及装置 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 18911039 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 13/01/2021) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 18911039 Country of ref document: EP Kind code of ref document: A1 |