WO2019179022A1 - Text data quality inspection method, apparatus, device, and computer-readable storage medium - Google Patents

Text data quality inspection method, apparatus, device, and computer-readable storage medium Download PDF

Info

Publication number
WO2019179022A1
Authority
WO
WIPO (PCT)
Prior art keywords
text data
quality
neural network
preset
network model
Prior art date
Application number
PCT/CN2018/102069
Other languages
English (en)
French (fr)
Inventor
张雨嘉
任鹏飞
倪振
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2019179022A1 publication Critical patent/WO2019179022A1/zh

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/33: Querying
    • G06F16/35: Clustering; Classification
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • the present application relates to the field of data processing technologies, and in particular, to a text data quality inspection method, apparatus, device, and computer readable storage medium.
  • During agent sales, a large amount of dialogue text may be generated with customers, and this dialogue text is saved on the agent-sales platform. To analyze the dialogue text, the method currently used is to randomly extract a certain number of text entries and then analyze them manually. However, the results randomly extracted by the machine may contain a large amount of compliant text. This not only makes analysts inefficient but also misses a large amount of text. If the missed text contains non-compliant content (places where there is a mistake), such as a major violation, it will cause customer dissatisfaction and no small impact.
  • The embodiments of the present application provide a text data quality inspection method, apparatus, device, and computer-readable storage medium, in which the quality inspection results obtained by matching with a full-text search engine and the quality inspection results classified by a preset neural network model are combined according to a preset rule, to improve quality inspection efficiency and accuracy.
  • In a first aspect, an embodiment of the present application provides a text data quality inspection method, where the method includes:
  • obtaining message-level dialog text data and session-level session text data; preprocessing the dialog text data and the session text data; querying, according to preset quality checkpoints and the rules corresponding to the checkpoints, matching quality checkpoints from the preprocessed dialog text data and session text data by using a full-text search engine, and marking them in the preprocessed session text data; classifying the data in the preprocessed dialog text data and session text data by using a preset neural network model, and marking the classified quality checkpoints in the preprocessed session text data; and, in the preprocessed session text data, integrating the quality checkpoint results marked by the full-text search engine with the quality checkpoint results marked by the preset neural network model according to a preset rule, and using the integrated quality checkpoint results as the final quality checkpoint results.
  • In a second aspect, an embodiment of the present application provides a text data quality inspection apparatus, the apparatus including units for performing the text data quality inspection method according to the first aspect.
  • In a third aspect, an embodiment of the present application provides a computer device, where the computer device includes a memory and a processor connected to the memory; the memory is configured to store a computer program, and the processor is configured to run the computer program stored in the memory to perform the text data quality inspection method described in the first aspect above.
  • In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where the computer-readable storage medium stores a computer program, the computer program includes program instructions, and the program instructions, when executed by a processor, implement the text data quality inspection method described in the first aspect above.
  • In the embodiments of the present application, the quality inspection results obtained by matching with the full-text search engine and the quality inspection results classified by the preset neural network model may be combined according to a preset rule, to improve the efficiency and accuracy of text data quality inspection.
  • FIG. 1 is a schematic flowchart diagram of a text data quality inspection method according to an embodiment of the present application
  • FIG. 2 is a schematic diagram of a sub-flow of a text data quality inspection method according to an embodiment of the present application
  • FIG. 3 is a schematic diagram of another sub-flow of a text data quality inspection method according to an embodiment of the present application.
  • FIG. 4 is a schematic diagram of another sub-flow of a text data quality inspection method according to an embodiment of the present application.
  • FIG. 5 is a schematic diagram of another sub-flow of a text data quality inspection method according to an embodiment of the present application.
  • FIG. 6 is a schematic block diagram of a text data quality inspection apparatus according to an embodiment of the present application.
  • FIG. 7 is a schematic block diagram of a first marking unit provided by an embodiment of the present application.
  • FIG. 8 is a schematic block diagram of a training unit provided by an embodiment of the present application.
  • FIG. 9 is a schematic block diagram of a second marking unit provided by an embodiment of the present application.
  • FIG. 10 is a schematic block diagram of an integration unit provided by an embodiment of the present application.
  • FIG. 11 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • Although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish the elements from one another.
  • For example, without departing from the scope of the present application, a first acquisition unit may be referred to as a second acquisition unit, and similarly, a second acquisition unit may be referred to as a first acquisition unit.
  • The first acquisition unit and the second acquisition unit are both acquisition units, but they are not the same acquisition unit.
  • FIG. 1 is a schematic flowchart diagram of a text data quality inspection method according to an embodiment of the present application. The method includes the following steps S101-S106.
  • S101: Obtain message-level dialog text data and session-level session text data.
  • The message-level dialog text data is obtained from the agent-sales platform and stores the dialogue text between agents and clients.
  • The dialog text data belongs to the message level, which can be understood as data saved in units of the messages sent between the agent and the client. The dialog text data is composed of many pieces of message text data, and each piece of message text data includes a message number, a sender, a recipient, the specific message content, the time the message was sent, and so on.
  • Session-level session text data is understood as data saved in units of a conversation (session) between the agent and the client; that is, the session text data holds multiple pieces of dialog data between agents and clients, where each piece of dialog data includes a dialog number and dialog content, and each dialog content corresponds to multiple pieces of message text data.
  • Because message-level dialog text data is saved in units of messages, it is scattered and unordered, with no context and no participant relationships, which is inconvenient for users to view.
  • The session-level session text data can be obtained by processing the message-level dialog text data. The specific processing includes: preprocessing the message-level dialog text data, for example de-duplication; finding the sender and recipient in each piece of message text data in the preprocessed message-level dialog text data; taking the sender and recipient as a set and grouping the message text data in the dialog text data according to that set, so that message text data with the same set is divided into one group; in this way the data is divided into multiple groups, which means that the sender and recipient in one group are the two people in the same conversation, and different conversations fall into different groups; sorting the message text data in each group in chronological order; and displaying the sorted message text data in a predetermined format. A sketch of this step is given after the format example below.
  • The predetermined format may be: the time the message was sent [space] the sender [colon] the specific message content, e.g. 2017-01-01 12:01:02 张三: 李老师，在吗？
  • It can be understood that the session-level session text data consists of multiple pieces of dialog data, in units of conversations, obtained by arranging the message text data of the message-level dialog text data according to chronological order and the sender/recipient information.
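  • As a non-authoritative illustration of the grouping-and-formatting step just described, the sketch below assembles session-level text from message-level records; the field names (time, sender, receiver, content) are assumptions for illustration, not the patent's schema.

```python
# A minimal sketch, assuming message records with time/sender/receiver/content
# fields: group by the unordered {sender, receiver} pair (same pair = same
# conversation), sort each group chronologically, and render each message as
# "time sender: content".
from collections import defaultdict

def to_sessions(messages):
    groups = defaultdict(list)
    for m in messages:
        key = frozenset((m["sender"], m["receiver"]))
        groups[key].append(m)
    sessions = []
    for msgs in groups.values():
        msgs.sort(key=lambda m: m["time"])
        sessions.append("\n".join(
            f'{m["time"]} {m["sender"]}: {m["content"]}' for m in msgs))
    return sessions

messages = [{"time": "2017-01-01 12:01:02", "sender": "张三",
             "receiver": "李老师", "content": "李老师，在吗？"}]
print(to_sessions(messages)[0])  # 2017-01-01 12:01:02 张三: 李老师，在吗？
```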
  • The message-level dialog text data and the session-level session text data are stored in a database, such as an Oracle database, in the form of data tables.
  • Depending on the amount of data, the message-level dialog text data and the session-level session text data can each be saved as multiple data tables or as a single data table.
  • S102: Preprocess the dialog text data and the session text data.
  • Methods of preprocessing include replacement, filtering, and the like.
  • Replacement includes replacing the English in the corresponding message text data in the dialog text data and session text data with Chinese, and so on; filtering includes filtering out the digits, punctuation marks, emoticons, and garbled characters in the corresponding message text data in the dialog text data and session text data.
  • The message text data in the dialog text data and session text data is preprocessed so as to preserve the plain-text message in the specific message content of the message text data, which facilitates subsequent processing. A sketch of such preprocessing is given below.
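  • The sketch below is illustrative only: the replacement table and the keep-CJK-only filter are assumptions, since the patent does not fix a concrete rule set.

```python
# Illustrative replacement and filtering: map an English token to Chinese,
# then drop digits, punctuation, emoticons and garbled symbols by keeping
# only CJK characters. Real deployments would use a fuller replacement table.
import re

def preprocess(text: str) -> str:
    text = text.replace("OK", "好的")             # hypothetical replacement rule
    return re.sub(r"[^\u4e00-\u9fff]", "", text)  # keep Chinese characters only

print(preprocess("OK！基金123分红😀"))  # -> 好的基金分红
```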
  • S103: According to the preset quality checkpoints and the rules corresponding to the checkpoints, query matching quality checkpoints from the preprocessed dialog text data and session text data by using a full-text search engine, and mark them.
  • the quality inspection point can be understood as a place of non-compliance or violation, that is, where there is a mistake.
  • Each quality inspection point has a quality inspection point identifier, such as A47, which represents the 47th quality inspection point in category A.
  • the rules corresponding to the quality inspection point include keywords and logic operations.
  • An example of a quality checkpoint and its corresponding rule: A47, fund AND dividend.
  • Here the keywords are "fund" and "dividend", and the logical operation is AND.
  • The A47 quality checkpoint indicates that if "fund" and "dividend" both appear in one message, the message is considered a violation. It can be understood that the fund product does not involve dividends, or that when speaking of the fund product one is not expected to speak of dividends; if a message mentions both the fund and dividends at the same time, that message is non-compliant, that is, an error occurs.
  • The full-text search engine refers to the ElasticSearch (ES for short) search engine. ES uses the keywords, and the rules corresponding to the quality checkpoints are implemented by combining ES's must, should, and must_not API clauses; queries are then run against the dialog text data and the session text data to find matching quality checkpoints and mark them, where the marking uses the quality checkpoint identifier.
  • Querying is performed from the dialog text data and the session text data separately. It can be understood that some quality checkpoints can be expressed by message-level text, in which case the query is run against the dialog text data; other quality checkpoints cannot be obtained from a single message-level text but require several preceding and following message-level texts, in which case the query must be run against the session text data.
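  • As a hedged sketch of how a rule such as "A47: fund AND dividend" might be expressed with these ES clauses (the index name dialog_text, the field names content and checkpoint, and the client setup are assumptions for illustration, not taken from the patent):

```python
# A minimal sketch using the elasticsearch Python client: a bool query whose
# "must" clause requires both keywords in the same message, mirroring the
# AND rule of checkpoint A47; matches are then marked with the identifier.
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

rule_a47 = {"bool": {"must": [
    {"match_phrase": {"content": "基金"}},   # "fund"
    {"match_phrase": {"content": "分红"}},   # "dividend"
]}}                                          # should / must_not combine similarly

resp = es.search(index="dialog_text", query=rule_a47, size=100)
for hit in resp["hits"]["hits"]:
    # mark the matching message with the checkpoint identifier
    es.update(index="dialog_text", id=hit["_id"], doc={"checkpoint": "A47"})
```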
  • the step S103 includes the following steps S201-S203.
  • S201: Segment the preprocessed dialog text data and session text data. The word segmentation in the full-text search engine splits the specific message content of each piece of message text data into multiple words; for example, the message "我来到北京清华大学" ("I came to Beijing Tsinghua University") is segmented into "我/来到/北京/清华大学".
  • S202: Build an inverted index on the segmented data. Specifically, count the number of times each segmented word appears in the dialog text data and the session text data, and its positions; then build an inverted index on the segmented words according to the occurrence counts and positions. For example, count the occurrences and positions of the word "dividend" in the dialog text data and the session text data, where a position in the dialog text data includes which dialog text data table and which message text data (which can be represented by the message number), and a position in the session text data includes which session text data table and which conversation (which can be represented by the dialog number), and so on.
  • The inverted index is a storage form that implements the "word-document matrix".
  • Through the inverted index, the "document list" containing a word can be quickly obtained from the word. For example, in the session text data, the inverted index makes it possible to quickly obtain, from a segmented word, the list of conversations containing that word, i.e., which conversations the word appears in, as the toy sketch below illustrates.
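  • The document ids and postings layout in this sketch are illustrative assumptions; ES maintains its own index internally.

```python
# Build a word -> [(document id, position), ...] mapping, so that the
# "document list" for a word such as 分红 (dividend) is a single lookup.
from collections import defaultdict

def build_inverted_index(docs):
    index = defaultdict(list)
    for doc_id, words in docs.items():
        for pos, word in enumerate(words):
            index[word].append((doc_id, pos))
    return index

docs = {"msg-001": ["基金", "分红", "基金"], "msg-002": ["保险", "分红"]}
index = build_inverted_index(docs)
print(index["分红"])  # [('msg-001', 1), ('msg-002', 1)]
```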
  • S203: According to the preset quality checkpoints and the rules corresponding to the checkpoints, query matching quality checkpoints from the dialog text table and the session text table by using the built inverted index and the full-text search engine, and mark them.
  • After a matching quality checkpoint is found according to the rule corresponding to the checkpoint, the quality checkpoint is marked.
  • Marking in the dialog text data can be understood as adding the corresponding quality checkpoint mark after each piece of message text data in the dialog text data, and marking in the session text data can be understood as adding the corresponding quality checkpoint mark after the multiple pieces of message text data corresponding to each dialog content in the session text data.
  • After the inverted index is built, query matching is faster; even with a very large amount of data, the querying, matching, and marking of quality checkpoints can still be completed quickly.
  • S104 Select one of the trained neural network models as a preset neural network model.
  • selecting one of the trained neural network models as the preset neural network model includes: selecting the neural network model with the highest classification accuracy rate from a plurality of different trained neural network models as the preset neural network model.
  • The multiple neural network models may be a Long Short-Term Memory (LSTM) model (Model 1 for short), an LSTM model combined with an attention mechanism (Model 2 for short), a Bidirectional Long Short-Term Memory (BLSTM) model (Model 3 for short), and a BLSTM model combined with an attention mechanism (Model 4 for short), and may also be other applicable neural network models.
  • After the trained Model 1, Model 2, Model 3, and Model 4 are obtained, a test sample set is obtained; for example, a certain number of dialog text data and session text data containing quality checkpoints may be obtained in advance as the test sample set.
  • The dialog text data and session text data containing quality checkpoints can be obtained by marking dialog text data and session text data with the full-text search engine and saving the result, and the results of manual quality inspection can also be combined. Using the test sample set, Model 1, Model 2, Model 3, and Model 4 classify the quality checkpoints and mark the classified checkpoints.
  • The classification accuracy of each model is then calculated, where the accuracy formula is (A_i ∩ B)/C, in which C denotes the number of manually reviewed samples, A_i denotes the number of checkpoints marked by model i among the manually reviewed samples, B denotes the number of checkpoints marked manually among the manually reviewed samples, and i ∈ {1, 2, 3, 4} indexes the corresponding model; the model with the higher classification accuracy is used as the preset neural network model.
  • It can be understood that a model with a higher classification accuracy is selected from multiple different neural network models as the preset neural network model to classify the dialog text data and the session text data and to mark the classified quality checkpoints, so as to improve the accuracy of the model in classifying quality checkpoints; a sketch of this selection step follows.
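  • In the sketch below, the sample ids and counts are hypothetical; only the formula (A_i ∩ B)/C from the text is assumed.

```python
# Compute each model's accuracy |A_i ∩ B| / C over the manually reviewed
# samples and keep the best-scoring model as the preset neural network model.
def accuracy(model_flags: set, human_flags: set, n_reviewed: int) -> float:
    return len(model_flags & human_flags) / n_reviewed

human = {3, 7, 12, 25, 40}                      # hypothetical manual checkpoints
models = {"model1": {3, 7, 40, 55}, "model2": {3, 7, 12, 25},
          "model3": {7, 12}, "model4": {3, 7, 12, 25, 40}}
best = max(models, key=lambda m: accuracy(models[m], human, n_reviewed=100))
print(best)  # the model with the highest (A_i ∩ B)/C
```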
  • In other embodiments, before the neural network model with the highest classification accuracy is selected from multiple different trained neural network models as the preset neural network model, the method further includes training the multiple different neural network models; for example, a certain number of dialog text data and session text data containing quality checkpoints may be obtained in advance as a training sample set.
  • It should be noted that the training sample set does not contain the same data as the test sample set, and in general the training sample set is larger than the test sample set.
  • Using the training sample set, Model 1, Model 2, Model 3, and Model 4 are trained simultaneously to obtain the trained Model 1, Model 2, Model 3, and Model 4.
  • It should be pointed out that, in other embodiments, selecting one of the trained neural network models as the preset neural network model includes: using a single trained neural network model as the preset neural network model. Understandably, only one neural network model is selected for training, and the trained model is used as the preset neural network model; in this way there is no need to select the model with the highest classification accuracy from multiple neural network models. However, whether one neural network model is used or the model with the highest classification accuracy is selected from multiple models, the process of training the neural network model is required.
  • the process of training the neural network model includes the following steps S301-S304.
  • S301: Obtain dialog text data and session text data containing quality checkpoints; if the data has not been preprocessed, preprocessing (replacement, filtering, etc.) is also required. The obtained data serves as the training sample set.
  • S302: Use a word segmentation tool to segment the text information in the dialog text data and session text data containing quality checkpoints. The word segmentation tool may be Jieba ("stutter") segmentation, which splits each piece of message text data in the dialog text data and the session text data into multiple words.
  • Jieba segmentation supports three modes: first, the precise mode, which tries to cut the sentence as accurately as possible and is suitable for text analysis; second, the full mode, which scans out all the words in the sentence that can form words and is very fast but cannot resolve ambiguity; third, the search-engine mode, which, on the basis of the precise mode, splits long words again to improve recall and is suitable for search-engine segmentation. In this embodiment, the precise mode is used to segment the dialog text data and session text data containing quality checkpoints.
  • Each piece of message text data in the dialog text data is split into multiple words by segmentation; for example, the message "我来到北京清华大学" is segmented into "我/来到/北京/清华大学", as the short sketch below reproduces.
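  • The three Jieba modes can be reproduced directly with the jieba package, on the example message from the text:

```python
# Jieba's three segmentation modes on the sample sentence 我来到北京清华大学.
import jieba

sentence = "我来到北京清华大学"
print("/".join(jieba.cut(sentence, cut_all=False)))  # precise mode
print("/".join(jieba.cut(sentence, cut_all=True)))   # full mode
print("/".join(jieba.cut_for_search(sentence)))      # search-engine mode
```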
  • S303: Process the segmented data with a preset word vector model to obtain the corresponding word vectors. The word embedding model here refers to gensim's word2vec word vector model.
  • gensim is a tool for mining the semantic structure of documents by modeling the patterns of phrases (or higher-level structures, such as whole sentences or documents).
  • gensim takes a collection of text documents as input and generates a "vector" to represent the text content of the corpus, thereby implementing semantic mining; this vector representation can be used to train a "model".
  • word2vec is a gensim "model" that can compute word vectors.
  • word2vec is actually a shallow neural network; it can train efficiently on dictionaries of millions of words and datasets of hundreds of millions of items, and the training result is word vectors, which measure word-to-word similarity well.
  • The preset word vector model is obtained by pre-training.
  • The process of training the word vector model is as follows: obtain dialog text data and session text data containing quality checkpoints, where if the data has not been preprocessed, preprocessing (replacement, filtering, etc.) is also required.
  • Then segment the dialog text data and session text data containing quality checkpoints, where the precise mode of the Jieba segmentation tool can be used; set the parameters for training the word2vec word vector model, such as the minimum count min_count=5 (words occurring fewer than 5 times are discarded), the number of units in the hidden layer of the neural network size=128, and the number of iterations iterator=5; and use the segmented data as the training data set to train the word2vec model and obtain the preset word vector model.
  • It should be pointed out that the amount of data used to train the word vector model is usually very large, much larger than the amount of data needed to train the neural network model.
  • In practice, a large amount of dialog text data and session text data containing quality checkpoints may first be segmented, and the segmented data used to train the word vector model to obtain the preset word vector model; then a portion of the segmented data is taken to train the neural network model to obtain the preset neural network model.
  • Understandably, the data for training the word vector model and the data for training the neural network model may be the same batch of data; in other embodiments they may be different batches, that is, the data for training the word vector model and the data for training the neural network model are not the same. A training sketch follows.
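  • A minimal training sketch with the parameters mentioned above; the tiny corpus is a placeholder (real corpora are far larger), and the parameter names follow gensim 4.x (vector_size/epochs; older gensim releases used size/iter):

```python
# Train gensim's word2vec with min_count=5, a 128-unit hidden layer and
# 5 iterations, as described in the text, then persist the model.
from gensim.models import Word2Vec

corpus = [["我", "来到", "北京", "清华大学"],   # placeholder segmented messages
          ["基金", "不", "涉及", "分红"]]

model = Word2Vec(sentences=corpus, vector_size=128, min_count=5,
                 epochs=5, workers=4)
model.save("qc_word2vec.model")
if "基金" in model.wv:                # min_count=5 prunes rare words, so guard
    print(model.wv["基金"].shape)     # (128,) word vector
```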
  • S304: Train the neural network model according to the word vectors and the corresponding quality checkpoints. Specifically, the word vectors and the corresponding quality checkpoints are input to train the neural network: if the neural network model is an LSTM model, the LSTM network is trained; if it is a BLSTM model, the BLSTM network is trained. The data output by each node of the neural network is input to an average pooling layer to fuse the results of the individual nodes; the data after the average pooling layer is then input to a softmax function to obtain the classification result, so that ultimately the obtained classification results agree with the marked quality checkpoint results as much as possible.
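  • A minimal PyTorch sketch of this architecture (BLSTM per-step outputs, average pooling, softmax classifier) is given below; the vocabulary size, class count, and batch are hypothetical, and this is an illustration of the described structure rather than the patent's exact implementation:

```python
# Bidirectional LSTM whose per-step outputs are average-pooled and classified
# with softmax (applied inside CrossEntropyLoss) over checkpoint classes.
import torch
import torch.nn as nn

class BLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden=128, n_classes=48):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)  # could load word2vec
        self.lstm = nn.LSTM(embed_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.fc = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                   # x: (batch, seq_len) word ids
        out, _ = self.lstm(self.embed(x))   # (batch, seq_len, 2*hidden)
        pooled = out.mean(dim=1)            # average pooling over time steps
        return self.fc(pooled)              # class logits

model = BLSTMClassifier(vocab_size=50000)
criterion = nn.CrossEntropyLoss()           # softmax + negative log-likelihood
optimizer = torch.optim.Adam(model.parameters())

x = torch.randint(0, 50000, (32, 40))       # hypothetical batch of word ids
y = torch.randint(0, 48, (32,))             # hypothetical checkpoint labels
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```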
  • It should be pointed out that, at regular intervals, such as one week or half a month, the preset neural network model or the preset word vector model needs to be updated. Specifically, new dialog text data and session text data containing quality checkpoints are used as input for the update, so that the preset neural network model or word vector model can always adapt to changes in new data.
  • S105: Classify the data in the preprocessed dialog text data and session text data by using the preset neural network model, and mark the classified quality checkpoints. In an embodiment, as shown in FIG. 4, the step S105 includes the following steps S401-S404.
  • S401: Obtain the preprocessed dialog text data and session text data. S402: Use the word segmentation tool to segment the text information in the preprocessed dialog text data and session text data; the segmentation tool may be Jieba segmentation, and specifically the precise mode of Jieba segmentation is used to split each piece of message text data in the dialog text data and the session text data into multiple words. S403: Process the segmented data with the preset word vector model to obtain the corresponding word vectors.
  • S404: Classify according to the corresponding word vectors by using the preset neural network model, obtain the classified quality checkpoints, and mark the checkpoints. Marking in the dialog text data can be understood as adding the corresponding quality checkpoint mark after each piece of message text data in the dialog text data, and marking in the session text data can be understood as adding the corresponding quality checkpoint mark after the multiple pieces of message text data corresponding to each dialog content in the session text data.
  • S106 Integrate the quality checkpoint result marked by the full-text search engine with the quality checkpoint result marked by the preset neural network model according to a preset rule, and use the integrated quality checkpoint result as the final quality checkpoint result.
  • the step S106 includes the following steps S501-S504.
  • S501: Merge the quality checkpoints marked by the full-text search engine in the preprocessed dialog text data and session text data to obtain the quality checkpoint results marked by the full-text search engine.
  • Specifically, the dialog text data containing the quality checkpoints marked by the full-text search engine is processed to obtain processed session text data, and the processed session text data is merged with the session text data containing the quality checkpoints marked by the full-text search engine, to obtain the quality checkpoint results marked by the full-text search engine, where the processed session text data carries quality checkpoints marked in units of message text data.
  • The quality checkpoint results marked by the full-text search engine therefore include quality checkpoints marked in units of message text data as well as quality checkpoints marked in units of conversations.
  • The quality checkpoint results marked by the full-text search engine are ultimately displayed in the session text data; it can be understood that display in the session text data makes it convenient for the user to consult the related content further.
  • The processing mainly includes: finding the sender and recipient in each piece of message text data in the dialog text data containing the quality checkpoints marked by the full-text search engine; taking the sender and recipient as a set and grouping the message text data in the dialog text data according to that set; sorting the message text data in each group in chronological order; and displaying the sorted message text data in the predetermined format, for example: the time the message was sent [space] the sender [colon] the specific message content.
  • It can be understood that the processed session text data consists of multiple pieces of dialog data, in units of conversations, obtained by arranging the message text data of the dialog text data in chronological order.
  • S502: Merge the quality checkpoints marked by the preset neural network model in the preprocessed dialog text data and session text data to obtain the quality checkpoint results marked by the preset neural network model.
  • Specifically, the dialog text data containing the quality checkpoints marked by the preset neural network model is processed to obtain processed session text data, and the processed session text data is merged with the session text data containing the quality checkpoints marked by the preset neural network model, to obtain the quality checkpoint results marked by the preset neural network model, where the processed session text data carries quality checkpoints marked in units of message text data.
  • The quality checkpoint results marked by the preset neural network model include quality checkpoints marked in units of message text data as well as quality checkpoints marked in units of conversations.
  • The quality checkpoint results marked by the preset neural network model are ultimately displayed in the session text data; it can be understood that display in the session text data makes it convenient for the user to consult the related content further.
  • The processing is as described above and is not repeated here.
  • S503 Integrate the quality checkpoint result marked by the full-text search engine with the quality checkpoint result marked by the preset neural network model according to a preset rule.
  • The preset rule may be merging, which can be understood as a logical AND operation; that is, a logical AND operation is performed between the quality checkpoint results marked by the full-text search engine and the quality checkpoint results marked by the preset neural network model.
  • For example, for a certain piece of message text data in the dialog text data, if the quality checkpoint result marked by the full-text search engine is A47 and the quality checkpoint result marked by the preset neural network model is B16, then after the operation the quality checkpoint result of that message text data is A47, B16.
  • If the result marked by the full-text search engine is A47 and the result marked by the preset neural network model is empty, then after the operation the quality checkpoint result of that message text data is A47.
  • The preset rule may also be to select, of the two, the quality checkpoint result with the higher accuracy as the integrated quality checkpoint result.
  • If the accuracy of the quality checkpoint results marked by the full-text search engine is lower than the accuracy of those marked by the preset neural network model, the results marked by the preset neural network model are selected as the integrated quality checkpoint results; if the accuracy of the results marked by the full-text search engine is not lower than that of the results marked by the preset neural network model, the results marked by the full-text search engine are selected as the integrated quality checkpoint results. Both rules are sketched below.
  • S504: The integrated quality checkpoint results are used as the final quality checkpoint results.
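  • In the compact sketch below, the per-message dictionaries are illustrative assumptions about the data shape:

```python
# Rule (a): merge the checkpoint sets per message (A47 + B16 -> A47, B16;
# A47 + empty -> A47). Rule (b): keep only the historically more accurate
# source's results.
def integrate(es_marks, nn_marks, rule="merge", es_acc=0.0, nn_acc=0.0):
    if rule == "merge":
        keys = es_marks.keys() | nn_marks.keys()
        return {k: sorted(es_marks.get(k, set()) | nn_marks.get(k, set()))
                for k in keys}
    return nn_marks if es_acc < nn_acc else es_marks

es_marks = {"msg-001": {"A47"}}
nn_marks = {"msg-001": {"B16"}, "msg-002": {"C03"}}
print(integrate(es_marks, nn_marks))
# {'msg-001': ['A47', 'B16'], 'msg-002': ['C03']}
```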
  • In other embodiments, the quality checkpoints marked by the full-text search engine in the preprocessed dialog text data and the quality checkpoints marked by the preset neural network model in the preprocessed dialog text data may also be processed according to a preset rule.
  • Then the quality checkpoints marked by the full-text search engine in the preprocessed session text data and the quality checkpoints marked by the preset neural network model in the preprocessed session text data are processed according to the preset rule.
  • The two processed results are then merged to obtain the final quality checkpoint results.
  • In the above embodiments, by using the full-text search engine and the preset neural network model to mark quality checkpoints in the preprocessed dialog text and session text, the situation caused by random machine extraction and manual spot checks, in which only part of the data is processed and other possible quality checkpoints are missed, is avoided.
  • This scheme can process all the data, efficiently inspecting all of it and improving quality inspection efficiency.
  • At the same time, the full-text search engine and the preset neural network model, two different models, mark quality checkpoints in the preprocessed dialog text and session text, and the checkpoints from the two different models are integrated to find all possible quality checkpoints and improve quality inspection accuracy.
  • FIG. 6 is a schematic block diagram of a text data quality inspection apparatus according to an embodiment of the present application.
  • the device 60 includes an obtaining unit 601, a pre-processing unit 602, a first marking unit 603, a selecting unit 604, a second marking unit 605, an integrating unit 606, and a training unit 607.
  • The obtaining unit 601 is configured to obtain message-level dialog text data and session-level session text data.
  • the pre-processing unit 602 is configured to pre-process the conversation text data and the session text data. Methods of preprocessing include replacement, filtering, and the like.
  • The first marking unit 603 is configured to query, according to the preset quality checkpoints and the rules corresponding to the checkpoints, matching quality checkpoints from the preprocessed dialog text data and session text data by using the full-text search engine, and to mark them.
  • the first marking unit 603 includes a data word segmentation unit 701, an indexing unit 702, and a query marking unit 703.
  • the data segmentation unit 701 is configured to perform segmentation of the preprocessed dialog text data and the session text data.
  • the indexing unit 702 is configured to establish an inverted index on the data after the word segmentation.
  • The query marking unit 703 is configured to query, according to the preset quality checkpoints and the rules corresponding to the checkpoints, matching quality checkpoints from the dialog text table and the session text table by using the built inverted index and the full-text search engine, and to mark them.
  • the selecting unit 604 is configured to select one of the trained neural network models as the preset neural network model.
  • device 60 also includes a training unit 607.
  • the training unit 607 includes a first acquisition unit 801, a first word segmentation unit 802, a first word vector unit 803, and a model training unit 804.
  • the first obtaining unit 801 is configured to acquire dialog text data and session text data including the quality check point.
  • the first word segmentation unit 802 is configured to perform word segmentation on the dialogue text data containing the quality checkpoint and the text information in the conversation text data by using the word segmentation tool.
  • the first word vector unit 803 is configured to process the data after the word segmentation by using a preset word vector model to obtain a corresponding word vector.
  • the preset word vector model is obtained by pre-training.
  • the training unit further includes a preset word vector obtaining unit, and the preset word vector obtaining unit is configured to train the word vector model to obtain a preset word vector model.
  • the preset word vector obtaining unit includes a quality inspection data acquiring unit, a quality inspection data word segment unit, a setting unit, and a word vector training unit.
  • the model training unit 804 is configured to train the neural network model according to the word vector and the corresponding quality check point.
  • the second marking unit 605 is configured to classify the pre-processed dialog text data and the data in the session text data by using a preset neural network model and mark the classified quality check points.
  • the second marking unit 605 includes a second acquiring unit 901, a second word segmentation unit 902, a second word vector unit 903, and a classification unit 904.
  • the second obtaining unit 901 is configured to obtain the pre-processed dialog text data and the session text data.
  • The second word segmentation unit 902 is configured to use the word segmentation tool to segment the text information in the preprocessed dialog text data and session text data.
  • the second word vector unit 903 is configured to process the data after the word segmentation by using a preset word vector model to obtain a corresponding word vector.
  • the second marking unit further includes a preset word vector obtaining unit, and the preset word vector obtaining unit is configured to train the word vector model to obtain a preset word vector model.
  • the preset word vector obtaining unit includes a quality inspection data acquiring unit, a quality inspection data word dividing unit, a setting unit, and a word vector training unit. Specifically, please refer to the description of the preset word vector acquisition unit part in the training unit.
  • the classification unit 904 is configured to perform classification according to the corresponding word vector by using a preset neural network model, obtain the classified quality inspection points, and mark the quality inspection points.
  • The integration unit 606 is configured to integrate, according to a preset rule, the quality checkpoint results marked by the full-text search engine with the quality checkpoint results marked by the preset neural network model, and to use the integrated quality checkpoint results as the final quality checkpoint results.
  • the integration unit 606 includes a first merging unit 101, a second merging unit 102, a result integration unit 103, and a quality checkpoint result determining unit 104.
  • the first merging unit 101 is configured to combine the quality check points marked in the pre-processed dialog text data and the session text data by the full-text search engine to obtain the quality check point result marked by the full-text search engine.
  • the second merging unit 102 is configured to combine the pre-processed dialog text data and the quality check points marked in the session text data by using a preset neural network model to obtain a quality check point result marked by the preset neural network model.
  • the result integration unit 103 is configured to integrate the quality checkpoint result marked by the full-text search engine with the quality checkpoint result marked by the preset neural network model according to a preset rule.
  • the quality check point result determining unit 104 is configured to use the integrated quality check point result as the final quality check point result.
  • The above apparatus may be implemented in the form of a computer program, and the computer program can be run on a computer device as shown in FIG. 11.
  • FIG. 11 is a schematic block diagram of a computer device according to an embodiment of the present application.
  • the computer device 110 may be a portable device such as a mobile phone or a pad, or may be a non-portable device such as a desktop computer.
  • the device 110 includes a processor 112, a memory, and a network interface 113 that are coupled by a system bus 111, wherein the memory can include a non-volatile storage medium 114 and an internal memory 115.
  • The non-volatile storage medium 114 can store an operating system 1141 and a computer program 1142.
  • When the computer program 1142 is executed, it can cause the processor 112 to perform a text data quality inspection method.
  • the processor 112 is used to provide computing and control capabilities to support the operation of the entire device 110.
  • the internal memory 115 provides an environment for the operation of a computer program in a non-volatile storage medium that, when executed by the processor 112, causes the processor 112 to perform a text data quality inspection method.
  • The network interface 113 is used for network communication, such as acquiring data.
  • It will be understood by those skilled in the art that the structure shown in FIG. 11 is only a block diagram of the part of the structure related to the solution of the present application and does not constitute a limitation on the device 110 to which the solution is applied.
  • The specific device 110 may include more or fewer components than shown in the figure, or combine certain components, or have a different arrangement of components.
  • the processor 112 is configured to execute a computer program stored in a memory to implement any of the foregoing text data quality inspection methods.
  • It should be understood that the processor 112 may be a central processing unit (CPU), and the processor may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like.
  • the general purpose processor may be a microprocessor or the processor or any conventional processor or the like.
  • In another embodiment, a computer-readable storage medium is provided; the computer-readable storage medium stores a computer program, the computer program includes program instructions, and the program instructions, when executed by a processor, implement any of the foregoing embodiments of the text data quality inspection method.
  • the computer readable storage medium may be an internal storage unit of the terminal described in any of the foregoing embodiments, such as a hard disk or a memory of the terminal.
  • The computer-readable storage medium may also be an external storage device of the terminal, such as a plug-in hard disk, a Smart Media Card (SMC), or a Secure Digital (SD) card provided on the terminal.
  • the computer readable storage medium may also include both an internal storage unit of the terminal and an external storage device.
  • In the several embodiments provided in this application, it should be understood that the disclosed terminal and method may be implemented in other manners.
  • the terminal embodiment described above is only illustrative.
  • the division of the unit is only a logical function division, and the actual implementation may have another division manner.
  • a person skilled in the art can clearly understand that, for the convenience and brevity of the description, the specific working process of the terminal and the unit described above can be referred to the corresponding process in the foregoing method embodiment, and details are not described herein again.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A text data quality inspection method, apparatus, device, and computer-readable storage medium. The method includes: obtaining message-level dialog text data and session-level session text data and preprocessing them; according to preset quality checkpoints and the rules corresponding to the checkpoints, marking quality checkpoints in the preprocessed dialog text data and session text data by using a full-text search engine and a preset neural network model respectively; and integrating, according to a preset rule, the quality checkpoint results marked by the full-text search engine with the quality checkpoint results marked by the preset neural network model, as the final quality checkpoint results.

Description

Text data quality inspection method, apparatus, device, and computer-readable storage medium
This application claims priority to the Chinese patent application No. 201810240050.5, filed with the China Patent Office on March 22, 2018 and entitled "Text data quality inspection method, apparatus, device, and computer-readable storage medium", the entire contents of which are incorporated herein by reference.
TECHNICAL FIELD
This application relates to the field of data processing technologies, and in particular to a text data quality inspection method, apparatus, device, and computer-readable storage medium.
BACKGROUND
During agent sales, a large amount of dialogue text may be generated with customers, and this dialogue text is saved on the agent-sales platform. To analyze the dialogue text, the method currently used is to randomly extract a certain number of text entries and then analyze them manually. However, the results randomly extracted by the machine may contain a large amount of compliant text. This not only makes analysts inefficient but also misses a large amount of text. If the missed text contains non-compliant content (places where there is a mistake), such as a major violation, it will cause customer dissatisfaction and no small impact.
SUMMARY
The embodiments of the present application provide a text data quality inspection method, apparatus, device, and computer-readable storage medium, in which the quality inspection results obtained by matching with a full-text search engine and the quality inspection results classified by a preset neural network model can be combined according to a preset rule, to improve quality inspection efficiency and accuracy.
In a first aspect, an embodiment of the present application provides a text data quality inspection method, the method including:
obtaining message-level dialog text data and session-level session text data; preprocessing the dialog text data and the session text data; querying, according to preset quality checkpoints and the rules corresponding to the checkpoints, matching quality checkpoints from the preprocessed dialog text data and session text data by using a full-text search engine, and marking them in the preprocessed session text data; classifying the data in the preprocessed dialog text data and session text data by using a preset neural network model, and marking the classified quality checkpoints in the preprocessed session text data; and, in the preprocessed session text data, integrating the quality checkpoint results marked by the full-text search engine with the quality checkpoint results marked by the preset neural network model according to a preset rule, and using the integrated quality checkpoint results as the final quality checkpoint results.
In a second aspect, an embodiment of the present application provides a text data quality inspection apparatus, the apparatus including units for performing the text data quality inspection method described in the first aspect above.
In a third aspect, an embodiment of the present application provides a computer device, the computer device including a memory and a processor connected to the memory; the memory is configured to store a computer program, and the processor is configured to run the computer program stored in the memory to perform the text data quality inspection method described in the first aspect above.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, the computer-readable storage medium storing a computer program, the computer program including program instructions that, when executed by a processor, implement the text data quality inspection method described in the first aspect above.
The embodiments of the present application can combine, according to a preset rule, the quality inspection results obtained by matching with the full-text search engine and the quality inspection results classified by the preset neural network model, to improve the efficiency and accuracy of text data quality inspection.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic flowchart of a text data quality inspection method according to an embodiment of the present application;
FIG. 2 is a schematic sub-flowchart of a text data quality inspection method according to an embodiment of the present application;
FIG. 3 is another schematic sub-flowchart of a text data quality inspection method according to an embodiment of the present application;
FIG. 4 is another schematic sub-flowchart of a text data quality inspection method according to an embodiment of the present application;
FIG. 5 is another schematic sub-flowchart of a text data quality inspection method according to an embodiment of the present application;
FIG. 6 is a schematic block diagram of a text data quality inspection apparatus according to an embodiment of the present application;
FIG. 7 is a schematic block diagram of a first marking unit according to an embodiment of the present application;
FIG. 8 is a schematic block diagram of a training unit according to an embodiment of the present application;
FIG. 9 is a schematic block diagram of a second marking unit according to an embodiment of the present application;
FIG. 10 is a schematic block diagram of an integration unit according to an embodiment of the present application;
FIG. 11 is a schematic block diagram of a computer device according to an embodiment of the present application.
DETAILED DESCRIPTION
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings in the embodiments of the present application. Apparently, the described embodiments are some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort shall fall within the protection scope of the present application.
In this application, it should be understood that although the terms first, second, etc. may be used herein to describe various elements, these elements should not be limited by these terms. These terms are only used to distinguish the elements from one another. For example, without departing from the scope of the present application, a first obtaining unit may be referred to as a second obtaining unit, and similarly, a second obtaining unit may be referred to as a first obtaining unit. The first obtaining unit and the second obtaining unit are both obtaining units, but they are not the same obtaining unit.
FIG. 1 is a schematic flowchart of a text data quality inspection method according to an embodiment of the present application. The method includes the following steps S101-S106.
S101: Obtain message-level dialog text data and session-level session text data. The message-level dialog text data is obtained from the agent-sales platform and stores the dialogue text between agents and clients. The dialog text data belongs to the message level, which can be understood as data saved in units of the messages sent between the agent and the client; the dialog text data is composed of many pieces of message text data, and each piece of message text data includes a message number, a sender, a recipient, the specific message content, the time the message was sent, and so on. Session-level session text data is understood as data saved in units of a conversation (session) between the agent and the client; that is, the session text data holds multiple pieces of dialog data between agents and clients, where each piece of dialog data includes a dialog number and dialog content, and each dialog content corresponds to multiple pieces of message text data. Because message-level dialog text data is saved in units of messages, it is scattered and unordered, with no context and no participant relationships, which is inconvenient for users to view.
The session-level session text data can be obtained by processing the message-level dialog text data. The specific processing flow includes: preprocessing the message-level dialog text data, for example de-duplication; finding the sender and recipient in each piece of message text data in the preprocessed message-level dialog text data; taking the sender and recipient as a set and grouping the message text data in the dialog text data according to that set, so that message text data with the same set is divided into one group; in this way the data is divided into multiple groups, which means that the sender and recipient in one group are the two people in the same conversation, and different conversations fall into different groups; sorting the message text data in each group in chronological order; and displaying the sorted message text data in a predetermined format, for example: the time the message was sent [space] the sender [colon] the specific message content, e.g. 2017-01-01 12:01:02 张三: 李老师，在吗？ It can be understood that the session-level session text data consists of multiple pieces of dialog data, in units of conversations, obtained by arranging the message text data of the message-level dialog text data according to chronological order and the sender/recipient information. The message-level dialog text data and the session-level session text data are stored in a database, such as an Oracle database, in the form of data tables; depending on the amount of data, each can be saved as multiple data tables or as a single data table.
S102: Preprocess the dialog text data and the session text data. Preprocessing methods include replacement, filtering, and so on. Replacement includes replacing the English in the corresponding message text data in the dialog text data and session text data with Chinese, and so on; filtering includes filtering out the digits, punctuation marks, emoticons, and garbled characters in the corresponding message text data in the dialog text data and session text data. The message text data in the dialog text data and session text data is preprocessed so as to preserve the plain-text message in the specific message content of the message text data, which facilitates subsequent processing.
S103: According to the preset quality checkpoints and the rules corresponding to the checkpoints, query matching quality checkpoints from the preprocessed dialog text data and session text data by using a full-text search engine, and mark them. A quality checkpoint can be understood as a place of non-compliance or violation, that is, a place where there is a mistake. Each quality checkpoint has a checkpoint identifier, such as A47, which denotes the 47th checkpoint in category A; the rule corresponding to a checkpoint includes keywords and a logical operation. An example of a checkpoint and its rule: A47, fund AND dividend, where the keywords are "fund" and "dividend" and the logical operation is AND. The A47 checkpoint indicates that if "fund" and "dividend" both appear in one message, the message is considered a violation: the fund product does not involve dividends, so if a message mentions both the fund and dividends at the same time, the message is non-compliant, that is, an error occurs. The full-text search engine refers to the ElasticSearch (ES for short) search engine. ES uses the keywords, and the rules corresponding to the checkpoints are implemented by combining ES's must, should, and must_not API clauses; queries are run against the dialog text data and session text data to find matching checkpoints and mark them, where the marking uses the checkpoint identifier. Querying is performed from the dialog text data and the session text data separately: some checkpoints can be expressed by message-level text, in which case the query is run against the dialog text data; other checkpoints cannot be obtained from a single message-level text but require several preceding and following message-level texts, in which case the query must be run against the session text data.
In an embodiment, as shown in FIG. 2, the step S103 includes the following steps S201-S203.
S201: Segment the preprocessed dialog text data and session text data. The word segmentation in the full-text search engine splits the specific message content of each piece of message text data in the preprocessed dialog text data and session text data into multiple words; for example, the message "我来到北京清华大学" is segmented into "我/来到/北京/清华大学".
S202: Build an inverted index on the segmented data. Specifically, count the number of times each segmented word appears in the dialog text data and session text data, and its positions; then build an inverted index on the segmented words according to the occurrence counts and positions. For example, count the occurrences and positions of the word "分红" (dividend) in the dialog text data and session text data, where a position in the dialog text data includes which dialog text data table and which message text data (representable by the message number), and a position in the session text data includes which session text data table and which conversation (representable by the dialog number), and so on. The inverted index is a storage form that implements the "word-document matrix"; through it, the "document list" containing a word can be quickly obtained from the word. For example, in the session text data, the inverted index makes it possible to quickly obtain, from a segmented word, the list of conversations containing that word, i.e., which conversations the word appears in.
S203: According to the preset quality checkpoints and the rules corresponding to the checkpoints, query matching quality checkpoints from the dialog text table and the session text table by using the built inverted index and the full-text search engine, and mark them. After a matching checkpoint is found according to the rule corresponding to the checkpoint, the checkpoint is marked. Marking in the dialog text data can be understood as adding the corresponding checkpoint mark after each piece of message text data in the dialog text data, and marking in the session text data can be understood as adding the corresponding checkpoint mark after the multiple pieces of message text data corresponding to each dialog content in the session text data. After the inverted index is built, query matching is faster; even with a very large amount of data, the querying, matching, and marking of quality checkpoints can still be completed quickly.
S104: Select one of the trained neural network models as the preset neural network model.
Selecting one of the trained neural network models as the preset neural network model includes: selecting, from multiple different trained neural network models, the neural network model with the highest classification accuracy as the preset neural network model. The multiple neural network models may be a Long Short-Term Memory (LSTM) model (Model 1 for short), an LSTM model combined with an attention mechanism (Model 2 for short), a Bidirectional Long Short-Term Memory (BLSTM) model (Model 3 for short), and a BLSTM model combined with an attention mechanism (Model 4 for short), and may also be other applicable neural network models.
After the trained Model 1, Model 2, Model 3, and Model 4 are obtained, a test sample set is obtained; for example, a certain number of dialog text data and session text data containing quality checkpoints may be obtained in advance as the test sample set, where such data can be obtained by marking dialog text data and session text data with the full-text search engine and saving the result, and the results of manual quality inspection can also be combined. Using the test sample set, Model 1, Model 2, Model 3, and Model 4 classify the quality checkpoints and mark the classified checkpoints; the classification accuracy of each model is calculated, where the accuracy formula is (A_i ∩ B)/C, in which C denotes the number of manually reviewed samples, A_i denotes the number of checkpoints marked by model i among the manually reviewed samples, B denotes the number of checkpoints marked manually among the manually reviewed samples, and i ∈ {1, 2, 3, 4} indexes the corresponding model; the model with the higher classification accuracy is selected as the preset neural network model. Understandably, a model with higher classification accuracy is selected from multiple different neural network models as the preset neural network model to classify the dialog text data and session text data and mark the classified checkpoints, so as to improve the accuracy of the model in classifying quality checkpoints.
In other embodiments, before the neural network model with the highest classification accuracy is selected from multiple different trained neural network models as the preset neural network model, the method further includes training the multiple different neural network models; for example, a certain number of dialog text data and session text data containing quality checkpoints may be obtained in advance as a training sample set. Note that the training sample set and the test sample set do not contain the same data, and in general the training sample set is larger than the test sample set. Using the training sample set, Model 1, Model 2, Model 3, and Model 4 are trained simultaneously to obtain the trained Model 1, Model 2, Model 3, and Model 4.
It should be pointed out that, in other embodiments, selecting one of the trained neural network models as the preset neural network model includes: using a single trained neural network model as the preset neural network model. Understandably, only one neural network model is selected and trained, and the trained model is used as the preset neural network model; in this way there is no need to select the model with the highest classification accuracy from multiple models. However, whether one neural network model is used or the model with the highest classification accuracy is selected from multiple models, the process of training the neural network model is required.
In an embodiment, as shown in FIG. 3, the process of training the neural network model includes the following steps S301-S304.
S301: Obtain dialog text data and session text data containing quality checkpoints. If the dialog text data and session text data containing quality checkpoints have not been preprocessed, preprocessing is also required; preprocessing methods include replacement, filtering, and so on. Replacement includes replacing the English in the corresponding message text data with Chinese, and so on; filtering includes filtering out the digits, punctuation marks, emoticons, and garbled characters in the corresponding message text data. The dialog text data and session text data containing quality checkpoints are obtained as the training sample set.
S302: Use a word segmentation tool to segment the text information in the dialog text data and session text data containing quality checkpoints. The segmentation tool may be Jieba segmentation, which splits each piece of message text data in the dialog text data and session text data into multiple words. Jieba supports three modes: first, the precise mode, which tries to cut the sentence as accurately as possible and is suitable for text analysis; second, the full mode, which scans out all the words in the sentence that can form words and is very fast but cannot resolve ambiguity; third, the search-engine mode, which, on the basis of the precise mode, splits long words again to improve recall and is suitable for search-engine segmentation. In this embodiment, the precise mode is used to segment the dialog text data and session text data containing quality checkpoints. Segmentation splits each piece of message text data into multiple words; for example, the message "我来到北京清华大学" is segmented into "我/来到/北京/清华大学".
S303: Process the segmented data with a preset word vector model to obtain the corresponding word vectors. The word embedding model refers to gensim's word2vec word vector model. gensim is a tool for mining the semantic structure of documents by modeling the patterns of phrases (or higher-level structures, such as whole sentences or documents). gensim takes a collection of text documents as input and generates a "vector" to represent the text content of the corpus, thereby implementing semantic mining; this vector representation can be used to train a "model". word2vec is a gensim "model" that can compute word vectors. word2vec is actually a shallow neural network; it can train efficiently on dictionaries of millions of words and datasets of hundreds of millions of items, and the training result is word vectors, which measure word-to-word similarity well.
The preset word vector model is obtained by pre-training. The process of training the word vector model is as follows: obtain dialog text data and session text data containing quality checkpoints, where if the data has not been preprocessed, preprocessing (replacement, filtering, etc.) is also required; segment the dialog text data and session text data containing quality checkpoints, where the precise mode of the Jieba segmentation tool can be used; set the parameters for training the word2vec model, such as the minimum count min_count=5 (words occurring fewer than 5 times are discarded), the number of units in the hidden layer of the neural network size=128, and the number of iterations iterator=5; and use the segmented data as the training data set to train the word2vec model and obtain the preset word vector model.
It should be pointed out that the amount of data used to train the word vector model is usually very large, far larger than the amount needed to train the neural network model. In practice, a large amount of dialog text data and session text data containing quality checkpoints may first be segmented and used to train the word vector model to obtain the preset word vector model, after which a portion of the segmented data is taken to train the neural network model to obtain the preset neural network model. Understandably, the data for training the word vector model and for training the neural network model may be the same batch or, in other embodiments, different batches; that is, the two training data sets are not the same.
S304: Train the neural network model according to the word vectors and the corresponding quality checkpoints. Specifically, the word vectors and the corresponding checkpoints are input to train the neural network: if the model is an LSTM model, the LSTM network is trained; if it is a BLSTM model, the BLSTM network is trained. The data output by each node of the neural network is input to an average pooling layer to fuse the results of the individual nodes; the pooled data is then input to a softmax function to obtain the classification result, so that ultimately the obtained classification results agree with the marked checkpoint results as much as possible.
It should be pointed out that, at regular intervals, such as one week or half a month, the preset neural network model or the preset word vector model needs to be updated; specifically, new dialog text data and session text data containing quality checkpoints are used as input for the update, so that the preset neural network model or word vector model can always adapt to changes in new data.
S105: Classify the data in the preprocessed dialog text data and session text data by using the preset neural network model, and mark the classified quality checkpoints.
In an embodiment, as shown in FIG. 4, the step S105 includes the following steps S401-S404.
S401: Obtain the preprocessed dialog text data and session text data.
S402: Use the word segmentation tool to segment the text information in the preprocessed dialog text data and session text data. The segmentation tool may be Jieba segmentation, which splits each piece of message text data in the dialog text data and session text data into multiple words; specifically, the precise mode of Jieba segmentation is used.
S403: Process the segmented data with the preset word vector model to obtain the corresponding word vectors.
S404: Classify according to the corresponding word vectors by using the preset neural network model, obtain the classified quality checkpoints, and mark the checkpoints. Marking in the dialog text data can be understood as adding the corresponding checkpoint mark after each piece of message text data in the dialog text data, and marking in the session text data can be understood as adding the corresponding checkpoint mark after the multiple pieces of message text data corresponding to each dialog content in the session text data.
S106: Integrate, according to a preset rule, the quality checkpoint results marked by the full-text search engine with the quality checkpoint results marked by the preset neural network model, and use the integrated quality checkpoint results as the final quality checkpoint results.
In an embodiment, as shown in FIG. 5, the step S106 includes the following steps S501-S504.
S501: Merge the quality checkpoints marked by the full-text search engine in the preprocessed dialog text data and session text data to obtain the quality checkpoint results marked by the full-text search engine. Specifically, the dialog text data containing the checkpoints marked by the full-text search engine is processed to obtain processed session text data, and the processed session text data is merged with the session text data containing the checkpoints marked by the full-text search engine, to obtain the checkpoint results marked by the full-text search engine, where the processed session text data carries checkpoints marked in units of message text data. The checkpoint results marked by the full-text search engine therefore include checkpoints marked in units of message text data as well as checkpoints marked in units of conversations. These results are ultimately displayed in the session text data; display there makes it convenient for the user to consult the related content further. The processing mainly includes: finding the sender and recipient in each piece of message text data in the dialog text data containing the checkpoints marked by the full-text search engine; taking the sender and recipient as a set and grouping the message text data according to that set; sorting the message text data in each group in chronological order; and displaying the sorted message text data in the predetermined format, for example: the time the message was sent [space] the sender [colon] the specific message content. It can be understood that the processed session text data consists of multiple pieces of dialog data, in units of conversations, obtained by arranging the message text data of the dialog text data in chronological order.
S502: Merge the quality checkpoints marked by the preset neural network model in the preprocessed dialog text data and session text data to obtain the quality checkpoint results marked by the preset neural network model. Specifically, the dialog text data containing the checkpoints marked by the preset neural network model is processed to obtain processed session text data, and the processed session text data is merged with the session text data containing the checkpoints marked by the preset neural network model, to obtain the checkpoint results marked by the preset neural network model, where the processed session text data carries checkpoints marked in units of message text data. The checkpoint results marked by the preset neural network model include checkpoints marked in units of message text data as well as checkpoints marked in units of conversations. These results are ultimately displayed in the session text data, which makes it convenient for the user to consult the related content further. The processing is as described above and is not repeated here.
S503: Integrate, according to the preset rule, the quality checkpoint results marked by the full-text search engine with the quality checkpoint results marked by the preset neural network model. The preset rule may be merging, which can be understood as a logical AND operation, that is, performing a logical AND operation between the checkpoint results marked by the full-text search engine and those marked by the preset neural network model. For example, for a certain piece of message text data in the dialog text data, if the checkpoint result marked by the full-text search engine is A47 and the checkpoint result marked by the preset neural network model is B16, then after the operation the checkpoint result of that message text data is A47, B16; if the result marked by the full-text search engine is A47 and the result marked by the preset neural network model is empty, then after the operation the checkpoint result of that message text data is A47. The preset rule may also be to select, of the two, the checkpoint result with the higher accuracy as the integrated checkpoint result: if the accuracy of the checkpoint results marked by the full-text search engine is lower than the accuracy of those marked by the preset neural network model, the results marked by the preset neural network model are selected as the integrated checkpoint results; if the accuracy of the results marked by the full-text search engine is not lower than that of the results marked by the preset neural network model, the results marked by the full-text search engine are selected as the integrated checkpoint results.
S504: Use the integrated quality checkpoint results as the final quality checkpoint results.
In other embodiments, the checkpoints marked by the full-text search engine in the preprocessed dialog text data and the checkpoints marked by the preset neural network model in the preprocessed dialog text data may also be processed according to the preset rule; then the checkpoints marked by the full-text search engine in the preprocessed session text data and the checkpoints marked by the preset neural network model in the preprocessed session text data are processed according to the preset rule, and the two processed results are merged to obtain the final checkpoint results.
In the above embodiments, by using the full-text search engine and the preset neural network model to mark quality checkpoints in the preprocessed dialog text and session text, the situation caused by random machine extraction and manual spot checks, in which only part of the data is processed and other possible checkpoints are missed, is avoided. This scheme can process all the data, efficiently inspecting all of it and improving quality inspection efficiency. At the same time, the full-text search engine and the preset neural network model, two different models, mark checkpoints in the preprocessed dialog text and session text, and the checkpoints from the two models are integrated to find all possible checkpoints and improve quality inspection accuracy.
FIG. 6 is a schematic block diagram of a text data quality inspection apparatus according to an embodiment of the present application. As shown in FIG. 6, the apparatus 60 includes an obtaining unit 601, a preprocessing unit 602, a first marking unit 603, a selecting unit 604, a second marking unit 605, an integration unit 606, and a training unit 607.
The obtaining unit 601 is configured to obtain message-level dialog text data and session-level session text data.
The preprocessing unit 602 is configured to preprocess the dialog text data and the session text data. Preprocessing methods include replacement, filtering, and so on.
The first marking unit 603 is configured to query, according to the preset quality checkpoints and the rules corresponding to the checkpoints, matching quality checkpoints from the preprocessed dialog text data and session text data by using the full-text search engine, and to mark them.
In an embodiment, as shown in FIG. 7, the first marking unit 603 includes a data segmentation unit 701, an indexing unit 702, and a query marking unit 703.
The data segmentation unit 701 is configured to segment the preprocessed dialog text data and session text data.
The indexing unit 702 is configured to build an inverted index on the segmented data.
The query marking unit 703 is configured to query, according to the preset quality checkpoints and the rules corresponding to the checkpoints, matching quality checkpoints from the dialog text table and the session text table by using the built inverted index and the full-text search engine, and to mark them.
The selecting unit 604 is configured to select one of the trained neural network models as the preset neural network model.
In other embodiments, the apparatus 60 further includes a training unit 607.
As shown in FIG. 8, the training unit 607 includes a first obtaining unit 801, a first segmentation unit 802, a first word vector unit 803, and a model training unit 804.
The first obtaining unit 801 is configured to obtain dialog text data and session text data containing quality checkpoints. The first segmentation unit 802 is configured to use the word segmentation tool to segment the text information in the dialog text data and session text data containing quality checkpoints.
The first word vector unit 803 is configured to process the segmented data with the preset word vector model to obtain the corresponding word vectors.
The preset word vector model is obtained by pre-training; that is, in other embodiments, the training unit further includes a preset word vector obtaining unit configured to train the word vector model to obtain the preset word vector model. Specifically, the preset word vector obtaining unit includes a quality inspection data obtaining unit, a quality inspection data segmentation unit, a setting unit, and a word vector training unit.
The model training unit 804 is configured to train the neural network model according to the word vectors and the corresponding quality checkpoints.
The second marking unit 605 is configured to classify the data in the preprocessed dialog text data and session text data by using the preset neural network model and to mark the classified quality checkpoints.
In an embodiment, as shown in FIG. 9, the second marking unit 605 includes a second obtaining unit 901, a second segmentation unit 902, a second word vector unit 903, and a classification unit 904.
The second obtaining unit 901 is configured to obtain the preprocessed dialog text data and session text data.
The second segmentation unit 902 is configured to use the word segmentation tool to segment the text information in the preprocessed dialog text data and session text data.
The second word vector unit 903 is configured to process the segmented data with the preset word vector model to obtain the corresponding word vectors.
In other embodiments, the second marking unit further includes a preset word vector obtaining unit configured to train the word vector model to obtain the preset word vector model. Specifically, the preset word vector obtaining unit includes a quality inspection data obtaining unit, a quality inspection data segmentation unit, a setting unit, and a word vector training unit; for details, refer to the description of the preset word vector obtaining unit in the training unit.
The classification unit 904 is configured to classify according to the corresponding word vectors by using the preset neural network model, to obtain the classified quality checkpoints, and to mark the checkpoints.
The integration unit 606 is configured to integrate, according to the preset rule, the quality checkpoint results marked by the full-text search engine with those marked by the preset neural network model, and to use the integrated results as the final quality checkpoint results.
In an embodiment, as shown in FIG. 10, the integration unit 606 includes a first merging unit 101, a second merging unit 102, a result integration unit 103, and a checkpoint result determining unit 104.
The first merging unit 101 is configured to merge the quality checkpoints marked by the full-text search engine in the preprocessed dialog text data and session text data to obtain the checkpoint results marked by the full-text search engine.
The second merging unit 102 is configured to merge the quality checkpoints marked by the preset neural network model in the preprocessed dialog text data and session text data to obtain the checkpoint results marked by the preset neural network model.
The result integration unit 103 is configured to integrate, according to the preset rule, the checkpoint results marked by the full-text search engine with those marked by the preset neural network model.
The checkpoint result determining unit 104 is configured to use the integrated checkpoint results as the final checkpoint results.
For the specific working process and beneficial effects of the above apparatus embodiments, refer to the corresponding implementation process and beneficial effects of the foregoing method embodiments, which are not repeated here.
The above apparatus may be implemented in the form of a computer program, and the computer program can be run on a computer device as shown in FIG. 11.
FIG. 11 is a schematic block diagram of a computer device according to an embodiment of the present application. The computer device 110 may be a portable device such as a mobile phone or a pad, or a non-portable device such as a desktop computer. The device 110 includes a processor 112, a memory, and a network interface 113 connected by a system bus 111, where the memory may include a non-volatile storage medium 114 and an internal memory 115.
The non-volatile storage medium 114 can store an operating system 1141 and a computer program 1142. When the computer program 1142 is executed, it can cause the processor 112 to perform a text data quality inspection method. The processor 112 is used to provide computing and control capabilities to support the operation of the entire device 110. The internal memory 115 provides an environment for running the computer program in the non-volatile storage medium; when the computer program is executed by the processor 112, it can cause the processor 112 to perform a text data quality inspection method. The network interface 113 is used for network communication, such as acquiring data. Those skilled in the art will understand that the structure shown in FIG. 11 is only a block diagram of the part of the structure related to the solution of the present application and does not constitute a limitation on the device 110 to which the solution is applied; the specific device 110 may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
The processor 112 is configured to run the computer program stored in the memory to implement any embodiment of the foregoing text data quality inspection method.
It should be understood that, in the embodiments of the present application, the processor 112 may be a central processing unit (CPU), and the processor may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
Another embodiment of the present application provides a computer-readable storage medium storing a computer program, the computer program including program instructions that, when executed by a processor, implement any embodiment of the foregoing text data quality inspection method.
The computer-readable storage medium may be an internal storage unit of the terminal described in any of the foregoing embodiments, such as a hard disk or memory of the terminal. The computer-readable storage medium may also be an external storage device of the terminal, such as a plug-in hard disk, a Smart Media Card (SMC), or a Secure Digital (SD) card provided on the terminal. Further, the computer-readable storage medium may also include both an internal storage unit of the terminal and an external storage device.
In the several embodiments provided in this application, it should be understood that the disclosed terminal and method may be implemented in other ways. For example, the terminal embodiments described above are only illustrative; for example, the division of the units is only a logical functional division, and there may be other division manners in actual implementation. Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the terminal and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can easily conceive of various equivalent modifications or replacements within the technical scope disclosed in the present application, and these modifications or replacements shall all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

  1. A text data quality inspection method, characterized in that the method comprises:
    obtaining message-level dialog text data and session-level session text data;
    preprocessing the dialog text data and the session text data;
    querying, according to preset quality checkpoints and the rules corresponding to the checkpoints, matching quality checkpoints from the preprocessed dialog text data and session text data by using a full-text search engine, and marking them;
    classifying the data in the preprocessed dialog text data and session text data by using a preset neural network model, and marking the classified quality checkpoints;
    integrating, according to a preset rule, the quality checkpoint results marked by the full-text search engine with the quality checkpoint results marked by the preset neural network model, and using the integrated quality checkpoint results as the final quality checkpoint results.
  2. The method according to claim 1, characterized in that, before the classifying of the data in the preprocessed dialog text data and session text data by using the preset neural network model and the marking of the classified quality checkpoints, the method further comprises training a neural network model and selecting one of the trained neural network models as the preset neural network model, wherein training the neural network model comprises:
    obtaining dialog text data and session text data containing quality checkpoints;
    using a word segmentation tool to segment the text information in the dialog text data and session text data containing quality checkpoints;
    processing the segmented data with a preset word vector model to obtain corresponding word vectors;
    training the neural network model according to the word vectors and the corresponding quality checkpoints.
  3. The method according to claim 2, characterized in that selecting one of the trained neural network models as the preset neural network model comprises:
    selecting, from multiple different trained neural network models, the neural network model with the highest classification accuracy as the preset neural network model.
  4. The method according to claim 1, characterized in that the classifying of the data in the preprocessed dialog text data and session text data by using the preset neural network model and the marking of the classified quality checkpoints comprise:
    obtaining the preprocessed dialog text data and session text data;
    using a word segmentation tool to segment the text information in the preprocessed dialog text data and session text data;
    processing the segmented data with a preset word vector model to obtain corresponding word vectors;
    classifying according to the corresponding word vectors by using the preset neural network model, obtaining the classified quality checkpoints, and marking the checkpoints.
  5. The method according to claim 4, characterized in that obtaining the preset word vector model comprises:
    obtaining data obtained by segmenting dialog text data and session text data containing quality checkpoints;
    setting the parameters for training the word vector model;
    using the segmented data as a training data set, and training the word vector model to obtain the preset word vector model.
  6. The method according to claim 1, characterized in that the querying, according to the preset quality checkpoints and the rules corresponding to the checkpoints, of matching quality checkpoints from the preprocessed dialog text data and session text data by using the full-text search engine, and the marking thereof, comprise:
    segmenting the preprocessed dialog text data and session text data;
    building an inverted index on the segmented data;
    querying, according to the preset quality checkpoints and the rules corresponding to the checkpoints, matching quality checkpoints from the dialog text table and the session text table by using the built inverted index and the full-text search engine, and marking them.
  7. The method according to claim 1, characterized in that the integrating, according to the preset rule, of the quality checkpoint results marked by the full-text search engine with the quality checkpoint results marked by the preset neural network model, and the using of the integrated quality checkpoint results as the final quality checkpoint results, comprise:
    merging the quality checkpoints marked by the full-text search engine in the preprocessed dialog text data and session text data to obtain the quality checkpoint results marked by the full-text search engine;
    merging the quality checkpoints marked by the preset neural network model in the preprocessed dialog text data and session text data to obtain the quality checkpoint results marked by the preset neural network model;
    integrating, according to the preset rule, the quality checkpoint results marked by the full-text search engine with the quality checkpoint results marked by the preset neural network model;
    using the integrated quality checkpoint results as the final quality checkpoint results.
  8. A text data quality inspection apparatus, characterized in that the text data quality inspection apparatus comprises:
    an obtaining unit configured to obtain message-level dialog text data and session-level session text data;
    a preprocessing unit configured to preprocess the dialog text data and the session text data;
    a first marking unit configured to query, according to preset quality checkpoints and the rules corresponding to the checkpoints, matching quality checkpoints from the preprocessed dialog text data and session text data by using a full-text search engine, and to mark them;
    a second marking unit configured to classify the data in the preprocessed dialog text data and session text data by using a preset neural network model and to mark the classified quality checkpoints;
    an integration unit configured to integrate, according to a preset rule, the quality checkpoint results marked by the full-text search engine with the quality checkpoint results marked by the preset neural network model, and to use the integrated quality checkpoint results as the final quality checkpoint results.
  9. The text data quality inspection apparatus according to claim 8, characterized in that the second marking unit comprises:
    a second obtaining unit configured to obtain the preprocessed dialog text data and session text data;
    a second segmentation unit configured to use a word segmentation tool to segment the text information in the preprocessed dialog text data and session text data;
    a second word vector unit configured to process the segmented data with a preset word vector model to obtain corresponding word vectors;
    a classification unit configured to classify according to the corresponding word vectors by using the preset neural network model, to obtain the classified quality checkpoints, and to mark the checkpoints.
  10. The text data quality inspection apparatus according to claim 8, characterized in that the integration unit comprises:
    a first merging unit configured to merge the quality checkpoints marked by the full-text search engine in the preprocessed dialog text data and session text data to obtain the quality checkpoint results marked by the full-text search engine;
    a second merging unit configured to merge the quality checkpoints marked by the preset neural network model in the preprocessed dialog text data and session text data to obtain the quality checkpoint results marked by the preset neural network model;
    a result integration unit configured to integrate, according to the preset rule, the quality checkpoint results marked by the full-text search engine with the quality checkpoint results marked by the preset neural network model;
    a checkpoint result determining unit configured to use the integrated quality checkpoint results as the final quality checkpoint results.
  11. A computer device, wherein the computer device comprises a memory and a processor connected to the memory;
    the memory is configured to store a computer program, and the processor is configured to run the computer program stored in the memory to perform the following steps:
    obtaining message-level dialogue text data and session-level session text data;
    preprocessing the dialogue text data and the session text data;
    according to preset quality inspection points and the rules corresponding to the quality inspection points, querying and marking matching quality inspection points in the preprocessed dialogue text data and session text data using a full-text search engine;
    classifying the data in the preprocessed dialogue text data and session text data using a preset neural network model, and marking the classified quality inspection points;
    integrating, according to a preset rule, the quality inspection point results marked by the full-text search engine with the quality inspection point results marked by the preset neural network model, and taking the integrated quality inspection point results as the final quality inspection point results.
  12. The computer device according to claim 11, wherein before the classifying the data in the preprocessed dialogue text data and session text data using a preset neural network model and marking the classified quality inspection points, the processor further performs the following step: training neural network models and selecting one of the trained neural network models as the preset neural network model; wherein, when performing the training of a neural network model, the processor specifically performs the following steps:
    obtaining dialogue text data and session text data containing quality inspection points;
    segmenting the text information in the dialogue text data and session text data containing quality inspection points using a word segmentation tool;
    processing the segmented data with a preset word vector model to obtain corresponding word vectors;
    training the neural network model according to the word vectors and the corresponding quality inspection points.
  13. The computer device according to claim 12, wherein, when performing the selecting one of the trained neural network models as the preset neural network model, the processor specifically performs the following step: selecting, from a plurality of different trained neural network models, the neural network model with the highest classification accuracy as the preset neural network model.
  14. The computer device according to claim 11, wherein, when performing the classifying the data in the preprocessed dialogue text data and session text data using a preset neural network model and marking the classified quality inspection points, the processor specifically performs the following steps:
    obtaining the preprocessed dialogue text data and session text data;
    segmenting the text information in the preprocessed dialogue text data and session text data using a word segmentation tool;
    processing the segmented data with the preset word vector model to obtain corresponding word vectors;
    classifying according to the corresponding word vectors using the preset neural network model to obtain the classified quality inspection points and mark them.
  15. The computer device according to claim 14, wherein the processor further performs the following step: obtaining the preset word vector model; when performing the obtaining the preset word vector model, the processor specifically performs the following steps:
    obtaining the data produced by segmenting the dialogue text data and session text data containing quality inspection points;
    setting the parameters for training a word vector model;
    taking the segmented data as the training data set and training the word vector model to obtain the preset word vector model.
  16. The computer device according to claim 11, wherein, when performing the querying and marking matching quality inspection points in the preprocessed dialogue text data and session text data using a full-text search engine according to preset quality inspection points and the rules corresponding to the quality inspection points, the processor specifically performs the following steps:
    segmenting the preprocessed dialogue text data and session text data;
    building an inverted index on the segmented data;
    according to the preset quality inspection points and the rules corresponding to the quality inspection points, querying and marking matching quality inspection points from the dialogue text table and the session text table using the built inverted index and the full-text search engine.
  17. The computer device according to claim 11, wherein, when performing the integrating, according to a preset rule, the quality inspection point results marked by the full-text search engine with the quality inspection point results marked by the preset neural network model and taking the integrated quality inspection point results as the final quality inspection point results, the processor specifically performs the following steps:
    merging the quality inspection points marked by the full-text search engine in the preprocessed dialogue text data and session text data to obtain the quality inspection point results marked by the full-text search engine;
    merging the quality inspection points marked by the preset neural network model in the preprocessed dialogue text data and session text data to obtain the quality inspection point results marked by the preset neural network model;
    integrating, according to the preset rule, the quality inspection point results marked by the full-text search engine with the quality inspection point results marked by the preset neural network model;
    taking the integrated quality inspection point results as the final quality inspection point results.
  18. A computer readable storage medium, wherein the computer readable storage medium stores a computer program, the computer program comprises program instructions, and the program instructions, when executed by a processor, implement the following steps:
    obtaining message-level dialogue text data and session-level session text data;
    preprocessing the dialogue text data and the session text data;
    according to preset quality inspection points and the rules corresponding to the quality inspection points, querying and marking matching quality inspection points in the preprocessed dialogue text data and session text data using a full-text search engine;
    classifying the data in the preprocessed dialogue text data and session text data using a preset neural network model, and marking the classified quality inspection points;
    integrating, according to a preset rule, the quality inspection point results marked by the full-text search engine with the quality inspection point results marked by the preset neural network model, and taking the integrated quality inspection point results as the final quality inspection point results.
  19. The computer readable storage medium according to claim 18, wherein, when the classifying the data in the preprocessed dialogue text data and session text data using a preset neural network model and marking the classified quality inspection points is performed by the processor, the following steps are specifically implemented:
    obtaining the preprocessed dialogue text data and session text data;
    segmenting the text information in the preprocessed dialogue text data and session text data using a word segmentation tool;
    processing the segmented data with the preset word vector model to obtain corresponding word vectors;
    classifying according to the corresponding word vectors using the preset neural network model to obtain the classified quality inspection points and mark them.
  20. The computer readable storage medium according to claim 18, wherein, when the integrating, according to a preset rule, the quality inspection point results marked by the full-text search engine with the quality inspection point results marked by the preset neural network model and taking the integrated quality inspection point results as the final quality inspection point results is performed by the processor, the following steps are specifically implemented:
    merging the quality inspection points marked by the full-text search engine in the preprocessed dialogue text data and session text data to obtain the quality inspection point results marked by the full-text search engine;
    merging the quality inspection points marked by the preset neural network model in the preprocessed dialogue text data and session text data to obtain the quality inspection point results marked by the preset neural network model;
    integrating, according to the preset rule, the quality inspection point results marked by the full-text search engine with the quality inspection point results marked by the preset neural network model;
    taking the integrated quality inspection point results as the final quality inspection point results.
PCT/CN2018/102069 2018-03-22 2018-08-24 Text data quality inspection method, apparatus, device and computer readable storage medium WO2019179022A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810240050.5 2018-03-22
CN201810240050.5A CN108446388A (zh) 2018-03-22 2018-03-22 Text data quality inspection method, apparatus, device and computer readable storage medium

Publications (1)

Publication Number Publication Date
WO2019179022A1 true WO2019179022A1 (zh) 2019-09-26

Family

ID=63196144

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/102069 WO2019179022A1 (zh) 2018-03-22 2018-08-24 Text data quality inspection method, apparatus, device and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN108446388A (zh)
WO (1) WO2019179022A1 (zh)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109635918A * 2018-10-30 2019-04-16 银河水滴科技(北京)有限公司 Neural network automatic training method and apparatus based on cloud platform and preset model
CN109376844A * 2018-10-30 2019-02-22 银河水滴科技(北京)有限公司 Neural network automatic training method and apparatus based on cloud platform and model recommendation
CN110019817A * 2018-12-04 2019-07-16 阿里巴巴集团控股有限公司 Method, apparatus and electronic device for detecting text information in video
CN109740759B * 2018-12-13 2024-05-03 平安科技(深圳)有限公司 Learning model optimization and selection method, electronic apparatus and computer device
CN109815329A * 2018-12-13 2019-05-28 平安科技(深圳)有限公司 Model integration and prediction method for text quality inspection, electronic apparatus and computer device
CN109740760B * 2018-12-25 2024-04-05 平安科技(深圳)有限公司 Automated training method for text quality inspection, electronic apparatus and computer device
CN109815487B * 2018-12-25 2023-04-18 平安科技(深圳)有限公司 Text quality inspection method, electronic apparatus, computer device and storage medium
CN109726764A * 2018-12-29 2019-05-07 北京航天数据股份有限公司 Model selection method, apparatus, device and medium
CN109739989B * 2018-12-29 2021-05-18 奇安信科技集团股份有限公司 Text classification method and computer device
CN110414790A * 2019-06-28 2019-11-05 深圳追一科技有限公司 Method, apparatus, device and storage medium for determining quality inspection effect
CN110610004A * 2019-09-03 2019-12-24 深圳追一科技有限公司 Method, apparatus, computer device and storage medium for detecting annotation quality
CN110909162B * 2019-11-15 2020-10-27 龙马智芯(珠海横琴)科技有限公司 Text quality inspection method, storage medium and electronic device
CN110929011A * 2019-11-28 2020-03-27 北京思特奇信息技术股份有限公司 Dialogue analysis method, apparatus and device
CN111177380A * 2019-12-21 2020-05-19 厦门快商通科技股份有限公司 Intent data quality inspection method and system
CN112468658B * 2020-11-20 2022-10-25 平安普惠企业管理有限公司 Voice quality detection method, apparatus, computer device and storage medium
CN115546574A * 2021-06-30 2022-12-30 华为技术有限公司 Image classification and model training method, device, storage medium and computer program
CN113657773B * 2021-08-19 2023-08-29 中国平安人寿保险股份有限公司 Speech script quality inspection method, apparatus, electronic device and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105701088A (zh) * 2016-02-26 2016-06-22 北京京东尚科信息技术有限公司 Method and apparatus for switching from machine dialogue to human dialogue
CN106934000A (zh) * 2017-03-03 2017-07-07 深圳市彬讯科技有限公司 Automatic voice quality inspection method and system for a call system
CN107093431A (zh) * 2016-02-18 2017-08-25 中国移动通信集团辽宁有限公司 Method and apparatus for quality inspection of service quality
CN107333014A (zh) * 2017-06-29 2017-11-07 上海澄美信息服务有限公司 Intelligent recording quality inspection system
CN107705807A (zh) * 2017-08-24 2018-02-16 平安科技(深圳)有限公司 Voice quality inspection method, apparatus, device and storage medium based on emotion recognition

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
MX2013002671A (es) * 2010-09-10 2013-07-29 Atg Advanced Swiss Technology Group Ag Method for the digital detection and evaluation of illegal graphic material
CN106373558B (zh) * 2015-07-24 2019-10-18 科大讯飞股份有限公司 Speech recognition text processing method and system
CN105141787A (zh) * 2015-08-14 2015-12-09 上海银天下科技有限公司 Compliance checking method and apparatus for service recordings
CN105894088B (zh) * 2016-03-25 2018-06-29 苏州赫博特医疗信息科技有限公司 Medical information extraction system and method based on deep learning and distributed semantic features
CN106202330B (zh) * 2016-07-01 2020-02-07 北京小米移动软件有限公司 Spam information judgment method and apparatus

Also Published As

Publication number Publication date
CN108446388A (zh) 2018-08-24

Similar Documents

Publication Publication Date Title
WO2019179022A1 (zh) Text data quality inspection method, apparatus, device and computer readable storage medium
CN108717406B (zh) Text emotion analysis method, apparatus and storage medium
CN108491388B (zh) Data set acquisition method, classification method, apparatus, device and storage medium
WO2020077824A1 (zh) Method, apparatus, device and storage medium for locating abnormal problems
WO2019184217A1 (zh) Hotspot event classification method, apparatus and storage medium
WO2019200806A1 (zh) Apparatus and method for generating a text classification model, and computer readable storage medium
US8688690B2 (en) Method for calculating semantic similarities between messages and conversations based on enhanced entity extraction
WO2021042521A1 (zh) Automatic contract generation method, computer device and computer non-volatile storage medium
CN108701155B (zh) Expert detection in social networks
US8762375B2 (en) Method for calculating entity similarities
US20150006148A1 (en) Automatically Creating Training Data For Language Identifiers
US9218568B2 (en) Disambiguating data using contextual and historical information
US20160189037A1 (en) Hybrid technique for sentiment analysis
US10002187B2 (en) Method and system for performing topic creation for social data
CN113076735B (zh) Target information acquisition method, apparatus and server
US10216837B1 (en) Selecting pattern matching segments for electronic communication clustering
US11645456B2 (en) Siamese neural networks for flagging training data in text-based machine learning
CN111460810A (zh) Sampling inspection method and apparatus for crowdsourced tasks, computer device and storage medium
CN114595686A (zh) Knowledge extraction method, and training method and apparatus for knowledge extraction model
US9971762B2 (en) System and method for detecting meaningless lexical units in a text of a message
CN114092948A (zh) Bill recognition method, apparatus, device and storage medium
CN110728131A (zh) Method and apparatus for analyzing text attributes
CN112559711A (zh) Synonymous text prompting method, apparatus and electronic device
CN115827867A (zh) Text type detection method and apparatus
CN115600592A (zh) Method, apparatus, device and medium for extracting key information from text content

Legal Events

Code  Title and Description
NENP  Non-entry into the national phase
      Ref country code: DE

122   Ep: pct application non-entry in european phase
      Ref document number: 18910604
      Country of ref document: EP
      Kind code of ref document: A1