CN114491040B - Information mining method and device - Google Patents

Information mining method and device

Info

Publication number
CN114491040B
Authority
CN
China
Prior art keywords
vector
splicing
text
vectors
mined
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210104310.2A
Other languages
Chinese (zh)
Other versions
CN114491040A (en)
Inventor
韩磊
刘昊
唐竑轩
尹何举
王倩倩
刘凯
李婷婷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority claimed from CN202210104310.2A
Publication of CN114491040A
Application granted
Publication of CN114491040B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval of unstructured textual data
    • G06F 16/35: Clustering; Classification

Abstract

The present disclosure provides an information mining method, apparatus, device, storage medium, and computer program product, relating to the technical field of artificial intelligence, in particular to deep learning, and applicable to scenarios such as information mining. The specific implementation scheme is as follows: acquire an example text set and a text to be mined; perform vector combination and splicing on the example text set and the text to be mined to obtain a plurality of extraction splicing vectors; extract target information from the text to be mined based on the plurality of extraction splicing vectors; and classify the target information to obtain a target information type. The target information and its type are obtained with the example text set serving as a prompt, improving both the quality and the efficiency of information mining.

Description

Information mining method and device
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to the field of deep learning technologies, which can be applied to information mining and other scenarios, and in particular, to an information mining method, apparatus, device, storage medium, and computer program product.
Background
At present, a sampling strategy is usually adopted: a statistical method is used to extract a small amount of data from massive data for manual analysis, and target information types are obtained by summarization.
Disclosure of Invention
The present disclosure provides an information mining method, apparatus, device, storage medium, and computer program product, which improve the quality of information mining.
According to an aspect of the present disclosure, there is provided an information mining method, including: acquiring an example text set and a text to be mined; performing vector combination and splicing on the example text set and the text to be mined to obtain a plurality of extraction splicing vectors; extracting target information from the text to be mined based on the plurality of extraction splicing vectors; and classifying the target information to obtain a target information type.
According to another aspect of the present disclosure, there is provided an information mining apparatus, including: an acquisition module configured to acquire an example text set and a text to be mined; a splicing module configured to perform vector combination and splicing on the example text set and the text to be mined to obtain a plurality of extraction splicing vectors; an extraction module configured to extract target information from the text to be mined based on the plurality of extraction splicing vectors; and a classification module configured to classify the target information to obtain a target information type.
According to still another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the information mining method.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the information mining method.
According to yet another aspect of the present disclosure, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the above-mentioned information mining method.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is an exemplary system architecture diagram in which the present disclosure may be applied;
FIG. 2 is a flow diagram of one embodiment of an information mining method according to the present disclosure;
FIG. 3 is a flow diagram of another embodiment of an information mining method according to the present disclosure;
FIG. 4 is a flow diagram of yet another embodiment of an information mining method according to the present disclosure;
FIG. 5 is a schematic diagram of an information mining method according to the present disclosure;
FIG. 6 is a schematic block diagram of one embodiment of an information mining device according to the present disclosure;
FIG. 7 is a block diagram of an electronic device for implementing an information mining method of an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings. Various details of the embodiments are included to assist understanding and should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the disclosure. Descriptions of well-known functions and constructions are omitted below for clarity and conciseness.
Fig. 1 illustrates an exemplary system architecture 100 to which embodiments of the information mining method or apparatus of the present disclosure may be applied.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 serves as a medium for providing communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
The user may use the terminal devices 101, 102, 103 to interact with the server 105 via the network 104 to obtain the target information type, etc. Various client applications, such as a target information classification application, etc., may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, and 103 may be hardware or software. When they are hardware, they may be various electronic devices, including but not limited to smartphones, tablet computers, laptop computers, and desktop computers. When they are software, they may be installed in the electronic devices described above and implemented either as multiple pieces of software or software modules or as a single piece of software or software module. No particular limitation is imposed here.
The server 105 may provide various services based on the determined target information type. For example, the server 105 may analyze and process the text to be mined acquired from the terminal devices 101, 102, 103 and generate a processing result (e.g., the determined target information).
The server 105 may be hardware or software. When the server 105 is hardware, it may be implemented as a distributed cluster of multiple servers or as a single server. When the server 105 is software, it may be implemented as multiple pieces of software or software modules (e.g., to provide distributed services) or as a single piece of software or software module. No particular limitation is imposed here.
It should be noted that the information mining method provided by the embodiment of the present disclosure is generally executed by the server 105, and accordingly, the information mining device is generally disposed in the server 105.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for an implementation.
With continued reference to FIG. 2, a flow 200 of one embodiment of an information mining method according to the present disclosure is shown. The information mining method comprises the following steps:
Step 201, obtaining an example text set and a text to be mined.
In this embodiment, an executing body of the information mining method (e.g., the server 105 shown in fig. 1) may acquire an example text set and a text to be mined. The text to be mined is a text containing the target information to be extracted; it may be obtained from an existing text database, intercepted from the internet, or written directly, which the disclosure does not limit. The example text set contains a plurality of pieces of sample text information. It should be noted that the sample text information in the example text set is consistent with the content type of the text to be mined, so that target information can be extracted from the text to be mined based on the prompt of the example text set. The example text set may be formed by obtaining target texts from a text database and extracting target segments from them, or by searching for target texts among the massive texts on the internet.
Step 202, performing vector combination and splicing on the example text set and the text to be mined to obtain a plurality of extraction splicing vectors.
In this embodiment, after the executing body obtains the example text set and the text to be mined, it may further obtain a plurality of extraction splicing vectors. Specifically, vector conversion calculation may be performed on the example text set and the text to be mined to obtain one sample vector for each piece of sample text information in the example text set and one vector for each word in the text to be mined. Vector joint calculation is then performed on the sample vectors to obtain a single joint vector corresponding to all sample text information, and the vector of each word in the text to be mined is spliced with the joint vector to obtain a plurality of extraction splicing vectors.
Step 203, extracting target information from the text to be mined based on the plurality of extraction splicing vectors.
In this embodiment, after obtaining the plurality of extraction splicing vectors, the executing body may extract the target information from the text to be mined based on them. Each extraction splicing vector contains the information of one word in the text to be mined together with the joint vector corresponding to the example text set. The example text set may contain multiple pieces of sample text information and the text information types corresponding to them, and each piece of sample text information may correspond to a plurality of text information types. The extraction splicing vectors may be input into an information extraction model, which uses the sample text information carried by the extraction splicing vectors as an extraction prompt and outputs the target information extracted from the text to be mined.
Step 204, classifying the target information to obtain the target information type.
In this embodiment, after obtaining the target information, the executing body may classify it to obtain the target information type. Specifically, the target information may be matched against the sample text information in the example text set; the sample text information that is the same as or similar to the target information is found, and at least one text information type corresponding to it is taken from the example text set as the target information type.
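The matching-based classification described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names, the example data, and the character-overlap similarity measure are all assumptions made for the sketch.

```python
def similarity(a: str, b: str) -> float:
    """Jaccard similarity over character sets, a simple stand-in measure."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def classify_by_example(target: str, example_set: dict) -> list:
    """Return the type labels of the sample text most similar to `target`."""
    best = max(example_set, key=lambda sample: similarity(target, sample))
    return example_set[best]

# Hypothetical example set: sample text -> its text information types.
examples = {
    "the screen flickers constantly": ["display defect"],
    "battery drains within an hour": ["battery issue", "hardware"],
}
print(classify_by_example("my battery drains in an hour", examples))
# → ['battery issue', 'hardware']
```

A production system would use a learned similarity rather than character overlap; the sketch only shows the shape of the lookup: find the closest sample, return its types.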
The information mining method provided by this embodiment of the disclosure first obtains an example text set and a text to be mined, then performs vector combination and splicing on them to obtain a plurality of extraction splicing vectors, extracts target information from the text to be mined based on these vectors, and finally classifies the target information to obtain a target information type. In this way, target information can be extracted from the text to be mined with the example text set as a reference, and its type can then be obtained, making the acquired target information more accurate.
With further continued reference to fig. 3, a flow 300 of another embodiment of an information mining method according to the present disclosure is shown. The information mining method comprises the following steps:
Step 301, obtaining an example text set and a text to be mined.
In this embodiment, the specific operation of step 301 has been described in detail in step 201 in the embodiment shown in fig. 2, and is not described herein again.
Step 302, inputting the example text set and the text to be mined into a conversion network of the multitask model to obtain a vector of each word in the example text set and a vector of each word in the text to be mined.
In this embodiment, after obtaining the example text set and the text to be mined, the executing body may input them into a conversion network of a multitask model to obtain a vector for each word in the example text set and a vector for each word in the text to be mined. The multitask model is a model that may comprise a plurality of networks which are not entirely identical; parameters may be shared among the networks so that the connections and differences between them can be learned, improving the learning efficiency and quality of each network. The conversion network is the network in the multitask model that converts text into corresponding vectors. Specifically, the example text set and the text to be mined are input into the conversion network, and a vector for each word of both is produced at its output. The conversion network may first convert each word into a sparse vector represented by 0s and 1s; because such a sparse matrix occupies excessive resources, it may be further multiplied by a mapping matrix to obtain a dense, reduced-dimension vector representation of each word, improving computational efficiency.
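The sparse-to-dense conversion described above can be sketched numerically. The vocabulary, dimensions, and random mapping matrix below are placeholder assumptions; the patent does not specify them.

```python
import numpy as np

vocab = ["complaint", "screen", "battery", "flickers"]  # assumed toy vocabulary
embed_dim = 3                                           # assumed reduced dimension
rng = np.random.default_rng(0)
mapping = rng.standard_normal((len(vocab), embed_dim))  # the mapping matrix

def word_vector(word: str) -> np.ndarray:
    one_hot = np.zeros(len(vocab))      # sparse vector of 0s and 1s
    one_hot[vocab.index(word)] = 1.0
    return one_hot @ mapping            # dense, reduced-dimension representation

print(word_vector("battery").shape)  # (3,)
```

Multiplying a one-hot vector by the mapping matrix simply selects the matching row, which is why embedding lookups are implemented as table lookups in practice.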
Step 303, performing vector addition on the vector of each word in the example text set and the vector of each word in the text to be mined to obtain a joint vector.
In this embodiment, after obtaining the vector of each word in the example text set and the vector of each word in the text to be mined, the executing body may further obtain a joint vector by vector addition. For example, suppose the example text set and the text to be mined together have 4 words with vectors word1 = (x1, y1), word2 = (x2, y2), word3 = (x3, y3), and word4 = (x4, y4); adding them gives the joint vector = (x1 + x2 + x3 + x4, y1 + y2 + y3 + y4).
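The numeric example above, written out with concrete values (the four word vectors are arbitrary illustrative numbers):

```python
# Vectors of the 4 words in the example text set and the text to be mined.
word1 = (1.0, 2.0)
word2 = (3.0, 4.0)
word3 = (5.0, 6.0)
word4 = (7.0, 8.0)

# Element-wise vector addition: (x1+x2+x3+x4, y1+y2+y3+y4).
joint = tuple(sum(c) for c in zip(word1, word2, word3, word4))
print(joint)  # → (16.0, 20.0)
```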
Step 304, splicing the vector of each word in the text to be mined with the joint vector respectively to obtain a plurality of extraction splicing vectors.
In this embodiment, after obtaining the joint vector, the executing body may further obtain a plurality of extraction splicing vectors by splicing the vector of each word in the text to be mined with the joint vector. For example, suppose the text to be mined has 2 words with vectors word1 = (x1, y1) and word2 = (x2, y2), and the joint vector obtained in step 303 is (x1 + x2 + x3 + x4, y1 + y2 + y3 + y4). Splicing each word vector with the joint vector yields extraction splicing vector 1 = (x1, y1, x1 + x2 + x3 + x4, y1 + y2 + y3 + y4) and extraction splicing vector 2 = (x2, y2, x1 + x2 + x3 + x4, y1 + y2 + y3 + y4).
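The splicing step above, continued with the same illustrative numbers: each word vector of the text to be mined is concatenated with the joint vector.

```python
joint = (16.0, 20.0)                    # joint vector from the previous step
mined_words = [(1.0, 2.0), (3.0, 4.0)]  # word1 and word2 of the text to be mined

# Tuple concatenation mirrors the splicing (x, y, x1+...+x4, y1+...+y4).
splice_vectors = [w + joint for w in mined_words]
print(splice_vectors[0])  # → (1.0, 2.0, 16.0, 20.0)
```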
Step 305, inputting the plurality of extraction splicing vectors into a starting point judgment network of the multitask model to obtain extraction splicing vectors with starting point labels.
In this embodiment, after the executing body obtains the plurality of extraction splicing vectors, it may further screen out those with starting point labels. The starting point judgment network is another network in the multitask model, used to screen target vectors and label starting points. Specifically, the plurality of extraction splicing vectors may be input into the starting point judgment network, and the extraction splicing vectors with starting point labels are produced at its output. Illustratively, the fully connected layer of the starting point judgment network first converts each extraction splicing vector into a two-dimensional vector; the probability conversion layer then outputs, by calculating a loss value, the probability that the vector is target content and is a starting point; if it is, the vector is labeled with a starting point, yielding all extraction splicing vectors labeled with starting points.
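The start-point head can be sketched as a fully connected layer followed by a softmax over two classes. The weights below are untrained random placeholders and the 0.5 threshold is an assumption; the patent describes the layers only abstractly.

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((4, 2))   # fully connected layer: 4-dim splice -> 2 logits
b = rng.standard_normal(2)

def start_probability(splice: np.ndarray) -> float:
    """Probability that this word is target content and a starting point."""
    logits = splice @ W + b
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()              # softmax: the probability conversion layer
    return float(probs[1])

splices = np.array([[1.0, 2.0, 16.0, 20.0],
                    [3.0, 4.0, 16.0, 20.0]])
start_labels = [i for i, s in enumerate(splices) if start_probability(s) > 0.5]
print(start_labels)
```

The end-point judgment network of step 306 has the same shape, with its own weights.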
It should be noted that each extraction splicing vector is obtained by splicing a word vector of the text to be mined with the joint vector, so each extraction splicing vector corresponds to one word in the text to be mined. Obtaining the extraction splicing vectors labeled with starting points is essentially extracting target segments from the text to be mined, and there may be more than one target segment in the text to be mined.
Step 306, inputting the plurality of extraction splicing vectors into an end point judgment network of the multitask model to obtain extraction splicing vectors with end point labels.
In this embodiment, after the executing body obtains the plurality of extraction splicing vectors, it may further screen out those with end point labels. The end point judgment network is another network in the multitask model, used to screen target vectors and label end points. Specifically, the plurality of extraction splicing vectors may be input into the end point judgment network, and the extraction splicing vectors with end point labels are produced at its output. Illustratively, the fully connected layer of the end point judgment network first converts each extraction splicing vector into a two-dimensional vector; the probability conversion layer then outputs, by calculating a loss value, the probability that the vector is target content and is an end point; if it is, the vector is labeled with an end point, yielding all extraction splicing vectors labeled with end points.
It should be noted that when there are two or more target segments, the extraction splicing vectors identified as target content are first grouped according to whether they belong to the same segment; within each group, the end point of each extraction splicing vector is then determined and labeled, so that each group has its own end point sequence, yielding all extraction splicing vectors labeled with end points.
Step 307, splicing the words in the text to be mined that correspond to the extraction splicing vectors labeled with starting points and those labeled with end points, to obtain the target information.
In this embodiment, after obtaining the extraction splicing vectors labeled with starting points and those labeled with end points, the executing body may further obtain the target information. Because each extraction splicing vector is obtained by splicing a word vector of the text to be mined with the joint vector, each one corresponds to a word in the text to be mined; the words corresponding to the start- and end-labeled vectors can therefore be spliced into the target information. Illustratively, suppose the text to be mined has 5 words and 3 words with starting point labels are extracted, corresponding to start 1, start 2, and start 3, where start 1 denotes the first in order among the 3 start-labeled words, start 2 the first among the remaining 2, and start 3 the last; likewise, 3 words with end point labels are extracted, corresponding to end 1, end 2, and end 3, ordered in the same way. Arranging the 3 start-labeled words and the 3 end-labeled words in their correct positions yields a correctly ordered target sentence, which is the target information in the text to be mined.
It should be noted that there may be two or more pieces of target information. For example, suppose the text to be mined has 5 words and 4 words with starting point labels are extracted, of which 2 belong to one segment (start 1 and start 2) and the other 2 to another segment (start 3 and start 4); similarly, 4 words with end point labels are extracted, of which 2 belong to one segment (end 1 and end 2) and the other 2 to another segment (end 3 and end 4). Based on start 1, start 2, end 1, and end 2, the corresponding 2 words are combined into a short sentence in the correct order; based on start 3, start 4, end 3, and end 4, the other 2 words are combined into another short sentence in the correct order; and the two short sentences are taken together as the target information.
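The assembly of start- and end-labeled words into target segments can be sketched as follows, treating each character as a word for brevity. The rule of pairing the i-th start with the i-th end is a simplifying assumption; the patent's grouping of points into segments is more involved.

```python
def extract_segments(text: str, starts: list, ends: list) -> list:
    """Pair sorted start/end positions and splice out each target segment."""
    return [text[s:e + 1] for s, e in zip(sorted(starts), sorted(ends))]

text = "ABCDE"                                 # a 5-word text to be mined
print(extract_segments(text, [0], [2]))        # → ['ABC']      one segment
print(extract_segments(text, [0, 3], [1, 4]))  # → ['AB', 'DE'] two segments
```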
Step 308, performing vector addition on the vectors of each word of the target information to obtain a target vector.
In this embodiment, after obtaining the target information, the executing body may further obtain a target vector by adding the vectors of its words. For example, suppose the target information has 3 words with vectors word1 = (x1, y1), word2 = (x2, y2), and word3 = (x3, y3); adding them gives the target vector = (x1 + x2 + x3, y1 + y2 + y3).
It should be noted that there may be two or more pieces of target information. For example, suppose two pieces of target information are obtained, each containing 2 words: word1 = (x1, y1) and word2 = (x2, y2) for one, and word3 = (x3, y3) and word4 = (x4, y4) for the other. Adding the word vectors of each piece yields two target vectors, (x1 + x2, y1 + y2) and (x3 + x4, y3 + y4).
Step 309, splicing the target vector and the joint vector to obtain a classification splicing vector.
In this embodiment, after obtaining the target vector, the executing body may further obtain a classification splicing vector. Illustratively, a joint vector = (x1 + x2 + x3 + x4 + x5, y1 + y2 + y3 + y4 + y5) has been obtained in advance and the target vector is (x1 + x2 + x3, y1 + y2 + y3); splicing them gives the classification splicing vector = (x1 + x2 + x3, y1 + y2 + y3, x1 + x2 + x3 + x4 + x5, y1 + y2 + y3 + y4 + y5). It should be noted that when there are 2 or more target vectors, a corresponding number of classification splicing vectors is obtained.
Step 310, inputting the classification splicing vector into a classification network of the multitask model to obtain the target information type.
In this embodiment, after obtaining the classification splicing vector, the executing body may input it into a classification network of the multitask model to obtain the target information type. The classification network is another network in the multitask model, used to determine the type of the target information. Specifically, the classification splicing vector is input into the classification network, and the target information type is produced at its output. Illustratively, at least one classification splicing vector is input into the classification network; its fully connected layer first converts each classification splicing vector into a multi-dimensional vector, and its probability conversion layer then outputs, by calculating a loss value, the probability that the vector belongs to a given type, thereby obtaining the target information type corresponding to each piece of target information. One piece of target information may correspond to a plurality of target information types.
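Since one piece of target information may carry several types, the classification head can be sketched as a fully connected layer with one independent (sigmoid) probability per candidate type. The type inventory, weights, and threshold below are illustrative assumptions, not the patent's parameters.

```python
import numpy as np

types = ["battery issue", "display defect", "service complaint"]  # assumed types
rng = np.random.default_rng(2)
W = rng.standard_normal((4, len(types)))  # fully connected layer
b = rng.standard_normal(len(types))

def classify(class_splice: np.ndarray, threshold: float = 0.5) -> list:
    """Return every type whose independent probability exceeds the threshold."""
    scores = 1.0 / (1.0 + np.exp(-(class_splice @ W + b)))  # sigmoid per type
    return [t for t, s in zip(types, scores) if s > threshold]

labels = classify(np.array([6.0, 6.0, 31.0, 35.0]))  # a classification splicing vector
print(labels)
```

Independent sigmoids rather than a single softmax reflect the multi-label behavior described above: any number of types may exceed the threshold at once.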
As can be seen from fig. 3, compared with the embodiment corresponding to fig. 2, the information mining method in this embodiment obtains the target information and the target information type using a multitask model, which improves the learning efficiency and quality of each network and makes the results more accurate. Because the target information and its type are obtained from splicing vectors, global information is taken into account, and because they are prompted by the example text set, accuracy is further improved. The method also realizes automatic calculation, improves the efficiency of information mining, reduces the cost of data analysis, and can be deployed privately on local infrastructure so that data is effectively protected.
With further continued reference to fig. 4, a flow 400 of yet another embodiment of an information mining method according to the present disclosure is illustrated. The information mining method comprises the following steps:
Step 401, obtaining a plurality of sample complaint segments and an original customer complaint text.
In this embodiment, the executing body may obtain a plurality of sample complaint segments and an original customer complaint text. The sample complaint segments are samples of complaint content; they may be intercepted from a complaint database, or complaint segments that could arise may be generated automatically based on the characteristics of the corresponding product, which the disclosure does not limit. The original customer complaint text is the received complaint content about the corresponding product; besides complaint segments, it may also contain sentences irrelevant to the complaint, and it may be obtained from a customer complaint collection platform.
Step 402, performing vector combination and splicing on the plurality of sample complaint sections and the customer complaint original text to obtain a plurality of extracted splicing vectors.

In this embodiment, after obtaining the sample complaint sections and the customer complaint original text, the execution subject may further obtain a plurality of extracted splicing vectors. Specifically, vector conversion may be performed on the plurality of sample complaint sections and the customer complaint original text to obtain a sample vector corresponding to each sample complaint section and a vector for each word in the customer complaint original text. Vector joint calculation is then performed on the sample vectors corresponding to the sample complaint sections to obtain one joint vector corresponding to all the sample complaint sections, and the vector of each word in the customer complaint original text is spliced with the joint vector to obtain a plurality of extracted splicing vectors.
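The vector combination and splicing described above can be sketched in a few lines. This is a toy illustration rather than the patent's actual model: `embed` is a hypothetical stand-in for the learned conversion network, and the texts and dimensions are invented for the example.

```python
# Toy sketch of "vector combination and splicing". embed() is a hypothetical
# stand-in for the multitask model's conversion network; a real system would
# use learned character/word embeddings.

def embed(text, dim=4):
    """Deterministic toy embedding: one dim-sized vector per character."""
    return [[(ord(ch) % 7 + d) / 10.0 for d in range(dim)] for ch in text]

def joint_vector(sample_segments, dim=4):
    """Element-wise sum of every character vector across all sample complaint
    segments, yielding one joint vector for the whole example set."""
    joint = [0.0] * dim
    for segment in sample_segments:
        for vec in embed(segment, dim):
            joint = [j + v for j, v in zip(joint, vec)]
    return joint

def extraction_concat_vectors(sample_segments, original_text, dim=4):
    """Splice each character vector of the complaint original text with the
    joint vector: one extracted splicing vector (length 2*dim) per character."""
    joint = joint_vector(sample_segments, dim)
    return [vec + joint for vec in embed(original_text, dim)]

vectors = extraction_concat_vectors(["slow service", "rude staff"],
                                    "the staff was rude", dim=4)
# one 8-dimensional splicing vector per character of the original text
```

Downstream, each such spliced vector would be scored by the starting point and end point judgment networks.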
Step 403, extracting a target complaint segment from the customer complaint original text based on the plurality of extracted splicing vectors.
In this embodiment, after obtaining the plurality of extracted splicing vectors, the executing entity may extract the target complaint segment from the customer complaint original text based on them. Specifically, the plurality of extracted splicing vectors may be input into an information extraction model, which takes the plurality of sample complaint sections carried in the extracted splicing vectors as extraction prompts and outputs, at its output end, the target complaint segment extracted from the customer complaint original text.
In some optional implementations of this embodiment, the plurality of sample complaint segments and the customer complaint original text may be input into the conversion network of the multitask model to obtain a vector for each word of the plurality of sample complaint segments and the customer complaint original text; vector addition is performed on these word vectors to obtain a joint vector; the vector of each word of the customer complaint original text is spliced with the joint vector to obtain a plurality of extracted splicing vectors; the plurality of extracted splicing vectors are input into a starting point judgment network of the multitask model to obtain an extracted splicing vector with a starting point label; the plurality of extracted splicing vectors are input into an end point judgment network of the multitask model to obtain an extracted splicing vector with an end point label; and the words in the customer complaint original text between the position of the extracted splicing vector with the starting point label and the position of the extracted splicing vector with the end point label are spliced to obtain the target complaint segment.
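The start/end labelling step in the optional implementation above amounts to span selection. A minimal sketch, in which hand-set scores stand in for the outputs of the starting point and end point judgment networks (the text, scores, and threshold are all invented for illustration):

```python
# Hedged sketch: recovering a target span from start/end point labels.
# In the patent, two judgment networks score each extracted splicing
# vector; here the scores are supplied directly as a stand-in.

def extract_span(text, start_scores, end_scores, threshold=0.5):
    """Pick the first position whose start score exceeds the threshold,
    then the first position at or after it whose end score does, and
    splice the characters in between into the target segment."""
    start = next((i for i, s in enumerate(start_scores) if s > threshold), None)
    if start is None:
        return ""
    end = next((j for j in range(start, len(text)) if end_scores[j] > threshold), None)
    if end is None:
        return ""
    return text[start:end + 1]

text = "product ok but attitude is bad today"
start_scores = [0.0] * len(text)
end_scores = [0.0] * len(text)
start_scores[15] = 0.9   # "attitude ..." begins at index 15
end_scores[29] = 0.9     # "... is bad" ends at index 29
span = extract_span(text, start_scores, end_scores)
```

A production system would typically take argmax over the two score sequences instead of a fixed threshold; the threshold version keeps the sketch short.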
Step 404, classifying the target complaint segment to obtain the complaint type.
In this embodiment, after obtaining the target complaint segment, the executing entity may classify the target complaint segment to obtain the complaint type. Specifically, the obtained target complaint segment may be matched against the plurality of sample complaint segments to find a sample complaint segment that is the same as or similar to the target complaint segment, and at least one predetermined complaint type corresponding to that sample complaint segment is taken as the target complaint type.
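The matching-based classification just described can be sketched with a generic string-similarity measure. This is an illustration only: `difflib` similarity is a stand-in for whatever matcher a real system would use, and the sample segments and type labels are invented.

```python
# Hedged sketch: classify an extracted target segment by matching it
# against sample complaint segments with predetermined types.
# difflib.SequenceMatcher is a toy stand-in for a real similarity model.
import difflib

def match_complaint_type(target_segment, typed_samples):
    """typed_samples: list of (sample_segment, complaint_type) pairs.
    Returns the complaint type of the most similar sample segment."""
    best_type, best_score = None, -1.0
    for sample, ctype in typed_samples:
        score = difflib.SequenceMatcher(None, target_segment, sample).ratio()
        if score > best_score:
            best_type, best_score = ctype, score
    return best_type

samples = [("salesman attitude is bad", "service attitude"),
           ("product broke after a week", "product quality")]
ctype = match_complaint_type("attitude is also not good", samples)
```

The patent's own embodiments instead use a learned classification network over a classification splicing vector, described next.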
In some optional implementations of this embodiment, vector addition may be performed on the vectors of the words of the target complaint segment to obtain a target vector; the target vector and the joint vector are spliced to obtain a classification splicing vector; and the classification splicing vector is input into the classification network of the multitask model to obtain the target complaint type.
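Forming the classification splicing vector is a sum followed by a concatenation. A minimal sketch, with toy two-dimensional vectors standing in for the learned character vectors and joint vector (the classification network itself is omitted):

```python
# Hedged sketch: building the "classification splicing vector".
# The character vectors and joint vector are invented toy values; in the
# patent they come from the multitask model's conversion network.

def sum_vectors(char_vectors):
    """Element-wise vector addition over the target segment's characters."""
    dim = len(char_vectors[0])
    out = [0.0] * dim
    for vec in char_vectors:
        out = [o + v for o, v in zip(out, vec)]
    return out

def classification_vector(target_char_vectors, joint):
    """Target vector spliced with the joint vector: the classifier input."""
    return sum_vectors(target_char_vectors) + joint

target_chars = [[0.1, 0.2], [0.3, 0.4]]   # two-character target segment
joint = [1.0, 1.0]                        # joint vector from the example set
clf_input = classification_vector(target_chars, joint)
# clf_input has length 2*dim: summed target vector followed by joint vector
```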
As can be seen from fig. 4, compared with the embodiment corresponding to fig. 2, the information mining method in this embodiment can be applied to scenarios in which a target complaint segment and a complaint type are obtained, and it improves the accuracy and efficiency of obtaining them.
With further continued reference to fig. 5, a schematic diagram 500 of an information mining method according to the present disclosure is shown. As can be seen from fig. 5, an example text set and a text to be mined may be input into the conversion network of the multitask model. The input text to be mined is, for example: "went to the store last week; the product is good, but the salesman's explanation was not professional enough, and the attitude was also not good". The conversion network outputs the vector of each word in the example text set and the vector of each word in the text to be mined, and vector addition is performed on these vectors to obtain a joint vector. The vector of each word in the text to be mined is then spliced with the joint vector to obtain a plurality of extracted splicing vectors, which are input into the starting point judgment network and the end point judgment network of the multitask model, so that the target information can be extracted from the text to be mined. For example, the extracted target information is: "explanation was not professional enough" and "the attitude was also not good". Further, vector addition may be performed on the vectors of the words in the target information, and the result is spliced with the joint vector to obtain a classification splicing vector, which is input into the classification network of the multitask model to finally obtain the target information type. The method improves the accuracy and efficiency of obtaining the target information and the target information type.
With further reference to fig. 6, as an implementation of the information mining method, the present disclosure provides an embodiment of an information mining apparatus, which corresponds to the method embodiment shown in fig. 2, and which may be applied in various electronic devices in particular.
As shown in fig. 6, the information mining apparatus 600 of this embodiment may include an obtaining module 601, a splicing module 602, an extracting module 603, and a classifying module 604. The obtaining module 601 is configured to obtain an example text set and a text to be mined; a splicing module 602 configured to perform vector combination and splicing on the example text set and the text to be mined to obtain a plurality of extracted spliced vectors; an extraction module 603 configured to extract target information from the text to be mined based on a plurality of extraction concatenation vectors; a classification module 604 configured to classify the target information to obtain a target information type.
In this embodiment, for the specific processing of the obtaining module 601, the splicing module 602, the extracting module 603, and the classifying module 604 of the information mining apparatus 600 and the technical effects thereof, reference may be made to the related descriptions of steps 201-204 in the embodiment corresponding to fig. 2, which are not repeated here.
In some optional implementations of this embodiment, the extraction module 603 includes: the first extraction submodule is configured to input a plurality of extraction splicing vectors into a starting point judgment network of the multitask model to obtain extraction splicing vectors with starting point labels; the second extraction submodule is configured to input the plurality of extraction splicing vectors into an end point judgment network of the multitask model to obtain extraction splicing vectors with end point labels; and the first splicing submodule is configured to splice the extracted splicing vector with the starting point label and the characters in the text to be mined corresponding to the extracted splicing vector with the end point label to obtain the target information.
In some optional implementations of this embodiment, the splicing module 602 includes: the conversion sub-module is configured to input the example text set and the text to be mined into a conversion network of the multi-task model, and obtain a vector of each word in the example text set and a vector of each word in the text to be mined; the first adding submodule is configured to perform vector addition on the vector of each word in the example text set and the vector of each word in the text to be mined to obtain a joint vector; and the second splicing submodule is configured to splice the vector of each word in the text to be mined with the joint vector respectively to obtain a plurality of extracted splicing vectors.
In some optional implementations of this embodiment, the classification module 604 includes: the second addition submodule is configured to perform vector addition on the vector of each word of the target information to obtain a target vector; the third splicing sub-module is configured to splice the target vector and the joint vector to obtain a classified splicing vector; and the first classification submodule is configured to input the classification splicing vector into a classification network of the multi-task model to obtain the target information type.
In some optional implementation manners of the embodiment, the example text set includes a plurality of sample complaint sections, and the text to be mined is a customer complaint original text; the splicing module 602 includes: the fourth splicing submodule is configured to perform vector combination and splicing on the multiple sample complaint fragments and the complaint original text to obtain multiple extracted splicing vectors; the extraction module 603 includes: a third extraction submodule configured to extract a target complaint segment from the customer complaint original text based on the plurality of extraction concatenation vectors; the classification module 604 includes: and the second classification submodule is configured to classify the target complaint segment to obtain the complaint type.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the device 700 comprises a computing unit 701 which may perform various suitable actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
A number of components in the device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 701 may be any of various general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 executes the respective methods and processes described above, such as the information mining method. For example, in some embodiments, the information mining method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 708. In some embodiments, part or all of the computer program may be loaded onto and/or installed onto device 700 via ROM 702 and/or communication unit 709. When the computer program is loaded into RAM 703 and executed by the computing unit 701, one or more steps of the information mining method described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the information mining method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described herein may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a server of a distributed system or a server incorporating a blockchain. The server can also be a cloud server, or an intelligent cloud computing server or an intelligent cloud host with artificial intelligence technology.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved, and the present disclosure is not limited herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (13)

1. An information mining method, comprising:
acquiring an example text set and a text to be mined;
performing vector combination and splicing on the example text set and the text to be mined to obtain a plurality of extracted splicing vectors, comprising: performing vector joint calculation on sample vectors corresponding to sample information texts in the example text set to obtain a joint vector corresponding to all the sample information texts, and splicing the vectors of the words in the text to be mined with the joint vector to obtain the plurality of extracted splicing vectors;
extracting target information from the text to be mined based on the plurality of extraction splicing vectors;
and classifying the target information to obtain the type of the target information.
2. The method of claim 1, wherein said extracting target information from the text to be mined based on the plurality of extracted stitching vectors comprises:
inputting the plurality of extracted splicing vectors into a starting point judgment network of the multitask model to obtain extracted splicing vectors with starting point labels;
inputting the plurality of extracted splicing vectors into an end point judgment network of the multitask model to obtain extracted splicing vectors with end point labels;
and splicing the extracted splicing vector with the starting point label and the characters in the text to be mined corresponding to the extracted splicing vector with the end point label to obtain the target information.
3. The method of claim 2, wherein the vector joining and stitching the example text set and the text to be mined to obtain a plurality of extracted stitching vectors comprises:
inputting the example text set and the text to be mined into a conversion network of the multi-task model to obtain a vector of each word in the example text set and a vector of each word in the text to be mined;
vector addition is carried out on the vector of each word in the example text set and the vector of each word in the text to be mined to obtain a joint vector;
and splicing the vector of each word in the text to be mined with the joint vector respectively to obtain a plurality of extracted spliced vectors.
4. The method of claim 3, wherein the classifying the target information to obtain a target information type comprises:
vector addition is carried out on the vector of each word of the target information to obtain a target vector;
splicing the target vector and the joint vector to obtain a classified spliced vector;
and inputting the classified splicing vector into a classification network of the multi-task model to obtain the target information type.
5. The method of any of claims 1-4, wherein the example text set comprises a plurality of sample complaint segments, and the text to be mined is a customer complaint original text;
the vector combination and splicing of the example text set and the text to be mined to obtain a plurality of extracted spliced vectors comprises:
performing vector combination and splicing on the sample complaint sections and the complaint original texts to obtain a plurality of extracted spliced vectors;
the extracting target information from the text to be mined based on the plurality of extraction splicing vectors comprises:
extracting a target complaint segment from the customer complaint original text based on the plurality of extraction splicing vectors;
the classifying the target information to obtain the target information type includes:
and classifying the target complaint segments to obtain the complaint types.
6. An information mining apparatus, the apparatus comprising:
the acquisition module is configured to acquire an example text set and a text to be mined;
the splicing module is configured to perform vector combination and splicing on the example text set and the text to be mined to obtain a plurality of extracted splicing vectors;
the extraction module is configured to extract target information from the text to be mined based on the plurality of extraction splicing vectors;
the classification module is configured to classify the target information to obtain a target information type;
the stitching model is further configured to perform the vector union and stitching of the example text set and the text to be mined to obtain a plurality of extracted stitching vectors as follows: and carrying out vector joint calculation on sample vectors corresponding to the sample information texts in the example text set to obtain joint vectors corresponding to all the sample information texts, and splicing the vectors of the characters in the texts to be mined and the joint vectors to obtain a plurality of extracted spliced vectors.
7. The apparatus of claim 6, wherein the extraction module comprises:
the first extraction submodule is configured to input the plurality of extraction splicing vectors into a starting point judgment network of the multitask model to obtain extraction splicing vectors with starting point labels;
the second extraction submodule is configured to input the plurality of extraction splicing vectors into an end point judgment network of the multitask model to obtain extraction splicing vectors with end point labels;
and the first splicing submodule is configured to splice the extracted splicing vector with the starting point label and the characters in the text to be mined corresponding to the extracted splicing vector with the end point label to obtain the target information.
8. The apparatus of claim 7, wherein the splicing module comprises:
a conversion sub-module configured to input the example text set and the text to be mined into a conversion network of the multitask model, and obtain a vector of each word in the example text set and a vector of each word in the text to be mined;
the first adding submodule is configured to perform vector addition on the vector of each word in the example text set and the vector of each word in the text to be mined to obtain a joint vector;
and the second splicing submodule is configured to splice the vector of each word of the text to be mined with the joint vector respectively to obtain the plurality of extracted splicing vectors.
9. The apparatus of claim 8, wherein the classification module comprises:
the second addition submodule is configured to perform vector addition on the vector of each word of the target information to obtain a target vector;
the third splicing sub-module is configured to splice the target vector and the joint vector to obtain a classified splicing vector;
and the first classification submodule is configured to input the classification splicing vector into a classification network of the multi-task model to obtain the target information type.
10. The apparatus of any of claims 6-9, wherein the example text set comprises a plurality of sample complaint segments, the text to be mined being a complaint original;
the splicing module includes:
the fourth splicing submodule is configured to perform vector combination and splicing on the sample complaint sections and the complaint original text to obtain a plurality of extracted splicing vectors;
the extraction module comprises:
a third extraction submodule configured to extract a target complaint segment from the customer complaint original text based on the plurality of extraction stitching vectors;
the classification module comprises:
and the second classification submodule is configured to classify the target complaint segment to obtain a complaint type.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-5.
13. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1-5.
CN202210104310.2A 2022-01-28 2022-01-28 Information mining method and device Active CN114491040B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210104310.2A CN114491040B (en) 2022-01-28 2022-01-28 Information mining method and device

Publications (2)

Publication Number Publication Date
CN114491040A CN114491040A (en) 2022-05-13
CN114491040B true CN114491040B (en) 2022-12-02

Family

ID=81477023

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109325167A (en) * 2017-07-31 2019-02-12 株式会社理光 Characteristic analysis method, device, equipment, computer readable storage medium
CN110569500A (en) * 2019-07-23 2019-12-13 平安国际智慧城市科技股份有限公司 Text semantic recognition method and device, computer equipment and storage medium
CN111198939A (en) * 2019-12-27 2020-05-26 北京健康之家科技有限公司 Statement similarity analysis method and device and computer equipment
CN111382271A (en) * 2020-03-09 2020-07-07 支付宝(杭州)信息技术有限公司 Training method and device of text classification model and text classification method and device
CN111401063A (en) * 2020-06-03 2020-07-10 腾讯科技(深圳)有限公司 Text processing method and device based on multi-pool network and related equipment
CN112417153A (en) * 2020-11-20 2021-02-26 虎博网络技术(上海)有限公司 Text classification method and device, terminal equipment and readable storage medium
CN112820367A (en) * 2021-01-11 2021-05-18 平安科技(深圳)有限公司 Medical record information verification method and device, computer equipment and storage medium
CN113157876A (en) * 2021-03-18 2021-07-23 平安普惠企业管理有限公司 Information feedback method, device, terminal and storage medium
CN113763156A (en) * 2021-09-14 2021-12-07 深圳前海微众银行股份有限公司 Model training method, device, equipment and storage medium
CN113821589A (en) * 2021-06-10 2021-12-21 腾讯科技(深圳)有限公司 Text label determination method and device, computer equipment and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110569508A (en) * 2019-09-10 2019-12-13 重庆邮电大学 Method and system for classifying emotional tendencies by fusing part-of-speech and self-attention mechanism
CN111159412B (en) * 2019-12-31 2023-05-12 腾讯科技(深圳)有限公司 Classification method, classification device, electronic equipment and readable storage medium
CN111914097A (en) * 2020-07-13 2020-11-10 吉林大学 Entity extraction method and device based on attention mechanism and multi-level feature fusion
CN113392210A (en) * 2020-11-30 2021-09-14 腾讯科技(深圳)有限公司 Text classification method and device, electronic equipment and storage medium
CN112507039A (en) * 2020-12-15 2021-03-16 苏州元启创人工智能科技有限公司 Text understanding method based on external knowledge embedding

Also Published As

Publication number Publication date
CN114491040A (en) 2022-05-13

Similar Documents

Publication Publication Date Title
CN112926306B (en) Text error correction method, device, equipment and storage medium
CN113657483A (en) Model training method, target detection method, device, equipment and storage medium
CN114429633A (en) Text recognition method, model training method, device, electronic equipment and medium
CN113657395A (en) Text recognition method, and training method and device of visual feature extraction model
CN114428677A (en) Task processing method, processing device, electronic equipment and storage medium
CN114445047A (en) Workflow generation method and device, electronic equipment and storage medium
CN114861677A (en) Information extraction method, information extraction device, electronic equipment and storage medium
CN112699237B (en) Label determination method, device and storage medium
CN114186681A (en) Method, apparatus and computer program product for generating model clusters
CN113378855A (en) Method for processing multitask, related device and computer program product
CN113407610A (en) Information extraction method and device, electronic equipment and readable storage medium
CN113657088A (en) Interface document analysis method and device, electronic equipment and storage medium
CN114491040B (en) Information mining method and device
CN114461665B (en) Method, apparatus and computer program product for generating a statement transformation model
CN115329132A (en) Method, device and equipment for generating video label and storage medium
CN115687717A (en) Method, device and equipment for acquiring hook expression and computer readable storage medium
CN113360672B (en) Method, apparatus, device, medium and product for generating knowledge graph
CN114444514A (en) Semantic matching model training method, semantic matching method and related device
CN114329164A (en) Method, apparatus, device, medium and product for processing data
CN113806522A (en) Abstract generation method, device, equipment and storage medium
CN113850072A (en) Text emotion analysis method, emotion analysis model training method, device, equipment and medium
CN113326461A (en) Cross-platform content distribution method, device, equipment and storage medium
CN112560481A (en) Statement processing method, device and storage medium
CN115965018B (en) Training method of information generation model, information generation method and device
CN112989797B (en) Model training and text expansion methods, devices, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant