CN111444326A - Text data processing method, device, equipment and storage medium - Google Patents

Text data processing method, device, equipment and storage medium Download PDF

Info

Publication number
CN111444326A
Authority
CN
China
Prior art keywords
keyword
text
domain
sample data
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010239303.4A
Other languages
Chinese (zh)
Other versions
CN111444326B (en)
Inventor
缪畅宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010239303.4A
Publication of CN111444326A
Application granted
Publication of CN111444326B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/332Query formulation
    • G06F16/3329Natural language query formulation or dialogue systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification

Abstract

The embodiment of the application discloses a text data processing method, apparatus, device, and storage medium. The method comprises the following steps: determining a first keyword in initial sample data, and acquiring, from a keyword database, candidate text data corresponding to a second keyword that has an association relation with the first keyword; determining the association degree between the initial sample data and the candidate text data, screening out the candidate text data whose association degree meets a sample screening condition, and taking the screened candidate text data as enhanced text data; determining a training sample pair according to the enhanced text data and the initial sample data; training an initial text matching model used for capturing keyword identifications based on the training sample pair, and determining, according to the trained initial text matching model, a target text matching model used for predicting the matching degree of a prediction sample pair. By the method and apparatus, the recognition capability for keywords can be improved, and the accuracy of text matching further improved.

Description

Text data processing method, device, equipment and storage medium
Technical Field
The present application relates to the field of internet technologies, and in particular, to a method, an apparatus, a device, and a storage medium for processing text data.
Background
With the development of Artificial Intelligence (AI), Natural Language Processing (NLP) is widely used in fields such as search, recommendation, and dialogue. Generally, in a text pair, text A refers to a user question, and text B refers to a content source to be matched, such as a question in a question-and-answer library, the content of a web page, or a text description of a product.
For ease of understanding, take a text processing system in the existing search field as an example. When two text data in a certain text pair are matched (for example, text A says that songs sung by one singer sound good, while text B says that songs sung by a different singer sound good), the two text data have high surface similarity, so the text matching model in the text processing system may mistakenly consider the two text data to belong to similar text data, and the text processing system finally outputs text data that does not match the text A entered by the user. Therefore, when text matching is performed with the prior art, some easily confusable text data (for example, literally similar or semantically similar) are inevitably difficult to distinguish, which further reduces the accuracy of text matching.
Disclosure of Invention
The embodiment of the application provides a text data processing method, apparatus, device, and storage medium, which can improve the recognition capability for keywords and further improve the accuracy of text matching.
An embodiment of the present application provides a text data processing method, including:
acquiring initial sample data, determining a first keyword in the initial sample data through a domain keyword in a keyword database, and acquiring candidate text data corresponding to a second keyword that has an association relation with the first keyword;
determining the association degree between the initial sample data and the candidate text data, screening the candidate text data of which the association degree meets the sample screening condition from the candidate text data, and taking the screened candidate text data as the enhanced text data corresponding to the initial sample data;
determining a training sample pair having an association relation with the keyword database according to the enhanced text data and the initial sample data; each sample data in the training sample pair carries a keyword identification corresponding to a domain keyword in the keyword database;
training an initial text matching model used for capturing the keyword identification based on the training sample pair, and determining the trained initial text matching model as a target text matching model; the target text matching model is subsequently used for predicting the matching degree of the obtained prediction sample pair.
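The four claimed steps form a data-augmentation pipeline. A minimal sketch in Python, assuming keywords are matched by simple substring containment and the association degree is the Jaccard overlap of keyword sets — the claims leave both choices open:

```python
def build_training_pairs(initial_samples, keyword_db, text_pool,
                         lower=0.3, upper=0.9):
    """Sketch of the claimed steps: keyword identification, candidate
    retrieval, association-degree screening, training-pair assembly."""
    pairs = []
    for sample in initial_samples:
        # Step 1: first keywords = domain keywords found in the sample.
        first_kws = {w for w in keyword_db if w in sample}
        # Candidate texts share at least one keyword with the sample.
        candidates = [t for t in text_pool if any(k in t for k in first_kws)]
        for cand in candidates:
            cand_kws = {w for w in keyword_db if w in cand}
            # Step 2: association degree as keyword-set overlap (Jaccard).
            degree = len(first_kws & cand_kws) / max(len(first_kws | cand_kws), 1)
            # Keep only confusable-but-not-identical candidates.
            if lower < degree < upper:
                # Step 3: pair the enhanced text with the initial sample.
                pairs.append((sample, cand))
    return pairs
```

The `lower`/`upper` thresholds correspond to the sample screening condition described later; their values here are placeholders.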
An aspect of an embodiment of the present application provides a text data processing apparatus, where the apparatus includes:
the keyword identification module is used for acquiring initial sample data, determining a first keyword in the initial sample data through a domain keyword in a keyword database, and acquiring candidate text data corresponding to a second keyword that has an association relation with the first keyword;
the association degree determining module is used for determining the association degree between the initial sample data and the candidate text data, screening the candidate text data of which the association degree meets the sample screening condition from the candidate text data, and taking the screened candidate text data as the enhanced text data corresponding to the initial sample data;
the training pair determining module is used for determining a training sample pair which has an association relation with the keyword database according to the enhanced text data and the initial sample data; each sample data in the training sample pair carries a keyword identification corresponding to a domain keyword in the keyword database;
the target model determining module is used for training an initial text matching model used for capturing the keyword identification based on the training sample pair, and determining the trained initial text matching model as a target text matching model; the target text matching model is subsequently used for predicting the matching degree of the obtained prediction sample pair.
The initial sample data is text data in a sample labeling area, and the sample labeling area is an area in a text database which has an association relation with the initial sample data;
the device still includes:
the associated text acquisition module is used for determining, in the sample labeling area, the domain to which the initial sample data belongs as a first domain, and acquiring the associated text matched with the domain label of the first domain from the text database; the text database comprises a second domain in addition to the first domain;
the domain dictionary building module is used for screening and determining domain keywords matched with the first domain from candidate words formed by word segmentation of the associated text based on keyword screening conditions associated with the text database, and building a first domain dictionary corresponding to the first domain based on the domain keywords matched with the first domain;
and the keyword library determining module is used for acquiring a second field dictionary corresponding to the second field and determining a keyword database associated with the sample labeling area based on the first field dictionary and the second field dictionary.
Wherein, the domain dictionary construction module comprises:
the word segmentation processing unit is used for carrying out word segmentation processing on the associated text to obtain a word segmentation set associated with the word segmentation of the associated text, combining each word segmentation in the word segmentation set to obtain a candidate word associated with the associated text, and determining the cross correlation degree between each word segmentation in the candidate words;
the candidate word screening unit is used for acquiring a cross-correlation threshold value in the keyword screening conditions associated with the text database, screening candidate words with the cross-correlation degree larger than the cross-correlation threshold value from the candidate words, and taking the screened candidate words as character strings to be processed;
the influence degree determining unit is used for determining the influence degree of the character string to be processed in the first field, screening the character string to be processed with the influence degree reaching the keyword screening condition from the character string to be processed, and taking the screened character string to be processed as a field keyword matched with the first field; the influence degree is determined by the frequency of the character string to be processed appearing in the first field and the frequency of the character string to be processed appearing in the second field;
and the domain dictionary constructing unit is used for constructing a first domain dictionary corresponding to the first domain based on the domain keywords matched with the first domain.
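The influence degree above compares a candidate string's frequency inside the first domain with its frequency in other domains. A hedged sketch, assuming the score is the ratio f1/(f1+f2) — the text only states that both frequencies determine the score, not the exact formula:

```python
def influence_degree(term, first_domain_texts, other_domain_texts):
    """Hypothetical influence score: relative frequency of `term` in the
    first domain versus all other domains. Terms appearing mostly in the
    first domain score close to 1."""
    f1 = sum(t.count(term) for t in first_domain_texts)
    f2 = sum(t.count(term) for t in other_domain_texts)
    return f1 / (f1 + f2) if f1 else 0.0

def build_domain_dictionary(candidates, first_texts, other_texts, threshold=0.8):
    # Keep candidate strings whose influence degree reaches the keyword
    # screening condition (threshold value is a placeholder).
    return [w for w in candidates
            if influence_degree(w, first_texts, other_texts) >= threshold]
```

A domain-generic word like "music" that also appears in other domains scores lower than a word unique to the first domain, so it is filtered out of the first domain dictionary.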
Wherein, keyword recognition module includes:
the keyword identification unit is used for acquiring initial sample data from the sample labeling area, acquiring a first domain dictionary from the keyword database, and identifying domain keywords in the initial sample data based on the first domain dictionary;
a keyword determining unit, configured to take a domain keyword recognized in the initial sample data as the first keyword;
a target text acquisition unit, configured to acquire a target associated text including a first keyword from associated texts included in the keyword database, and use a domain keyword in the target associated text as a second keyword;
and the candidate text determining unit is used for taking the target associated text containing the second keyword as candidate text data corresponding to the second keyword which has an association relation with the first keyword.
Wherein, the relevancy determination module comprises:
the relevancy determining unit is used for determining the relevancy between the initial sample data and the candidate text data according to the coverage ratio between the first keyword in the initial sample data and the second keyword in the candidate text data;
the association degree sorting unit is used for sorting the association degrees in the candidate text data to obtain text data to be processed corresponding to the candidate text data;
the text to be processed screening unit is used for screening the text data to be processed with the association degree larger than a first association threshold value and smaller than a second association threshold value from the sorted text data to be processed;
the enhanced text determining unit is used for taking the screened text data to be processed as enhanced text data corresponding to the initial sample data; the first association threshold is less than the second association threshold, and the first association threshold and the second association threshold are both thresholds in the sample screening condition.
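The two-threshold screening above keeps candidates that are related enough to be confusable with the initial sample, but not near-duplicates of it. An illustrative sketch (threshold values are placeholders):

```python
def screen_enhanced_texts(scored_candidates, lower=0.2, upper=0.9):
    """Sort candidates by association degree, then keep only those whose
    degree lies strictly between the first and second association
    thresholds of the sample screening condition."""
    ranked = sorted(scored_candidates, key=lambda pair: pair[1], reverse=True)
    return [text for text, degree in ranked if lower < degree < upper]
```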
Wherein, the training sample pair comprises first sample data and second sample data; the first sample data comprises initial sample data carrying a keyword identification; the second sample data comprises enhanced text data carrying a keyword identification;
the object model determination module includes:
a domain word recognition unit, configured to use the initial text matching model to take a domain keyword corresponding to the keyword identifier in the first sample data as a first domain keyword, and take a domain keyword corresponding to the keyword identifier in the second sample data as a second domain keyword;
the segmentation feature extraction unit is used for acquiring first segmentation feature information of a first segmentation in the first sample data and second segmentation feature information of a second segmentation in the second sample data;
the model training unit is used for training the training sample pairs based on the first segmentation feature information, the second segmentation feature information, the first domain keywords, the second domain keywords and the initial text matching model to obtain training classification results;
and the target model determining unit is used for determining the trained initial text matching model as the target text matching model when the training classification result is detected to meet the classification convergence condition.
The initial text matching model comprises a text matching model in a first service scene; the text matching model under the first service scene comprises a keyword attention layer, a fusion layer and a classification layer;
the model training unit includes:
the first attention output subunit is used for inputting the first word segmentation characteristic information of the first word segmentation and the second word segmentation characteristic information of the second domain keyword into the keyword attention layer and outputting first attention characteristic information corresponding to the keyword attention layer; the first attention characteristic information is used for representing the correlation between the second domain keyword and the first segmentation;
the second attention output subunit is used for inputting second word segmentation characteristic information of the second word segmentation and first word segmentation characteristic information of the first domain keyword into the keyword attention layer and outputting second attention characteristic information corresponding to the keyword attention layer; the second attention characteristic information is used for representing the correlation between the first domain keyword and the second participle;
the semantic feature fusion subunit is used for acquiring first semantic feature information of the first sample data and second semantic feature information of the second sample data, and performing semantic fusion on the first semantic feature information and the second semantic feature information to obtain fused semantic feature information;
and the fusion vector output subunit is used for inputting the first attention feature information, the second attention feature information and the fused semantic feature information into the fusion layer, outputting the fused feature vector corresponding to the training sample pair, and outputting the training classification result of the training sample pair through the classification layer.
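The keyword attention layer scores one sample's participles against the other sample's domain keywords. A minimal NumPy sketch, assuming dot-product attention with softmax normalization — the embodiment does not fix the scoring function:

```python
import numpy as np

def keyword_attention(keyword_vecs, token_vecs):
    """keyword_vecs: (K, D) embeddings of the other sample's domain keywords.
    token_vecs:   (T, D) embeddings of this sample's participles.
    Returns (K, D) attention features characterizing keyword-token correlation."""
    scores = keyword_vecs @ token_vecs.T               # (K, T) relevance scores
    scores -= scores.max(axis=1, keepdims=True)        # numeric stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)      # softmax over tokens
    return weights @ token_vecs                        # keyword-aware token summary
```

Applying the layer twice — once per direction, as in the two subunits above — yields the first and second attention feature information that the fusion layer consumes.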
The initial text matching model comprises a text matching model in a second service scene; the text matching model under the second service scene comprises a feature combination layer, an average pooling layer, a full connection layer and a classification layer; the first participle comprises M first sub-participles other than the first domain keyword; the second participle comprises N second sub-participles other than the second domain keyword; M and N are positive integers;
the model training unit includes:
a first feature obtaining subunit, configured to determine, in the first sample data, first sub-position information of M first sub-participles and second sub-position information of the first domain keyword based on the first participle feature information, and obtain, according to the M first sub-position information and the second sub-position information, first autocorrelation feature information of a first autocorrelation word formed by the M first sub-participles and the first domain keyword in the first sample data;
a second feature obtaining subunit, configured to determine, in second sample data, third sub-position information of the N second sub-participles and fourth sub-position information of the second domain keyword based on the second participle feature information, and obtain, in the second sample data, second autocorrelation feature information of a second autocorrelation word formed by the N second sub-participles and the second domain keyword according to the N third sub-position information and the fourth sub-position information;
the interactive feature obtaining subunit is configured to obtain interactive feature information corresponding to the cross-correlation word between the first sample data and the second sample data;
and the pooling vector output subunit is used for outputting the pooling vectors corresponding to the average pooling layer by taking the first autocorrelation characteristic information, the second autocorrelation characteristic information and the interaction characteristic information as input characteristics of the average pooling layer, and training the training sample pairs according to the pooling vectors, the full-link layer and the classification layer to obtain training classification results.
Wherein, the model training unit further comprises:
a first related word determining subunit, configured to perform word segmentation combination on the M first sub-word segmentations and the first domain keyword in the first sample data according to the M first sub-position information and the second sub-position information, and use a combined word obtained after the word segmentation combination as a first auto-related word of the first sample data;
a second related word determining subunit, configured to perform word segmentation combination on the N second sub-participles and the second domain keyword in the second sample data according to the N third sub-position information and the fourth sub-position information, and use a combined word obtained after the word segmentation combination as a second auto-related word of the second sample data;
and the cross-correlation word determining subunit is used for performing word segmentation combination on the M first sub-participles, the first field keyword, the N second sub-participles and the second field keyword, and taking a combined word obtained after the word segmentation combination as a cross-correlation word between the first sample data and the second sample data.
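The three subunits above build auto-correlation words within each sample and cross-correlation words across the pair. A sketch assuming pairwise combination, which is one plausible reading of "word segmentation combination":

```python
from itertools import product

def build_correlation_words(first_subwords, first_kw, second_subwords, second_kw):
    """Auto-correlation words combine a sample's own sub-participles with its
    domain keyword; cross-correlation words combine tokens across the two
    samples (including both domain keywords)."""
    auto_first = [(w, first_kw) for w in first_subwords]
    auto_second = [(w, second_kw) for w in second_subwords]
    cross = list(product(first_subwords + [first_kw],
                         second_subwords + [second_kw]))
    return auto_first, auto_second, cross
```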
Wherein, the model training unit further comprises:
the first feature identification subunit is configured to, if the text matching model in the second service scenario identifies that a cross-correlation word having the same content as the first auto-correlation word exists in the cross-correlation words, perform feature identification on the identified cross-correlation word having the same content as the first auto-correlation word in the cross-correlation words to obtain a first identification word segment; the interactive feature information corresponding to the first identification participle is different from the first autocorrelation feature information of the first autocorrelation word corresponding to the first identification participle;
the second feature identification subunit is configured to, if the text matching model in the second service scenario identifies that a cross-correlation word having the same content as the second auto-correlation word exists in the cross-correlation words, perform feature identification on the identified cross-correlation word having the same content as the second auto-correlation word in the cross-correlation words to obtain a second identification word segment; the interactive feature information corresponding to the second identification participle is different from the second autocorrelation feature information of the second autocorrelation word corresponding to the second identification participle.
Wherein, the pooling vector output subunit comprises:
the first combined word screening subunit is used for screening combined words containing first domain keywords and second domain keywords from first auto-related words corresponding to the first auto-related feature information, second auto-related words corresponding to the second auto-related feature information and cross-related words corresponding to the interaction feature information, taking the screened combined words as first classified combined words, and obtaining first combined feature information corresponding to the first classified combined words;
the second combined word determining unit is used for taking the combined words except the first field key words and the second field key words in the first auto-related words, the second auto-related words and the cross-related words as second classified combined words and acquiring second combined characteristic information corresponding to the second classified combined words;
the characteristic vector acquiring subunit is used for acquiring a first characteristic vector corresponding to the first combined characteristic information and a second characteristic vector corresponding to the second combined characteristic information;
the first adjustment training unit is used for adjusting vector values in the second feature vector, taking the vector values of the first feature vector and the adjusted vector values of the second feature vector as first model parameters of the text matching model in the second service scene, inputting the first feature vector and the adjusted second feature vector into the average pooling layer corresponding to the first model parameters, outputting a first pooling vector corresponding to the average pooling layer, and training the training sample pair according to the first pooling vector, the full connection layer and the classification layer to obtain a training classification result corresponding to the first model parameters;
and the second adjustment training unit is used for adjusting the vector values in the first feature vector if the training classification result corresponding to the first model parameters indicates that the first model parameters do not meet the convergence condition, taking the adjusted vector values of the first feature vector and the adjusted vector values of the second feature vector as second model parameters of the text matching model in the second service scene, inputting the adjusted first feature vector and the adjusted second feature vector into the average pooling layer corresponding to the second model parameters, outputting a second pooling vector corresponding to the average pooling layer, and training the training sample pair according to the second pooling vector, the full connection layer and the classification layer to obtain a training classification result corresponding to the second model parameters.
Wherein, the device still includes:
the text entry module is used for acquiring third sample data which is entered by a target user through a target application corresponding to a second service scene;
the text screening module is used for screening fourth sample data with the same field label as the third sample data from a text library corresponding to the target application, and taking the third sample data and the fourth sample data as a prediction sample pair; the fourth sample data is text data in a text database corresponding to the keyword database;
the matching degree prediction module is used for inputting the prediction sample pair into a target text matching model and predicting the matching degree of third sample data and fourth sample data in the prediction sample pair;
and the matching text returning module is used for returning the matching text corresponding to the fourth sample data to the user terminal corresponding to the target user based on the matching degree.
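The four prediction-time modules above reduce to a simple flow: filter library texts sharing the query's domain label, score each pair with the trained model, and return the best matches. A sketch where `domain_of` and `match_model` are hypothetical callables standing in for the domain labeling and the trained target text matching model:

```python
def answer_query(query, text_library, domain_of, match_model, top_k=1):
    """Pair the entered text (third sample data) with same-domain library
    texts (fourth sample data), rank by predicted matching degree, and
    return the top matches."""
    same_domain = [t for t in text_library if domain_of(t) == domain_of(query)]
    ranked = sorted(same_domain, key=lambda t: match_model(query, t), reverse=True)
    return ranked[:top_k]
```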
An aspect of an embodiment of the present application provides a computer device, where the computer device includes: a processor, a memory, and a network interface;
the processor is coupled to the memory and the network interface, wherein the network interface is configured to provide data communication functionality, the memory is configured to store program code, and the processor is configured to invoke the program code to perform a method according to an aspect of an embodiment of the present application.
An aspect of the embodiments of the present application provides a computer storage medium storing a computer program, where the computer program includes program instructions that, when executed by a processor, perform a method according to an aspect of the embodiments of the present application.
When initial sample data is obtained, a first keyword in the initial sample data can be identified through a domain keyword in a keyword database, and candidate text data corresponding to a second keyword having an association relation with the first keyword can be obtained from the keyword database. Further, the embodiment of the application can determine the association degree between the initial sample data and the candidate text data, screen out the candidate text data whose association degree meets the sample screening condition, and take the screened candidate text data as the enhanced text data corresponding to the initial sample data. Further, a training sample pair having an association relation with the keyword database is determined according to the enhanced text data and the initial sample data; each sample data in the training sample pair carries a keyword identification corresponding to a domain keyword in the keyword database. Further, an initial text matching model used for capturing the keyword identification is trained based on the training sample pair, and the trained initial text matching model is determined as the target text matching model; the target text matching model is subsequently used for predicting the matching degree of the obtained prediction sample pair.
Therefore, when the initial sample data is obtained, the embodiment of the application can identify the keywords in the initial sample data through the domain keywords in the keyword database, and can then automatically screen and obtain candidate text data corresponding to a second keyword having an association relation with the first keyword based on the identified keyword. To improve the classification capability of the initial text matching model on predicted text pairs, the embodiment of the application can screen, from the candidate text data and based on the association degree between the initial sample data and the candidate text data, candidate text data that can cause strong interference with the initial sample data, and take the screened candidate text data as the enhanced text data corresponding to the initial sample data, so that the recognition capability of the initial text matching model for text pairs can be enhanced in the model training process. In addition, the initial matching model with the capability of capturing keyword identifications is introduced, so that the domain keywords carrying keyword identifications in the training text pairs can be effectively obtained, and the accuracy of text matching can be further improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application; for those skilled in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present application;
FIG. 2 is a system diagram of a text processing system according to an embodiment of the present application;
fig. 3 is a schematic flowchart of a text data processing method according to an embodiment of the present application;
fig. 4 is a schematic view of a scenario for determining candidate text data according to an embodiment of the present application;
FIG. 5 is a scene diagram of constructing a domain dictionary according to an embodiment of the present application;
FIG. 6 is a scene diagram of a first text matching model provided in an embodiment of the present application;
FIG. 7 is a scene diagram of a second text matching model provided in an embodiment of the present application;
fig. 8 is a schematic flowchart of a text data processing method according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a scenario in which a matching degree is predicted by a target text matching model according to an embodiment of the present application;
fig. 10 is a schematic structural diagram of a text data processing apparatus according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments of the present application. It is obvious that the described embodiments are only a part of the embodiments of the present application, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort shall fall within the protection scope of the present application.
Please refer to fig. 1, which is a schematic structural diagram of a network architecture according to an embodiment of the present application. As shown in fig. 1, the network architecture may be applied to a text processing system in service scenarios such as search, recommendation, and conversation. The text processing system may include a service server 2000 and a user terminal cluster, and the user terminal cluster may include a plurality of user terminals; as shown in fig. 1, it may specifically include a user terminal 3000a, a user terminal 3000b, a user terminal 3000c, …, and a user terminal 3000n. As shown in fig. 1, the user terminal 3000a, the user terminal 3000b, the user terminal 3000c, …, and the user terminal 3000n may each establish a network connection with the service server 2000, so that each user terminal can perform data interaction with the service server 2000 through the network connection.
As shown in fig. 1, each user terminal in the user terminal cluster may have a target application installed, and when the target application runs in a user terminal, it may perform data interaction with the service server 2000 shown in fig. 1. The target application may be understood as an application capable of loading and displaying electronic text data in the service scenario; for example, the target application may specifically include: in-vehicle applications, smart home applications (e.g., smart audio), text understanding applications, entertainment applications, multimedia applications, reading applications, and search applications, recommendation applications, etc. running in a browser. The electronic text data in the embodiment of the present application may include internet data information corresponding to the corresponding service application.
The text processing system in the embodiment of the present application may be used to perform text classification on two text data in a text pair. The text classification herein mainly refers to a process in which a computer device can automatically classify text data (e.g., text data a) input by a target user according to a certain category system through a corresponding text processing method. Alternatively, when the text classification herein belongs to a binary classification problem, it may be understood as performing text matching on two text data in a text pair. Text matching is understood here to mean the automated matching of a pair of texts < A, B > by a computer device by means of corresponding text processing methods. For example, whether < A, B > is similar or whether < A, B > constitutes < question, answer > or the like can be judged by the degree of matching between text data a and text data B. For example, in the question-answering system, the text data a may be a question text entered by a certain user through touch or the like, and the text data B may be an answer text in an answer database. For another example, in a search system, the text data a may be a search text entered by a user, and the text data B may be a content text to be matched with the search text, for example, a web page content text, a video description text, a picture description text, and the like. For another example, in the dialog system, the text data a may be description text data of a certain product (for example, a child accompanying robot, etc.) entered by the user by voice, etc., and the text data B may be content text (for example, link text) to be matched with the text data a, etc.
It is to be understood that the text processing methods herein relate to the natural language processing direction in the field of artificial intelligence. According to the embodiment of the present application, a first keyword in given sample data (which may also be called initial sample data) can be quickly identified through a keyword dictionary in a keyword database, and candidate text data corresponding to a second keyword having an association relation with the first keyword can be obtained from the keyword database. It is to be understood that the first keyword may be a domain keyword in the domain to which the initial sample data belongs, and the second keyword may be a domain keyword in the same domain that has an association relation with the first keyword. It should be understood that the candidate text data can subsequently be used to construct high-quality negative samples to train a text matching model in the text processing system in a corresponding service scenario, so that the trained text matching model can better distinguish negative samples with a certain degree of confusability, and the accuracy of text matching in the text processing system can thereby be guaranteed.
It is understood that Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate and extend human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision making. Artificial intelligence technology is a comprehensive discipline spanning a broad range of fields, covering both hardware-level and software-level technologies. Artificial intelligence software technology mainly includes computer vision technology, machine learning/deep learning directions, and the like.
Natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence; it studies various theories and methods that enable effective communication between humans and computers in natural language.
When the association degree between the initial sample data and each piece of candidate text data is obtained through calculation, the high-quality candidate text data capable of causing strong interference to the initial sample data can be screened out, and the screened candidate text data can be collectively referred to as enhanced text data corresponding to the initial sample data. It should be understood that one or more pieces of enhanced text data with strong interference may be screened out from the candidate text data, which is not limited herein. In view of this, the embodiment of the present application may use the given initial sample data and the screened enhanced text data as a training sample pair for training the text matching model, so that the trained text matching model improves its text classification capability. For convenience of understanding, the text matching model before training may be collectively referred to as the initial text matching model, and the text matching model after training may be collectively referred to as the target text matching model.
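The screening described above — keeping only candidate texts whose association with the initial sample is strong enough to make them confusable hard negatives — can be sketched as follows. This is a minimal illustration: the patent does not fix a particular association measure, so the Jaccard token-overlap score, the threshold value, and all names below are assumptions.

```python
def association_degree(sample, candidate):
    """Toy association score: Jaccard overlap of token sets.

    Any association measure could stand in here; Jaccard overlap
    is assumed purely for illustration."""
    a, b = set(sample.split()), set(candidate.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def screen_enhanced_text(initial_sample, candidates, threshold=0.3):
    """Keep candidates associated strongly enough with the initial sample
    to act as high-quality (strongly interfering) negative samples."""
    scored = [(c, association_degree(initial_sample, c)) for c in candidates]
    return sorted(
        [(c, s) for c, s in scored if s >= threshold],
        key=lambda pair: -pair[1],
    )

pairs = screen_enhanced_text(
    "latest basketball match highlights",
    ["basketball match schedule today", "stock market closing report"],
)
```

Each surviving candidate would then be paired with the initial sample to form one training sample pair.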
It should be understood that, in the process of text classification of two text data in a text pair by the target text matching model in the text processing system, not only the feature information of each single text data but also the interaction feature information between the two text data needs to be considered. In addition, by integrating the domain keywords in the keyword database into the initial text matching model, the capability of the initial text matching model to capture the domain keywords in training sample pairs carrying keyword identifiers can be improved in the model training stage, so that the target text matching model for predicting the matching degree of a sample pair can be obtained quickly and effectively, and the accuracy of text classification performed on text pairs by the trained model can be effectively improved.
For easy understanding, please refer to fig. 2, which is a system diagram of a text processing system according to an embodiment of the present application. The text processing system 100a shown in fig. 2 belongs to a completely new text matching framework, and the accuracy of text matching on a text pair can be effectively improved by the text processing system 100 a. It can be understood that the text processing system 100a may be applied to any one of the service scenarios corresponding to the target application, and in order to facilitate understanding, in the embodiment of the present application, a service scenario corresponding to a target application (for example, a reading application) is taken as an example of a search scenario to set forth a specific process for training the text matching model 60a shown in fig. 2 for the search scenario. The text processing system 100a shown in fig. 2 may include at least the following core modules: the system comprises a text data storage module, a keyword extraction module, a data enhancement module and a model improvement module.
The text data storage module may be configured to store text data corresponding to multiple service scenarios to obtain the text database 10a shown in fig. 2. The text data in the text database 10a may include search text data corresponding to the search scenario, and the search text data may include text data entered by a target user through an application display interface of the target application on the target user terminal (e.g., the user terminal 3000a). The text database 10a may be collectively referred to as an open domain for storing massive text data, where the text data in the open domain may include, but is not limited to, search texts whose sentence components are difficult to analyze strictly; for example, a certain type of search text data may be expressed as multiple search questions. Optionally, the text data in the open domain may further include matching texts that cannot provide an accurate answer; for example, a plurality of text data with a higher matching degree may be found for a certain piece of search text data. As shown in fig. 2, the text database 10a may further include the sample labeling area 20a shown in fig. 2, and the sample labeling area may include a plurality of sample data for training the text matching model 60a. For example, the plurality of sample data may specifically include the sample data 30a. For ease of understanding, the embodiment of the present application takes a sample data (e.g., sample data 30n) selected in the sample labeling area 20a as the given sample data (i.e., initial sample data) to describe the process by which the text processing system 100a obtains enhanced text data capable of causing strong interference to the initial sample data.
The keyword extraction module may be configured to construct a keyword system 40a shown in fig. 2, where a keyword database in the keyword system 40a may specifically include a keyword dictionary obtained by merging domain dictionaries of multiple business domains (domains for short).
The data enhancement module may be configured to configure high-quality negative sample data for the given sample data (i.e., the sample data 30n). In the embodiment of the present application, the negative sample data configured for the initial sample data and having strong interference may be collectively referred to as enhanced text data. The number of pieces of enhanced text data may be one or more, and is not limited herein. It is understood that the embodiment of the present application may also refer to the training text pair formed by the enhanced text data and the initial sample data as the training sample 50a shown in fig. 2. In other words, the training samples 50a in the embodiment of the present application may be specifically used for constructing a plurality of training sample pairs. For example, a training sample pair may be composed of one piece of initial sample data and one piece of enhanced text data.
As shown in fig. 2, in the embodiment of the present application, the keyword system 40a shown in fig. 2 may further act on the training sample 50a shown in fig. 2, so that a keyword identifier can be set in the training sample 50a for a domain keyword in each sample data, so as to obtain the training sample 50b shown in fig. 2 and carrying the keyword identifier. It can be understood that each sample data in the training sample 50b shown in fig. 2 carrying the keyword identifier carries the keyword identifier. For convenience of understanding, in the embodiments of the present application, the initial sample data carrying the keyword identifier may be collectively referred to as first sample data, and the enhanced text data carrying the keyword identifier may be collectively referred to as second sample data.
In other words, in the embodiment of the present application, the domain keywords in each sample data participating in training can be recognized by the keyword system 40a of fig. 2, and keyword identifiers can then be set for the recognized domain keywords, so as to obtain the training sample 50b of fig. 2 carrying the keyword identifiers. Thus, when the sample data carrying the keyword identifiers is sent to the text matching model 60a of fig. 2 for model training, the text matching model (e.g., text matching model 1) matching the search scenario can be intelligently selected through the model improvement module in the text matching model 60a. It can be understood that the key information module provided in text matching model 1 can be used to quickly capture the domain keywords in the first sample data and the second sample data. For example, in the embodiment of the present application, the domain keyword corresponding to a keyword identifier captured in the first sample data may be used as a first domain keyword, and the domain keyword corresponding to a keyword identifier captured in the second sample data may be used as a second domain keyword; the text matching model 60a shown in fig. 2 may then be trained with the first domain keyword, the second domain keyword, the first sample data, and the second sample data, so that when the prediction sample 70b (i.e., a prediction sample pair) shown in fig. 2 carrying keyword identifiers is obtained, the accuracy with which the trained text matching model performs text matching on the prediction sample pair can be improved.
Therefore, by adding the key information module in the text matching model 1, the field keywords with the keyword identifiers in the sample data can be rapidly distinguished and obtained, and further, in the process of training the text matching model 60a, the text matching model 60a can be ensured to be capable of learning the feature information of the field keywords in the same text pair in a targeted manner, so that the accuracy of text classification can be improved. In addition, in the embodiment of the present application, the keyword system 40a may also be applied to the prediction samples 70a shown in fig. 2 to add keyword identifiers to the sample pairs participating in prediction. It should be understood that, in the embodiment of the present application, by adding a keyword identifier to a prediction sample, third text data carrying the keyword identifier and fourth text data carrying the keyword identifier may be obtained, so that when the third text data and the fourth text data are used as a prediction sample pair (for example, a prediction sample pair a), it is equivalent to adding additional prior information to the trained text matching model, so that even if the trained text matching model never learns the domain keyword in the prediction sample pair a, the domain keyword in the third text data and the domain keyword in the fourth text data may be continuously and rapidly recognized through the set keyword identifier, and further, feature information represented by the domain keyword may be learned in a focused manner, so that a phenomenon of misclassification may be effectively avoided.
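Setting keyword identifiers on the domain keywords of a sample, so that the key information module can distinguish them, can be sketched as follows. The `<kw>...</kw>` marker format and all names here are assumed for illustration; the patent only requires that some identifier single out the domain keywords.

```python
DOMAIN_KEYWORDS = {"basketball", "championship"}  # example entries from a keyword database

def add_keyword_identifiers(text, keywords=DOMAIN_KEYWORDS):
    """Wrap each recognized domain keyword in an identifier marker.

    The <kw>...</kw> marker is an assumed illustration; any identifier
    that the model's key information module can detect would do."""
    return " ".join(
        f"<kw>{tok}</kw>" if tok in keywords else tok
        for tok in text.split()
    )

marked = add_keyword_identifiers("who won the basketball championship")
```

Applying the same marking to both texts of a prediction pair supplies the trained model with the prior information described above, even for domain keywords it never saw during training.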
The specific implementation manner of the text processing system acquiring the enhanced text data and training the initial text matching model may refer to the following embodiments corresponding to fig. 3 to 9.
For easy understanding, please refer to fig. 3, which is a flowchart illustrating a text data processing method according to an embodiment of the present application. The text data processing method shown in fig. 3 may be applied to the above-described text processing system, for example, the text processing system 100a shown in fig. 2. The text processing system may include a computer device running the text data processing apparatus, and the computer device may be configured to execute the text data processing method. The computer device may be a service server, or may be another device such as a terminal, which is not limited herein. The terminal may specifically include a mobile phone, a tablet computer, a notebook computer, or a Personal Computer (PC). As shown in fig. 3, the text data processing method may include at least the following steps S101 to S104.
Step S101, obtaining initial sample data, determining a first keyword in the initial sample data through a domain keyword in a keyword database, and obtaining candidate text data corresponding to a second keyword having an incidence relation with the first keyword;
specifically, the computer device running the text data processing apparatus may obtain initial sample data from a sample labeling area (which may also be referred to as a labeling interval), for example, one sample data may be selected from the sample labeling area as the initial sample data. Further, the computer device may obtain a first domain dictionary from the keyword database, and identify a domain keyword in the initial sample data based on the first domain dictionary; further, the computer device may take a domain keyword identified in the initial sample data as a first keyword; further, the computer device may obtain a target associated text containing a first keyword from associated texts contained in the keyword database, and take a domain keyword in the target associated text as a second keyword; further, the computer device may take the target associated text containing the second keyword as candidate text data corresponding to the second keyword having an association relationship with the first keyword.
When the computer device running the text data processing apparatus acquires initial sample data from a sample labeling area (which may also be referred to as a labeling interval), the computer device may further identify the domain to which the initial sample data belongs and determine that identified domain as the first domain. It may then acquire associated texts matched with the domain label of the first domain from a text database (i.e., the open domain), extract and collect domain keywords matched with the first domain from the large number of acquired associated texts, and construct a first domain dictionary corresponding to the first domain based on these domain keywords. It should be understood that, after the first domain dictionary corresponding to the first domain (i.e., the current domain) is constructed, second domain dictionaries corresponding to other domains (i.e., second domains) in the same business scenario may also be obtained, and the keyword database associated with the sample labeling area may finally be determined based on the domain dictionaries of these domains.
For easy understanding, please refer to fig. 4, which is a schematic view of a scenario for determining candidate text data according to an embodiment of the present application. The text database shown in fig. 4 may be used to store text data that needs to be subjected to text matching and is obtained from the network database, for example, articles or information that needs to be subjected to text matching. Optionally, the text database shown in fig. 4 may also be used to receive and store text data uploaded by a user, for example, electronic readings uploaded by the user, such as electronic books, electronic novels, and other electronic book data, may be received.
It should be understood that the labeled region 101a shown in fig. 4 may contain the sample data 201a shown in fig. 4, and may also contain other sample data (not shown in fig. 4). It is understood that each sample data located in the labeling area 101a carries real classification label information, where the classification label information refers to the domain label of the domain to which each sample data belongs, such as a sports news label, a financial news label, and the like. For convenience of understanding, in the embodiment of the present application, the labeled section 101a shown in fig. 4 is taken as an example of the sample labeling area 20a in the embodiment corresponding to fig. 2. The domain to which the sample data 201a in the labeled section 101a belongs may be the news domain, and in the embodiment of the present application, the sample data 201a shown in fig. 4 may be collectively referred to as the above initial sample data; that is, the initial sample data may be text data selected from the sample labeling area (for example, the sample labeling area 20a) that carries a corresponding domain tag.
It is understood that, when the computer device selects the sample data 201a from the labeled interval 101a shown in fig. 4 as the initial sample data, the domain to which the sample data 201a belongs may be identified as a news domain, and a domain tag, for example, a sports news tag, of the domain to which the sample data 201a belongs may be identified. At this time, the computer device may use the identified news domain as the first domain, and further may select an associated text matching a domain tag (e.g., a sports news tag) of the first domain from the text database shown in fig. 4, specifically, please refer to the multiple associated texts shown in fig. 4, where the multiple associated texts may specifically include an associated text 201b, an associated text 201c, and an associated text 201 d.
It is to be understood that the text database shown in fig. 4 may also include fields other than the news field (i.e., the first field); that is, the other fields in the text database (e.g., the conversation field, the service field, etc.) may be collectively referred to as second fields in the embodiments of the present application. In this way, the specific process by which the computer device filters the candidate text data shown in fig. 4 from the plurality of associated texts shown in fig. 4 can be described as follows: the computer device may determine the first keyword in the initial sample data by using the domain keywords in the constructed keyword database, and obtain, from the plurality of associated texts shown in fig. 4, the candidate text data corresponding to the second keyword having an association relationship with the first keyword.
It is to be understood that, before acquiring the candidate text data, the computer device may pre-construct the first domain dictionary corresponding to the first domain. For example, the computer device may screen the plurality of associated texts shown in fig. 4 from a text database storing a large amount of text data, and may further mine candidate words in the associated texts obtained from the text database to find domain keywords having a higher influence in the first domain. It can be understood that, when candidate words with a higher cross-correlation degree are preliminarily mined, the mined candidate words (i.e., the screened candidate words whose cross-correlation degree is greater than the cross-correlation threshold) may further be treated as character strings to be processed, and the influence degree of each character string to be processed may then be obtained by analyzing the frequency of documents in which the character string appears in the first domain (i.e., the own domain; e.g., frequency 1) and the frequency of documents in which it appears in the second domain (e.g., frequency 2). Therefore, when the influence degree of each character string to be processed is calculated, the role the character string plays in its own domain can be weighed, and the first domain dictionary corresponding to the first domain can be constructed. It can be understood that, in calculating the influence of a character string to be processed, the embodiment of the present application needs to consider not only the plurality of text data related to the first domain but also text data of other domains unrelated to the first domain, so as to ensure that high-quality domain keywords can be screened out.
For easy understanding, please refer to fig. 5, which is a schematic view of a scene for constructing a domain dictionary according to an embodiment of the present application. The associated text 201b, the associated text 201c,. and the associated text 201d shown in fig. 5 may be a plurality of associated texts in the embodiment corresponding to fig. 4. It is understood that each associated text may contain a plurality of characters, and embodiments of the present application may collectively refer to each character in the associated text as a word segmentation of the associated text. It is understood that, in order to sufficiently extract high-quality domain keywords (i.e., keyword information) from a large amount of associated texts, the embodiment of the present application may perform a word segmentation process on each associated text shown in fig. 5 to obtain a word segmentation set associated with the word segmentation of the associated text.
As shown in fig. 5, the computer device integrated with the text data processing apparatus may further combine a plurality of consecutive segmented words in each associated text to obtain the plurality of candidate words shown in fig. 5, where each candidate word is a combined word. These candidate words (i.e., combined words) may specifically include candidate word X1, candidate word X2, candidate word X3, candidate word X4, candidate word X5, candidate word X6, candidate word X7, …, and candidate word X8 shown in fig. 5. It should be understood that, when the candidate words are obtained by combination, the computer device may determine the cross-correlation degree between the segmented words in each candidate word, where the cross-correlation degree may be used to describe the collocation strength when the segmented words in the associated text are combined. Further, the computer device may screen, based on the cross-correlation degrees of the candidate words, the candidate words whose cross-correlation degree is greater than a cross-correlation threshold from the candidate words shown in fig. 5. As shown in fig. 5, in the embodiment of the present application, the candidate words so screened out may be collectively referred to as character strings to be processed, so as to obtain character string to be processed X1, character string to be processed X2, character string to be processed X3, character string to be processed X4, character string to be processed X5, character string to be processed X6, character string to be processed X7, and character string to be processed X8.
For convenience of understanding, in the embodiment of the present application, the associated text 201b shown in fig. 5 is taken as an example. If the associated text 201b is "what is Cai Xukun's basketball level like", each character in the associated text 201b may be split, and each split character (for example, "Cai", "Xu", "Kun", and so on) may be collectively referred to as a participle of the associated text 201b. It is understood that the participle set formed by the participles of the associated text 201b may be sub-participle set 1. By analogy, the computer device may also split the participles in the other associated texts (i.e., the associated texts 201c, …, and 201d) shown in fig. 5 to obtain sub-participle set 2 formed by the participles of the other associated texts. It should be understood that sub-participle set 1 corresponding to the associated text 201b and sub-participle set 2 corresponding to the other associated texts shown in fig. 5 may be collectively referred to as the participle set associated with the participles of the associated texts. For the specific way in which the computer device splits the participles of the other associated texts, reference may be made to the description of the associated text 201b, which will not be repeated here.
Further, in order to fully exploit the collocation strength between the participles in each participle set, the embodiment of the application may perform participle combination on each participle in the participle set. For example, in the above word segmentation set, a plurality of continuous word segmentation in each associated text shown in fig. 5 may be combined to obtain a candidate word shown in fig. 5. In other words, in the embodiment of the present application, each participle in the participle set may be combined (for example, participles in multiple consecutive positions may be combined in one associated text) to obtain a candidate word associated with the associated text.
For example, for the associated text 201b "what is Cai Xukun's basketball level like", the candidate words in sub-participle set 1 of the participle set may specifically include: "Cai Xu", "Cai Xukun", "how", etc. Further, the computer device may calculate the cross-correlation degree between the participles of each candidate word in sub-participle set 1. It is understood that the cross-correlation degree can be used to characterize the collocation strength between the participles in a candidate word. A larger calculated cross-correlation value indirectly reflects a greater collocation strength among the several consecutive participles that build the candidate word, and hence a higher possibility that those participles constitute a domain keyword in the first domain (i.e., the current domain). For example, if, among the cross-correlation degrees corresponding to the candidate words, the cross-correlation degree of the candidate word "Cai Xukun" is greater than the cross-correlation threshold in the keyword screening condition, "Cai Xukun" may be used as a character string to be processed screened from the candidate words in sub-participle set 1. By analogy, in the embodiment of the present application, character strings to be processed may also be screened from the candidate words corresponding to the other associated texts; the detailed process of screening character strings to be processed from the candidate words in sub-participle set 2 will not be described in detail here.
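The cross-correlation screening of adjacent participles can be sketched with pointwise mutual information (PMI), a standard collocation-strength measure. The patent does not name a specific formula, so PMI, the toy corpus, and all names below are illustrative assumptions.

```python
import math
from collections import Counter

def pmi_scores(segmented_texts):
    """Score each adjacent-participle pair by pointwise mutual information.

    A higher score means the two participles collocate more strongly.
    PMI is an assumed instantiation of the patent's 'cross-correlation
    degree'; the patent does not commit to a concrete formula."""
    unigrams, bigrams = Counter(), Counter()
    total = 0
    for segs in segmented_texts:
        unigrams.update(segs)
        total += len(segs)
        bigrams.update(zip(segs, segs[1:]))
    n_bi = sum(bigrams.values()) or 1
    scores = {}
    for (a, b), count in bigrams.items():
        p_ab = count / n_bi
        p_a, p_b = unigrams[a] / total, unigrams[b] / total
        scores[a + b] = math.log(p_ab / (p_a * p_b))
    return scores

def strings_to_process(segmented_texts, threshold):
    """Candidate words whose cross-correlation exceeds the threshold."""
    return {w for w, s in pmi_scores(segmented_texts).items() if s > threshold}

# Toy corpus of pre-segmented texts: 'a' and 'b' always co-occur.
corpus = [["a", "b", "c"], ["a", "b", "d"], ["c", "d"], ["d", "c"]]
scores = pmi_scores(corpus)
```

On this corpus the tightly collocated pair "ab" scores well above the loosely collocated "cd", so only "ab" survives a high threshold as a character string to be processed.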
It is understood that, after obtaining the character strings to be processed, the embodiment of the present application may add these character strings to the dictionary U1 shown in fig. 5. Then, with text data unrelated to the first domain introduced (for example, text data related to the second domain that exists in the text database shown in fig. 4 above), the frequency with which each character string to be processed appears in documents of the first domain and the frequency with which it appears in documents of the second domain can be counted, and the influence degree of each character string to be processed in the first domain can thereby be obtained. As shown in fig. 5, the influence degree of character string to be processed X1 is y1, the influence degree of character string to be processed X2 is y2, the influence degree of character string to be processed X3 is y3, …, and the influence degree of character string to be processed X8 is y8.
As shown in fig. 5, the embodiment of the present application may further determine whether the influence of the to-be-processed character strings reaches an influence threshold (e.g., 0.85) in the keyword screening condition, and if there is a to-be-processed character string with an influence greater than the influence threshold in the influence of the to-be-processed character strings, the to-be-processed character string with an influence greater than the influence threshold may be used as a domain keyword (e.g., the domain keyword X5 and the domain keyword X8 shown in fig. 5) matching a first domain (e.g., the above-mentioned news domain).
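The influence-degree filter can be illustrated with a minimal sketch, assuming the influence degree is the smoothed share of a candidate string's document frequency that falls in the first domain (the formula, function names, and smoothing term are assumptions; the patent does not fix a specific expression here):

```python
def influence_degree(df_first, df_second, smoothing=1.0):
    # Share of the candidate string's occurrences that fall in the first
    # domain; the smoothing term keeps very rare strings from scoring 1.0.
    return df_first / (df_first + df_second + smoothing)

def filter_domain_keywords(strings, df_in_first, df_in_second, threshold=0.85):
    # Keeps only strings whose influence degree exceeds the influence
    # threshold in the keyword screening condition (0.85 in the example).
    return [s for s in strings
            if influence_degree(df_in_first.get(s, 0), df_in_second.get(s, 0)) > threshold]
```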
Further, after the computer device finds the domain keyword with a higher influence degree from the large number of candidate words, the initial domain dictionary may be updated by using the domain keyword with a higher influence degree (i.e., the dictionary U1 shown in fig. 5 may be updated), so as to obtain the dictionary U2 shown in fig. 5, where the dictionary U2 may be the first domain dictionary corresponding to the first domain constructed in the embodiment of the present application.
Optionally, after the domain keywords with a higher influence degree are obtained, the domain keywords may be ranked according to influence degrees of the domain keywords in the first domain, and then the ranked domain keywords may be assembled to construct the first domain dictionary.
Further, it is understood that, after obtaining the domain dictionary of the first domain, the computer device may further obtain domain dictionaries of other domains that are not related to the first domain, and may collectively refer to the domain dictionaries of the other domains (e.g., the second domain) as the second domain dictionary, and may further reconstruct the keyword database capable of automatically identifying the domain keywords in the sample data based on the domain dictionaries of these domains (i.e., the first domain dictionary and the second domain dictionary). For a specific construction manner of the second domain dictionary, reference may be made to the above description of constructing the first domain dictionary, and details will not be further described here.
In view of this, after the keyword database is constructed, the computer device may further quickly identify the domain keywords in the sample data 201a (i.e., the initial sample data) shown in fig. 4 through the constructed keyword database, and may use the identified domain keywords as the first keywords of the initial sample data (for example, the domain keywords such as Cai Xukun and basketball level may be collectively referred to as the first keywords), and may further obtain the target associated texts including the first keywords from the plurality of associated texts shown in fig. 4, so as to collectively refer to the domain keywords in the obtained target associated texts as the second keywords. It is understood that, in the embodiment of the present application, the number of the second keywords may be greater than or equal to the number of the first keywords.
For example, in a question and answer scenario corresponding to an in-vehicle application, the initial sample data may be a question text A (for example, a driving route from the science park to the public security gym) entered by a user through a voice mode, and the candidate text data having an association relationship with the initial sample data found from the plurality of associated texts may contain the question text A1 that is the same as the initial sample data, and may also contain other question samples similar to the initial sample data (e.g., question sample A2, ..., question sample A3). It is to be understood that the question text A1, the question sample A2, ..., and the question sample A3 may each include the first keyword (e.g., the science park, the public security gym). Embodiments of the present application may collectively refer to the associated samples including the first keyword as the target associated texts determined from the plurality of associated texts, may refer to the domain keywords in the target associated texts (i.e., the question text A1, the question sample A2, ..., the question sample A3) as the second keywords, and may refer to the target associated texts including the second keywords as the candidate text data corresponding to the second keywords having an association relationship with the first keywords. As shown in fig. 4, the candidate text data corresponding to the second keywords having an association relationship with the first keywords may specifically include the associated texts 201c, ..., and 201d shown in fig. 4.
Step S102, determining the association degree between the initial sample data and the candidate text data, screening the candidate text data of which the association degree meets the sample screening condition from the candidate text data, and taking the screened candidate text data as the enhanced text data corresponding to the initial sample data;
specifically, the computer device running the text data processing apparatus may determine the association degree between the initial sample data and the candidate text data according to the coverage ratio between the first keywords in the initial sample data and the second keywords in the candidate text data; further, the computer device may rank the candidate text data by association degree to obtain the text data to be processed corresponding to the candidate text data; further, the computer device can screen, from the ranked text data to be processed, the text data to be processed whose association degree is greater than a first association threshold and less than a second association threshold; further, the computer device may use the screened text data to be processed as the enhanced text data corresponding to the initial sample data. It is understood that the first association threshold in the embodiment of the present application may be smaller than the second association threshold, and both the first association threshold and the second association threshold are thresholds in the sample screening condition. In other words, in order to improve the resolution capability of the text matching model, negative sample data that can cause strong interference to the initial sample data needs to be screened from the candidate text data, so that the following step S103 may be performed subsequently.
It should be understood that, in order to improve the resolving power of the text matching model, the embodiment of the present application proposes that data enhancement processing may be performed on given training sample data (i.e., the above initial sample data) to generate more training sample data. For example, the data enhancement module in the embodiment corresponding to fig. 2 performs data enhancement processing on the training sample data, so that the manual interaction time for manually selecting negative samples can be reduced, and high-quality negative sample data can be obtained quickly. From the perspective of text matching, the embodiments of the present application may collectively refer to the obtained "high-quality" negative sample data as enhanced text data; that is, the enhanced text data may refer to text data that is "similar to" the initial sample data but whose domain keywords differ from those of the initial sample data in some respects.
For example, taking the initial sample data as "Jay Chou's singing is very pleasant", after performing data enhancement processing on the initial sample data, the obtained enhanced text data may include: "Zhou Huajian's singing is very pleasant", "Jay Chou's singing is very unpleasant", etc. By comparison, it can be found that there are some commonalities between the initial sample data and the enhanced text data, i.e., the sentence patterns of the two text data in the text pair are similar, and they have the same domain keywords, such as "singing". The embodiment of the present application may collectively refer to the same domain keywords in the two text data as common keywords.
In other words, with the keyword database formed by the domain dictionaries of the plurality of domains, the computer device can quickly mark out the first keywords in the initial sample data and can also mark out the second keywords in the candidate text data. It is to be understood that, in the embodiment of the present application, all the domain keywords in the initial sample data may be collectively referred to as the first keywords (for example, the number of the first keywords may be 4), and the domain keywords in the candidate text data may be collectively referred to as the second keywords (for example, the number of the second keywords may be 6). It should be understood that, when the computer device calculates the association degree between the initial sample data and each candidate text data, the computer device may count the number of common keywords between the two text data; for example, if the number of common keywords is 3, the coverage ratio, calculated by the computer device, between the first keywords and the second keywords in the candidate text data may be 30%. It is understood that, in the embodiments of the present application, the calculated coverage ratios may be collectively referred to as the association degrees between the initial sample data and the candidate text data. For example, the association degree between the sample data 201a and the candidate text data 201c may be determined to be association degree 1 by calculating the coverage ratio between the first keywords of the sample data 201a and the second keywords of the candidate text data 201c. By analogy, by calculating the coverage ratio between the first keywords of the sample data 201a and the second keywords of other candidate text data (for example, the candidate text data 201e, which is not shown in fig. 4), it may be determined that the association degree between the sample data 201a and the candidate text data 201e is association degree 2.
For another example, the degree of association between the sample data 201a and the candidate text data 201d may be determined to be the degree of association 3 by calculating the coverage ratio between the first keyword of the sample data 201a and the second keyword of the candidate text data 201 d.
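The coverage-ratio computation in the example above (3 common keywords out of 4 first keywords plus 6 second keywords, giving 30%) can be sketched as follows; treating the ratio as common keywords over the combined keyword count is an assumption that matches the worked numbers:

```python
def association_degree(first_keywords, second_keywords):
    # Coverage ratio between the first keywords of the initial sample data
    # and the second keywords of one candidate text.
    common = set(first_keywords) & set(second_keywords)
    total = len(first_keywords) + len(second_keywords)
    return len(common) / total if total else 0.0
```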
Further, the computer device may rank the candidate text data by association degree, and may collectively refer to the candidate text data after ranking as the text data to be processed. For example, if association degree 2 is greater than association degree 1, and association degree 2 is less than association degree 3, that is, association degree 3 > association degree 2 > association degree 1, the queue order of the ranked text data to be processed may be: candidate text data 201d, candidate text data 201e, and candidate text data 201c. At this time, the computer device screens, from the text data to be processed, the text data to be processed whose association degree is greater than the first association threshold and less than the second association threshold as the enhanced text data, to further perform step S103 described below.
It can be understood that, in the process of determining the enhanced text data, an upper threshold and a lower threshold may be preset, where the upper threshold may be the second association threshold and the lower threshold may be the first association threshold; the first association threshold and the second association threshold may be used to reflect the required proportion of the common keywords of the two sample data among their total keywords. If the coverage ratio of two sample data (for example, the coverage ratio Q1) is greater than or equal to the second association threshold, it may indicate that the candidate text data corresponding to the coverage ratio Q1 (for example, the candidate text data 201d) is consistent with the initial sample data, and such candidate text data would act as a positive sample in the model training phase, so the candidate text data 201d needs to be discarded. On the contrary, if the coverage ratio of two sample data (for example, the coverage ratio Q2) is less than or equal to the first association threshold, it may indicate that the candidate text data corresponding to the coverage ratio Q2 (for example, the candidate text data 201c) is inconsistent with the initial sample data, and the candidate text data 201c would be a negative sample that is easy to distinguish, so the candidate text data 201c also needs to be discarded.
In view of this, it can be understood that, in the embodiment of the present application, the first association threshold and the second association threshold may be set according to a keyword policy to screen enhanced text data with a higher confusion degree from the plurality of candidate text data, that is, in the embodiment of the present application, candidate text data with similar keywords but different essential contents of the keywords may be screened according to the keyword policy to expand the training sample, so as to implement data enhancement on the initial sample data serving as the training sample. Therefore, when the enhanced text data are used as high-quality negative sample data in the model training stage, the resolution capability of the text matching model on the text pair with strong confusion can be greatly improved.
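The two-threshold screening of hard negatives described above can be sketched as follows (the threshold values and tuple layout are illustrative):

```python
def screen_enhanced_text(candidates, first_threshold, second_threshold):
    # candidates: list of (candidate_text, association_degree) pairs.
    # Ranks the candidates by association degree, then keeps only the
    # hard negatives whose degree lies strictly between the first (lower)
    # and second (upper) association thresholds of the screening condition.
    ranked = sorted(candidates, key=lambda pair: pair[1], reverse=True)
    return [text for text, degree in ranked
            if first_threshold < degree < second_threshold]
```

Candidates above the upper threshold behave as positives and candidates below the lower threshold are easy negatives; both are dropped, leaving only the strongly confusing samples.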
Step S103, determining a training sample pair having an incidence relation with the keyword database according to the enhanced text data and the initial sample data;
and each sample data in the training sample pair carries a keyword identifier corresponding to a domain keyword in the keyword database. It is understood that, after obtaining the enhanced text data, the embodiment of the present application may collectively refer to each enhanced text data together with the initial sample data as training samples (which may also be collectively referred to as sample data). In the embodiment of the application, after the keyword database is constructed, the keyword database may be applied to the training samples, and the domain keywords in the keyword database may further be fused into the text matching model to be trained (namely, the initial text matching model), so that the initial text matching model can quickly capture, in the model training stage, the domain keywords in each sample data of the training sample pair participating in training, and the captured domain keywords can further be given keyword identifiers in the corresponding sample data; for example, the domain keywords in each sample data can be highlighted. In addition, it can be understood that, by fusing the domain keywords into the initial text matching model, the learning ability of the model for the domain keywords in the sample data can be enhanced in the model training stage, and further, after the model training is completed, the classification ability of the trained text matching model for text pairs can be effectively improved.
Wherein, it should be understood that the training sample pair herein may include a first sample data and a second sample data; the first sample data can contain initial sample data carrying keyword identification; the second sample data may contain enhanced text data carrying keyword identifiers. It can be understood that by setting the keyword identifiers for each sample data in the training sample pair, the capturing capability of the initial text matching model for the field keywords carrying the keyword identifiers can be effectively improved, so that the dependence of the model on manual labeling data can be reduced, and the recognition capability of the model for the field keywords can be improved.
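Setting keyword identifiers can be illustrated as a dictionary lookup over the participles of a sample, assuming the keyword database behaves as a set of domain keywords (a hypothetical simplification of the domain dictionaries described above):

```python
def tag_keywords(participles, keyword_database):
    # Marks each participle with keyword identifier 1 if it is a domain
    # keyword in the keyword database, and 0 otherwise.
    return [(token, 1 if token in keyword_database else 0)
            for token in participles]
```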
Step S104, training an initial text matching model used for capturing the keyword identification based on the training sample pair, and determining the trained initial text matching model as a target text matching model;
specifically, the computer device can identify the domain keywords of the first sample data and the second sample data in the training sample pair by using the initial text matching model, and then can set a keyword identifier for the first sample data when a domain keyword exists in the first sample data, so that the domain keyword corresponding to the keyword identifier in the first sample data is used as the first domain keyword. Similarly, when recognizing that the second sample data has a domain keyword, the computer device may set a keyword identifier for the second sample data, so as to use the domain keyword corresponding to the keyword identifier in the second sample data as the second domain keyword; further, the computer device may obtain the first participle feature information of the first participles in the first sample data and the second participle feature information of the second participles in the second sample data; the first participles and the second participles may be the above-mentioned character strings to be processed (i.e., compound words) with a higher cross-correlation degree, which will not be limited herein. The first participles may include the first domain keyword, and the second participles may include the second domain keyword.
Further, the computer device may train the training sample pair based on the first segmentation feature information, the second segmentation feature information, the first domain keyword, the second domain keyword, and the initial text matching model to obtain a training classification result; further, the computer device may determine the trained initial text matching model as the target text matching model when detecting that the training classification result satisfies the classification convergence condition.
The target text matching model can be subsequently used for predicting the matching degree of the obtained prediction sample pair. It can be understood that the initial text matching model used in the embodiment of the present application may intelligently select a text matching model in a corresponding service scenario according to a difference of the service scenario.
For example, in the embodiment of the present application, scenes with large-scale online data (generally over one hundred thousand items) and a high running speed requirement (within 1 ms) may be collectively referred to as a first service scene, and the text matching models in the first service scene may be collectively referred to as first text matching models, where the first text matching model may be a model fused with the domain keywords, for example, a Bidirectional Encoder Representations from Transformers (BERT) model. The first text matching model may comprise a keyword attention layer, a fusion layer, and a classification layer. The specific process of training the first text matching model by the computer device can be described as follows: the computer device can input the first participle feature information of the first participles and the second participle feature information of the second domain keyword into the keyword attention layer, and output the first attention feature information corresponding to the keyword attention layer, where the first attention feature information can be used for characterizing the correlation between the second domain keyword and the first participles; further, the computer device may input the second participle feature information of the second participles and the first participle feature information of the first domain keyword into the keyword attention layer to output the second attention feature information corresponding to the keyword attention layer, where the second attention feature information can be used for characterizing the correlation between the first domain keyword and the second participles; further, the computer device may obtain the first semantic feature information of the first sample data and the second semantic feature information of the second sample data to perform semantic fusion on the first semantic feature information and the second semantic feature information to obtain the fused semantic feature information; further, the computer device may input the first attention feature information, the second attention feature information, and the fused feature information into the fusion layer, output the fusion feature vector corresponding to the training sample pair, and output the training classification result of the training sample pair through the classification layer.
For easy understanding, please refer to fig. 6, which is a schematic view of a first text matching model provided in an embodiment of the present application. The sample data A carrying a keyword identifier (i.e., keyword identifier 1) shown in fig. 6 may be the first sample data in the training sample pair, and similarly, the sample data B carrying a keyword identifier (i.e., keyword identifier 2) shown in fig. 6 may be the second sample data in the training sample pair. As shown in fig. 6, the first participles in the sample data A may include Token A1, Token A2, Token A3, Token A4, ..., and Token Ai shown in fig. 6, that is, the sample data A may include i first participles, and the value of i may be a positive integer. Similarly, as shown in fig. 6, the second participles in the sample data B may include Token B1, Token B2, Token B3, ..., and Token Bj shown in fig. 6, that is, the sample data B may include j second participles, where j may be a positive integer.
As shown in fig. 6, the computer device may input the first sample data and the second sample data as a training sample pair to the first text matching model shown in fig. 6 to train the first text matching model. The first text matching model shown in fig. 6 may include a feature extraction layer (not shown in fig. 6), a keyword attention layer, a fusion layer, and a classification layer. It can be understood that, since the above-mentioned domain keywords are fused into the first text matching model (i.e., the text matching model in the first service scenario), when the computer device gives the sample data A carrying the keyword identifier 1 and the sample data B carrying the keyword identifier 2 to the first text matching model shown in fig. 6, the first text matching model may collectively refer to the domain keyword corresponding to the keyword identifier 1 in the sample data A (i.e., A1 shown in fig. 6) as the first domain keyword, and refer to the domain keyword corresponding to the keyword identifier 2 in the sample data B (i.e., B1 shown in fig. 6) as the second domain keyword. It is understood that the first domain keyword may include, but is not limited to, A1 in the sample data A, and the second domain keyword may include, but is not limited to, B1 in the sample data B; the domain keywords in the two sample data are not exhaustively listed here.
For example, the first participle feature information of Token A1 may be represented as h(A1), ..., and the first participle feature information of Token Ai may be represented as h(Ai); the second participle feature information of Token B1 may be represented as h(B1), ..., and the second participle feature information of Token Bj may be represented as h(Bj). It should be understood that, in the embodiment of the present application, the feature extraction layer may extract the first participle feature information of each first participle and the second participle feature information of each second participle in the corresponding sample data.
Wherein, the keyword attention layer in the first text matching model may be used to enhance the cross-correlation information between the two sample data. For example, the computer device may input the first participle feature information of the first participles (i.e., h(A1), ..., h(Ai)) and the second participle feature information of the second domain keyword (i.e., Token B1 shown in fig. 6), namely h(B1), into the keyword attention layer shown in fig. 6 to obtain the first attention feature information for characterizing the correlation between the second domain keyword and each first participle in the first sample data, where the first attention feature information may be h(A) shown in fig. 6. Similarly, the computer device may further input the second participle feature information of the second participles (i.e., h(B1), ..., h(Bj)) and the first participle feature information of the first domain keyword (i.e., Token A1 shown in fig. 6), namely h(A1), into the keyword attention layer shown in fig. 6 to obtain the second attention feature information for characterizing the correlation between the first domain keyword and each second participle in the second sample data, where the second attention feature information may be h(B) shown in fig. 6.
It can be understood that the vector dimension of h(A) is consistent with the vector dimension of h(Ai), and h(A) can be used to characterize a sentence vector obtained after the second domain keyword in the second sample data is fused with the sentence of the first sample data. The vector dimension of h(B) is consistent with the vector dimension of h(Bj), and h(B) can be used for characterizing a sentence vector obtained after the first domain keyword in the first sample data is fused with the sentence of the second sample data.
As shown in fig. 6, the computer device inputs the first attention feature information (i.e., h(A) shown in fig. 6), the second attention feature information (i.e., h(B) shown in fig. 6), and the fusion feature information (i.e., h(CLS) shown in fig. 6) into the fusion layer in the first text matching model to output the fusion feature vector corresponding to the training sample pair (i.e., sample data A and sample data B shown in fig. 6), and may further output the matching degree between the sample data A and the sample data B through the classification layer in the first text matching model to obtain the training classification result of the training sample pair.
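The keyword attention layer can be sketched as dot-product attention between the other sample's domain-keyword feature and this sample's participle features, pooled into one sentence vector of the same dimension as the token features (the dot-product form is an assumption; the patent does not fix the attention function):

```python
import numpy as np

def keyword_attention(token_feats, keyword_feat):
    # token_feats: (num_tokens, dim) participle features of one sample,
    # e.g. h(A1)...h(Ai); keyword_feat: (dim,) feature of the other
    # sample's domain keyword, e.g. h(B1).
    scores = token_feats @ keyword_feat        # relevance of each token
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                   # softmax over tokens
    return weights @ token_feats               # (dim,) sentence vector like h(A)
```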
Optionally, in the embodiment of the present application, scenes with small-scale online data (generally less than one hundred thousand items) and a medium running speed (a delay of 1 ms to 10 ms) may also be collectively referred to as a second service scene, and the text matching models in the second service scene may also be collectively referred to as second text matching models, where the second text matching model may likewise be a model fused with the domain keywords; for example, a Fastpair model can be used for text matching of text pairs. The second text matching model may comprise a feature combination layer, an average pooling layer, a fully connected layer, and a classification layer. The first participles may contain M first sub-participles in addition to the first domain keyword; the second participles may contain N second sub-participles in addition to the second domain keyword; wherein M and N may both be positive integers. At this time, the specific process of the computer device training the second text matching model may be described as follows: the computer device can determine, in the first sample data, the first sub-position information of the M first sub-participles and the second sub-position information of the first domain keyword based on the first participle feature information, and can obtain, in the first sample data, the first autocorrelation feature information of the first auto-related words composed of the M first sub-participles and the first domain keyword according to the M pieces of first sub-position information and the second sub-position information; further, the computer device may determine, in the second sample data, the third sub-position information of the N second sub-participles and the fourth sub-position information of the second domain keyword based on the second participle feature information, and obtain, in the second sample data, the second autocorrelation feature information of the second auto-related words composed of the N second sub-participles and the second domain keyword according to the N pieces of third sub-position information and the fourth sub-position information; further, the computer device may obtain the interactive feature information corresponding to the cross-related words between the first sample data and the second sample data; further, the computer device may use the first autocorrelation feature information, the second autocorrelation feature information, and the interactive feature information as input features of the average pooling layer, output the pooling vectors corresponding to the average pooling layer, and train the training sample pair according to the pooling vectors, the fully connected layer, and the classification layer to obtain the training classification result.
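The average pooling layer, fully connected layer, and classification layer of the second text matching model can be sketched as follows (the two-class softmax head and all weight values are illustrative assumptions):

```python
import numpy as np

def second_model_head(feature_vectors, weight, bias):
    # feature_vectors: (num_features, dim) autocorrelation and interactive
    # feature vectors produced by the feature combination layer.
    pooled = feature_vectors.mean(axis=0)      # average pooling layer
    logits = weight @ pooled + bias            # fully connected layer
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()                     # classification layer (softmax)
```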
For easy understanding, please refer to fig. 7, which is a schematic view of a second text matching model provided in an embodiment of the present application. The sample data A shown in fig. 7 may be the first sample data, and the sample data B may be the second sample data. The first domain keyword in the first sample data may include Token A1 shown in fig. 7, and in the embodiment of the present application, the first participles other than the first domain keyword may be collectively referred to as the first sub-participles in the first sample data; the number of the first sub-participles may be M, where M is a positive integer. For example, as shown in fig. 7, the M first participles from Token A2 to Token Ai may be collectively referred to as the first sub-participles. Similarly, the second domain keyword in the second sample data may include Token B1 shown in fig. 7, and in the embodiment of the present application, the second participles other than the second domain keyword may be collectively referred to as the second sub-participles in the second sample data; the number of the second sub-participles may be N, where N is a positive integer. For example, as shown in fig. 7, the N second participles from Token B2 to Token Bj may be collectively referred to as the second sub-participles.
As shown in fig. 6 described above, the computer device may acquire the position feature, the mark feature, and the word feature of the first segmented words in the first sample data after extracting the first segmented feature information of the first segmented words in the first sample data. Therefore, the computer device in the embodiment of the present application may determine, based on the first segmentation characteristic information, first sub-location information of the M first sub-segments (for example, the first sub-location information of Token Ai may be the ith location of the sample data a shown in fig. 7) and second sub-location information of the first domain keyword (for example, the second sub-location information of Token a1 may be the first location of the sample data a shown in fig. 7). Similarly, after extracting the second segmentation feature information of the second segmentation in the second sample data, the computer device may acquire the position features, the mark features and the word features of the second segmentation in the second sample data. Therefore, the computer device may determine third sub-position information of the above-mentioned N second sub-participles (for example, the third sub-position information of Token Bj may be the jth position of the sample data B shown in fig. 7) and fourth sub-position information of the second domain keyword (for example, the fourth sub-position information of Token B1 may be the first position of the sample data B shown in fig. 7) based on the second participle feature information. It can be understood that, in the embodiment of the present application, by determining the position information of each participle in the corresponding sample data, when performing participle combination, the participles having the same text content and having different sources can be distinguished, so as to ensure the diversification of the feature information in the model training stage.
In the second service scenario, it can be understood that, for the first sample data, the computer device may perform participle combination on the M first sub-participles and the first domain keyword according to the M pieces of first sub-position information and the second sub-position information, and may use the combined words obtained after the combination as the first auto-related words of the first sample data. For example, if the first sample data is "I love eating apples", the first participles corresponding to the first sample data may be Token A1 ("me"), Token A2 ("love"), Token A3 ("eat"), and Token A4 ("apple"). Here, "apple" may be the first domain keyword in the first sample data, and "me", "love", and "eat" may be the 3 (i.e., M = 3) first sub-participles in the first sample data. At this time, the computer device may perform participle combination on the 3 first sub-participles and the first domain keyword based on the position information of the first participles in the first sample data, and may thereby obtain the following first auto-related words: A1 = "me", A1A2 = "me, love", A2A3 = "love, eat", and A3A4 = "eat, apple".
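The combination just described can be sketched as follows (a toy illustration; the function name and tuple representation are assumptions, producing only the combinations listed in the example above, namely the first participle on its own plus every adjacent pair):

```python
def auto_related_words(tokens):
    # The first participle on its own, then every adjacent pair of
    # participles, mirroring A1, A1A2, A2A3, A3A4 in the example.
    combos = [(tokens[0],)]
    combos += list(zip(tokens, tokens[1:]))
    return combos

first_auto = auto_related_words(["me", "love", "eat", "apple"])
# e.g. [('me',), ('me', 'love'), ('love', 'eat'), ('eat', 'apple')]
```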
Similarly, the computer device may perform participle combination on the N second sub-participles and the second domain keyword in the second sample data according to the N pieces of third sub-position information and the fourth sub-position information, and use the combined words obtained after the combination as the second auto-related words of the second sample data. For example, the second sample data may be "I like eating pears", and the second participles corresponding to the second sample data may be Token B1 ("me"), Token B2 ("like"), Token B3 ("eat"), and Token B4 ("pear"). Here, "pear" may be the second domain keyword in the second sample data, and "me", "like", and "eat" may be the 3 (i.e., N = 3) second sub-participles in the second sample data. At this time, the computer device may perform participle combination on the 3 second sub-participles and the second domain keyword based on the position information of the second participles in the second sample data, and may thereby obtain the following second auto-related words: B1 = "me", B1B2 = "me, like", B2B3 = "like, eat", and B3B4 = "eat, pear".
In order to capture the cross-correlation between the first sample data and the second sample data, in the embodiment of the present application, participle combination may further be performed between the first participles (i.e., the M first sub-participles and the first domain keyword) and the second participles (i.e., the N second sub-participles and the second domain keyword), and the combined words obtained after the combination may be used as the cross-related words between the first sample data and the second sample data. For example, taking the first sample data as "I love eating apples" and the second sample data as "I like eating pears", the cross-related words obtained after interactive combination may include: A1B1 = "me, me", A1B2 = "me, like", A1B3 = "me, eat", A1B4 = "me, pear", A2B1 = "love, me", A2B2 = "love, like", A2B3 = "love, eat", A2B4 = "love, pear", A3B1 = "eat, me", A3B2 = "eat, like", A3B3 = "eat, eat", A3B4 = "eat, pear", A4B1 = "apple, me", A4B2 = "apple, like", A4B3 = "apple, eat", and A4B4 = "apple, pear".
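This interactive combination amounts to a Cartesian product of the two participle sequences; a minimal sketch (function name illustrative):

```python
from itertools import product

def cross_related_words(tokens_a, tokens_b):
    # Every pairing of a first participle with a second participle,
    # i.e. A1B1 through A4B4 in the running example.
    return list(product(tokens_a, tokens_b))

pairs = cross_related_words(["me", "love", "eat", "apple"],
                            ["me", "like", "eat", "pear"])
# 4 x 4 = 16 cross-related words
```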
It is to be understood that, when obtaining the cross-related words, the computer device may further identify, through the second text matching model, whether any cross-related word has the same content as a first auto-related word while having a different source. If such a cross-related word exists, feature identification may be performed on it among the cross-related words to obtain a first identified participle. For example, the cross-related word A2B3 = "love, eat" has the same content as the first auto-related word A2A3 = "love, eat". At this point, the computer device may set an identification mark (e.g., "#") for A2B3 = "love, eat" to get the first identified participle (e.g., #A2B3). The interaction feature information corresponding to the first identified participle may be collectively referred to as the feature information 80c shown in fig. 7. It can be understood that, when the computer device subsequently performs vector encoding on the combined feature information, even though A2B3 and A2A3 have the same content, the feature vectors corresponding to the two participles will not be the same, because the "#" mark makes the interaction feature information corresponding to A2B3 different from the first auto-correlation feature information corresponding to A2A3. For example, the feature vector of A2B3 may be word vector A, and the feature vector of A2A3 may be word vector B. Similarly, if the second text matching model shown in fig. 7 identifies that there is a cross-related word (e.g., A1B2 = "me, like") having the same content as a second auto-related word (e.g., B1B2 = "me, like") among the cross-related words, feature identification may be performed on that cross-related word among the cross-related words, resulting in a second identified participle (e.g., #A1B2); it can be understood that the interaction feature information corresponding to the second identified participle is different from the second auto-correlation feature information of the second auto-related word corresponding to the second identified participle.
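The "#" flagging of source-ambiguous combined words can be sketched as follows (the flag token and function name are illustrative assumptions):

```python
def mark_duplicate_cross_words(cross_words, auto_words):
    # Prefix a '#' flag onto any cross-related word whose content collides
    # with an auto-related word, so the two sources stay distinguishable
    # when the combined words are later vector-encoded.
    auto_set = set(auto_words)
    return [("#",) + w if w in auto_set else w for w in cross_words]

marked = mark_duplicate_cross_words(
    [("love", "eat"), ("me", "pear")],   # cross-related words
    [("me",), ("love", "eat")])          # auto-related words
# ('love', 'eat') collides with A2A3, so it becomes ('#', 'love', 'eat')
```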
It should be understood that the first auto-related words, second auto-related words, and cross-related words obtained through participle combination may be collectively referred to as combined words in the embodiments of the present application. Among these combined words, the first auto-correlation feature information corresponding to the first auto-related words may be the feature information 80a shown in fig. 7, the second auto-correlation feature information corresponding to the second auto-related words may be the feature information 80b shown in fig. 7, and the cross-correlation feature information corresponding to the cross-related words may be the feature information 80c shown in fig. 7. For convenience of understanding, the feature information 80a, the feature information 80b, and the feature information 80c shown in fig. 7 may be collectively referred to as the feature information input to the initial text matching model. Since the domain keywords are merged into the second text matching model, in order to enable the second text matching model to sufficiently learn the differences of the domain keywords in the corresponding sample data, the feature information shown in fig. 7 may be divided according to the domain keywords in the embodiment of the present application, so that the first combined feature information and the second combined feature information shown in fig. 7 may be obtained.
The first combined feature information may be the feature information corresponding to the combined words that contain the first domain keyword or the second domain keyword, screened out by the computer device from the first auto-related words corresponding to the first auto-correlation feature information (i.e., feature information 80a), the second auto-related words corresponding to the second auto-correlation feature information (i.e., feature information 80b), and the cross-related words corresponding to the cross-correlation feature information (i.e., feature information 80c). It is to be understood that the embodiment of the present application may determine the feature information shown in fig. 7 other than the first combined feature information as the second combined feature information. That is, the combined words among the first auto-related words, the second auto-related words, and the cross-related words that contain neither the first domain keyword nor the second domain keyword may be used as second-class combined words, so that the feature information corresponding to the second-class combined words may be collectively referred to as the second combined feature information shown in fig. 7.
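This division can be sketched as a filter on domain-keyword membership (names illustrative; a combined word is kept in the first group if any of its participles is a domain keyword):

```python
def split_by_domain_keyword(combined_words, domain_keywords):
    # Divide combined words into those containing a domain keyword (-> first
    # combined feature information) and the rest (-> second combined feature
    # information), following the division described above.
    kw = set(domain_keywords)
    first = [w for w in combined_words if kw & set(w)]
    second = [w for w in combined_words if not kw & set(w)]
    return first, second

first_group, second_group = split_by_domain_keyword(
    [("eat", "apple"), ("me", "love"), ("apple", "pear")],
    {"apple", "pear"})
```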
As shown in fig. 7, the computer device may further obtain a first feature vector corresponding to the first combined feature information and a second feature vector corresponding to the second combined feature information by means of vector encoding. As shown in fig. 7, the first feature vector may include the word vectors of the combined words related to the domain keywords, and the second feature vector may include the word vectors of the combined words unrelated to the domain keywords. It can be understood that, in the model training stage, the embodiment of the present application may adjust the vector values in the second feature vector, use the vector values of the adjusted second feature vector as a first fixed value, and then use the vector values of the first feature vector together with the first fixed value as first model parameters of the text matching model in the second service scenario. Then, the computer device may input the first feature vector and the adjusted second feature vector into the average pooling layer corresponding to the first model parameters shown in fig. 7 to output a first pooling vector, and may train the training sample pair according to the first pooling vector, the fully connected layer, and the classification layer to obtain a training classification result corresponding to the first model parameters.
As shown in fig. 7, when the training classification result corresponding to the first model parameters indicates that the first model parameters do not satisfy the convergence condition, the computer device may further notify the second text matching model shown in fig. 7 to adjust the model parameters; for example, the vector values in the first feature vector may be further adjusted, the vector values of the adjusted first feature vector may be used as a second fixed value, and the second fixed value together with the first fixed value may then be used as second model parameters of the second text matching model. Then, the computer device may input the adjusted first feature vector and the adjusted second feature vector into the average pooling layer corresponding to the second model parameters to output a second pooling vector, and may further train the training sample pair according to the second pooling vector, the fully connected layer, and the classification layer to obtain a training classification result corresponding to the second model parameters.
Similarly, if the training classification result corresponding to the second model parameters indicates that the convergence condition is still not satisfied, the second text matching model shown in fig. 7 may again be notified to adjust the model parameters; that is, the vector values in the adjusted first feature vector may be further adjusted to obtain a new second fixed value. In this way, the second text matching model may be alternately notified to adjust the vector values of the adjusted second feature vector shown in fig. 7 to obtain a new first fixed value, until the number of times the computer device counts the vector values of the first feature vector being adjusted reaches the adjustment threshold.
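The alternation can be illustrated with a toy squared-error objective; the loss, learning rate, and update rule below are assumptions made only to show the fix-one-vector/adjust-the-other scheme, not the patent's actual training procedure:

```python
def alternating_train(v1, v2, target, swaps=4, lr=0.1):
    # Toy version of the alternation described above: hold one feature
    # vector fixed (the current 'fixed value') and gradient-step the other
    # on a squared-error loss over their sum, then swap roles, for a fixed
    # number of swaps (the 'adjustment threshold').
    for i in range(swaps):
        grad = [2 * (a + b - t) for a, b, t in zip(v1, v2, target)]
        if i % 2 == 0:
            v1 = [a - lr * g for a, g in zip(v1, grad)]  # v2 is fixed
        else:
            v2 = [b - lr * g for b, g in zip(v2, grad)]  # v1 is fixed
    return v1, v2

v1, v2 = alternating_train([0.0] * 3, [0.0] * 3, [1.0] * 3)
```

After a few swaps the summed vectors move toward the target, which is the behaviour the alternating fixed-value scheme relies on.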
It can be understood that, in the embodiment of the present application, by adjusting the model parameters of the second text matching model, the second text matching model can learn the vector values in the feature vectors, and then when the training classification result of the second text matching model meets the above classification convergence condition, the trained second text matching model (i.e., the initial text matching model) can be determined as the target text matching model, so that the matching of the obtained text pair can be predicted quickly and accurately through the target text matching model in the subsequent model prediction stage.
When initial sample data is obtained, the keywords in the initial sample data can be identified through the domain keywords in the keyword database, and candidate text data corresponding to second keywords having an incidence relation with the first keywords can then be automatically screened out based on the identified keywords. In order to improve the classification capability of the initial text matching model on predicted text pairs, candidate text data that can cause strong interference to the initial sample data can be screened from the candidate text data based on the degree of incidence between the initial sample data and the candidate text data, so that the screened candidate text data can be used as the enhanced text data corresponding to the initial sample data; the recognition capability of the initial text matching model on text pairs can thereby be enhanced in the model training process. In addition, by introducing the initial text matching model with the keyword identification capturing capability, the domain keywords carrying the keyword identifications in the training text pairs can be effectively obtained, and the accuracy of text matching can be further improved.
Further, please refer to fig. 8, which is a flowchart illustrating a text data processing method according to an embodiment of the present application. The text data processing method may be applied to the text processing system in the embodiment corresponding to fig. 1, a computer device in the text processing system may be configured to execute the text data processing method, where the computer device may be the service server 2000 in the embodiment corresponding to fig. 1, and the text data processing method may include the following steps S201 to S215.
Step S201, obtaining initial sample data;
the initial sample data is text data in a sample labeling area, and the sample labeling area is an area in a text database which has an incidence relation with the initial sample data.
Step S202, determining the field to which the initial sample data belongs as a first field in a sample labeling area, and acquiring an associated text matched with a field label of the first field from a text database;
wherein the text database comprises a second domain other than the first domain. The associated text in the text database (i.e., the open domain) may specifically include: articles or information to be classified acquired from a network database, and may also include text data uploaded by a user terminal having a network connection relationship with the computer device, such as electronic books, video file descriptions, audio file descriptions, and the like.
Step S203, based on the keyword screening condition associated with the text database, screening and determining a domain keyword matched with the first domain from candidate words formed by word segmentation of the associated text, and constructing a first domain dictionary corresponding to the first domain based on the domain keyword matched with the first domain;
step S204, a second domain dictionary corresponding to the second domain is obtained, and the keyword database related to the sample labeling area is determined based on the first domain dictionary and the second domain dictionary.
Step S205, determining a first keyword in the initial sample data through a domain keyword in the keyword database, and acquiring candidate text data corresponding to a second keyword having an association relationship with the first keyword.
Step S206, determining the association degree between the initial sample data and the candidate text data, screening the candidate text data of which the association degree meets the sample screening condition from the candidate text data, and taking the screened candidate text data as the enhanced text data corresponding to the initial sample data;
step S207, determining a training sample pair having an incidence relation with the keyword database according to the enhanced text data and the initial sample data;
each sample data in the training sample pair carries a keyword identification corresponding to a domain keyword in a keyword database; at this time, the training sample pair in the embodiment of the present application may include the first sample data and the second sample data; the first sample data comprises initial sample data carrying keyword identification; the second sample data comprises enhanced text data carrying keyword identification;
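Attaching the keyword identification to the sample text can be sketched as follows (the "#" prefix mirrors the marker used elsewhere in this description; the function and data shapes are illustrative):

```python
def tag_keywords(tokens, keyword_database):
    # Attach a keyword identification (here a '#' prefix) to every
    # participle found among the domain keywords of the keyword database,
    # producing sample data that 'carries keyword identifications'.
    return ["#" + t if t in keyword_database else t for t in tokens]

first_sample = tag_keywords(["me", "love", "eat", "apple"], {"apple", "pear"})
second_sample = tag_keywords(["me", "like", "eat", "pear"], {"apple", "pear"})
```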
step S208, using the initial text matching model to take the domain keyword corresponding to the keyword identification in the first sample data as the first domain keyword, and take the domain keyword corresponding to the keyword identification in the second sample data as the second domain keyword.
Step S209, acquiring first segmentation characteristic information of a first segmentation in the first sample data and second segmentation characteristic information of a second segmentation in the second sample data;
step S210, training a training sample pair based on the first segmentation feature information, the second segmentation feature information, the first domain keywords, the second domain keywords and the initial text matching model to obtain a training classification result;
it should be understood that the initial text matching model in the embodiment of the present application may include text matching models in a plurality of service scenarios. For example, in the embodiment of the present application, a scenario with large-scale online data (generally more than one hundred thousand items) and a tight latency budget (within 1 ms) may be referred to as the first service scenario, and text matching models in the first service scenario may be collectively referred to as the first text matching model; for example, the first text matching model may be a BERT model with the keyword identification capturing capability. For the specific process by which the computer device trains the first text matching model, reference may be made to the description of the first text matching model in the embodiment corresponding to fig. 6, which will not be repeated here.
For another example, in the embodiment of the present application, a scenario with small-scale online data (generally less than one hundred thousand items) and a medium latency budget (1 ms to 10 ms) may be referred to as the second service scenario, and text matching models in the second service scenario may be collectively referred to as the second text matching model; for example, the second text matching model may be a fast-pair model with the keyword identification capturing capability. Similarly, for the specific process by which the computer device trains the second text matching model, reference may be made to the description of the second text matching model in the embodiment corresponding to fig. 7, which will not be repeated here.
Therefore, according to the embodiment of the application, the corresponding text matching model can be intelligently selected to perform model training based on different service scenes, and the accuracy of text matching can be improved through the trained text matching model in different service scenes.
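The scenario-based selection can be sketched as a simple dispatch on the rough figures quoted above (the cut-offs and return labels are illustrative, not prescribed by this description):

```python
def select_text_matching_model(n_samples, latency_budget_ms):
    # Large-scale data with a tight (<= 1 ms) latency budget goes to the
    # first (BERT-style) text matching model; smaller-scale data with a
    # looser 1-10 ms budget goes to the lighter second model.
    if n_samples > 100_000 and latency_budget_ms <= 1:
        return "first_text_matching_model"
    return "second_text_matching_model"
```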
And step S211, when the training classification result is detected to meet the classification convergence condition, determining the trained initial text matching model as a target text matching model.
Step S212, acquiring third sample data input by a target user through a target application corresponding to a second service scene;
step S213, screening fourth sample data with the same field label as the third sample data from the text library corresponding to the target application, and taking the third sample data and the fourth sample data as a prediction sample pair;
the fourth sample data is text data in a text database corresponding to the keyword database;
step S214, inputting the prediction sample pair into a target text matching model, and predicting to obtain the matching degree of third sample data and fourth sample data in the prediction sample pair;
step S215, returning the answer text corresponding to the fourth sample data to the user terminal corresponding to the target user based on the matching degree.
For easy understanding, please refer to fig. 9, which is a schematic view of a scene for predicting a matching degree through a target text matching model according to an embodiment of the present application. For convenience of understanding, the embodiment of the present application takes the service scenario as a search service. For example, the user terminal shown in fig. 9 may be the user terminal 3000a in the embodiment corresponding to fig. 1, and the user terminal may run the target application, where the target application may be a search engine (e.g., a QQ browser) for performing text search. As shown in fig. 9, the target user may enter the third sample data shown in fig. 9 in the user terminal running the target application; the third sample data may be a question text such as "how does a novice play ranked matches in Honor of Kings". When the target application runs on the user terminal, a data interaction relationship may exist between the user terminal and the service server (which may be the computer device) shown in fig. 9. At this time, when receiving the third sample data sent by the user terminal, the service server may further screen, from a text database (e.g., a question database) corresponding to the search engine, fourth sample data having the same domain tag (e.g., a game tag) as the third sample data. It is to be understood that the number of pieces of screened fourth sample data may be one or more, which is not limited here. As shown in fig. 9, the service server may input the third sample data and the fourth sample data together, as a prediction sample pair, to the target text matching model shown in fig. 9, so as to output a matching degree between the third sample data and each piece of fourth sample data.
Further, as shown in fig. 9, the service server (i.e., the computer device) may obtain the maximum value among the matching degrees output by the target text matching model, and may then obtain, from the answer database shown in fig. 9, the answer text corresponding to the fourth sample data with the highest matching degree. It is understood that, in the above-mentioned open domain, one question text may have a plurality of answer texts; therefore, as shown in fig. 9, the service server may return a plurality of answer texts to the user terminal. The answer texts may be the answer texts 90a, 90b, and 90c shown in fig. 9; for example, the click rate of the answer text 90a may be 9.0, the click rate of the answer text 90b may be 8.9, and the click rate of the answer text 90c may be 8.8. As shown in fig. 9, when obtaining the answer texts, the user terminal may display the answer texts in turn on the text display interface shown in fig. 9 in descending order of click rate. It is to be understood that, in the embodiments of the present application, the question database and the answer database may be collectively referred to as the text database.
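The prediction-and-answer flow of this example can be sketched as follows (the data structures and names are illustrative, not the patent's own interfaces):

```python
def answer_for_query(match_scores, answer_db, click_rates):
    # Pick the candidate question (fourth sample data) with the highest
    # predicted matching degree, then order its answers by click rate for
    # display, as in the search example above.
    best_question = max(match_scores, key=match_scores.get)
    answers = answer_db.get(best_question, [])
    return sorted(answers, key=lambda a: click_rates.get(a, 0.0), reverse=True)

scores = {"q1": 0.92, "q2": 0.41}
db = {"q1": ["answer_90a", "answer_90b", "answer_90c"]}
rates = {"answer_90a": 9.0, "answer_90b": 8.9, "answer_90c": 8.8}
ranked = answer_for_query(scores, db, rates)
```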
By analogy, the target text matching model in the embodiment of the application can also be applied to corresponding products in other service scenarios (for example, question and answer scenarios), for example, the target text matching model can be applied to products such as a vehicle-mounted voice system, an intelligent sound box, an intelligent customer service, a child accompanying robot, and intelligent question and answer software. It should be understood that when the question is answered by the target text matching model integrated with the domain keywords, the domain keywords in the prediction samples (i.e., the third sample data and the fourth sample data) participating in prediction can be accurately captured, and then the feature information carrying the domain keywords can be intensively learned to find the fourth sample data with the highest matching degree with the third sample data, so that the answer text with higher accuracy can be obtained.
Further, please refer to fig. 10, which is a schematic structural diagram of a text data processing apparatus according to an embodiment of the present application. The text data processing apparatus 1 may be applied to a computer device in the text processing system, and the computer device may be the above-mentioned user terminal or service server. The text data processing apparatus 1 may include: a keyword recognition module 10, an association degree determining module 20, a training pair determining module 30, and a target model determining module 40; optionally, the text data processing apparatus 1 may further include: an associated text acquisition module 50, a domain dictionary construction module 60, a keyword library determination module 70, a text entry module 80, a text screening module 90, a matching degree prediction module 100, and a matching text return module 110;
the keyword identification module 10 is configured to obtain initial sample data, determine a first keyword in the initial sample data through a domain keyword in a keyword database, and obtain candidate text data corresponding to a second keyword having an association relationship with the first keyword;
the keyword recognition module 10 includes: a keyword recognition unit 101, a keyword identification unit 102, a target text acquisition unit 103, and a candidate text determination unit 104;
the keyword recognition unit 101 is configured to acquire initial sample data from the sample labeling area, acquire a first domain dictionary from the keyword database, and recognize a domain keyword in the initial sample data based on the first domain dictionary;
a keyword identification unit 102 configured to take a domain keyword identified in the initial sample data as a first keyword;
a target text acquisition unit 103 configured to acquire a target associated text including a first keyword from associated texts included in the keyword database, and use a domain keyword in the target associated text as a second keyword;
and the candidate text determining unit 104 is used for taking the target associated text containing the second keyword as candidate text data corresponding to the second keyword having an association relation with the first keyword.
For specific implementation manners of the keyword recognition unit 101, the keyword identification unit 102, the target text acquisition unit 103, and the candidate text determination unit 104, reference may be made to the description of step S101 in the embodiment corresponding to fig. 3, which will not be further described here.
The association degree determining module 20 is configured to determine an association degree between the initial sample data and the candidate text data, screen candidate text data of which the association degree meets a sample screening condition from the candidate text data, and use the screened candidate text data as enhanced text data corresponding to the initial sample data;
the association degree determining module 20 includes: a relevancy determining unit 201, a relevancy sorting unit 202, a text screening unit 203 to be processed and an enhanced text determining unit 204;
a relevancy determining unit 201, configured to determine a relevancy between the initial sample data and the candidate text data according to a coverage ratio between a first keyword in the initial sample data and a second keyword in the candidate text data;
the relevancy sorting unit 202 is configured to sort the relevancy in the candidate text data to obtain to-be-processed text data corresponding to the candidate text data;
a to-be-processed text screening unit 203, configured to screen, from the sorted to-be-processed text data, to-be-processed text data whose association degree is greater than a first association threshold and smaller than a second association threshold;
an enhanced text determining unit 204, configured to use the screened to-be-processed text data as the enhanced text data corresponding to the initial sample data; the first association threshold is less than the second association threshold, and the first association threshold and the second association threshold are both thresholds in the sample screening condition.
For specific implementation manners of the association determining unit 201, the association sorting unit 202, the to-be-processed text screening unit 203, and the enhanced text determining unit 204, reference may be made to the description of the enhanced text data in the embodiment corresponding to fig. 3, and details will not be further described here.
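A minimal sketch of units 201 to 204, assuming the coverage ratio is intersection-over-first-keywords and using illustrative threshold values (the patent does not fix either):

```python
def association_degree(first_keywords, second_keywords):
    # Coverage ratio between the first keywords and the second keywords --
    # one plausible reading of the 'coverage ratio' named above.
    if not first_keywords:
        return 0.0
    overlap = set(first_keywords) & set(second_keywords)
    return len(overlap) / len(set(first_keywords))

def screen_enhanced_text(sample_keywords, candidates, lo=0.2, hi=0.8):
    # Sort candidates by association degree, then keep those whose degree
    # falls strictly between the first and second association thresholds:
    # similar enough to interfere, but not near-duplicates.
    scored = sorted(((association_degree(sample_keywords, kws), text)
                     for text, kws in candidates), reverse=True)
    return [text for degree, text in scored if lo < degree < hi]

candidates = [("t1", ["a", "b"]), ("t2", ["a", "x"]), ("t3", ["x", "y"])]
enhanced = screen_enhanced_text(["a", "b"], candidates)
```

With these thresholds, "t1" (degree 1.0, a near-duplicate) and "t3" (degree 0.0, unrelated) are dropped, leaving only the strongly interfering "t2".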
A training pair determining module 30, configured to determine, according to the enhanced text data and the initial sample data, a training sample pair having an association relationship with the keyword database; each sample data in the training sample pair carries a keyword identification corresponding to a domain keyword in the keyword database;
a target model determining module 40, configured to train an initial text matching model for capturing the keyword identifiers based on the training sample pair, and determine the trained initial text matching model as a target text matching model; the target text matching model is subsequently used for predicting the matching degree of the obtained prediction sample pair.
Wherein, the training sample pair comprises first sample data and second sample data; the first sample data comprises initial sample data carrying a keyword identification; the second sample data comprises enhanced text data carrying a keyword identification;
the object model determination module 40 includes: a domain word recognition unit 401, a word segmentation feature extraction unit 402, a model training unit 403 and a target model determination unit 404;
a domain word recognition unit 401, configured to use the initial text matching model to take a domain keyword corresponding to the keyword identifier in the first sample data as a first domain keyword, and take a domain keyword corresponding to the keyword identifier in the second sample data as a second domain keyword;
a segmentation feature extraction unit 402, configured to obtain first segmentation feature information of a first segmentation in the first sample data and second segmentation feature information of a second segmentation in the second sample data;
a model training unit 403, configured to train a training sample pair based on the first segmentation feature information, the second segmentation feature information, the first domain keyword, the second domain keyword, and the initial text matching model, to obtain a training classification result;
the initial text matching model comprises a text matching model in a first service scenario; the text matching model in the first service scenario comprises a keyword attention layer, a fusion layer and a classification layer;
the model training unit 403 includes: a first attention output subunit 4031, a second attention output subunit 4032, a semantic feature fusion subunit 4033, and a fusion vector output subunit 4034; optionally, the model training unit 403 may further include: a first feature obtaining subunit 4035, a second feature obtaining subunit 4036, an interactive feature obtaining subunit 4037, a pooled vector output subunit 4038, a first related word determining subunit 4039, a second related word determining subunit 4040, a cross-related word determining subunit 4041, a first feature identification subunit 4042, and a second feature identification subunit 4043;
a first attention output subunit 4031, configured to input the first word segmentation feature information of the first word segmentation and the second word segmentation feature information of the second domain keyword into a keyword attention layer, and output first attention feature information corresponding to the keyword attention layer; the first attention characteristic information is used for representing the correlation between the second domain keyword and the first segmentation;
a second attention output subunit 4032, configured to input second word segmentation feature information of the second word segmentation and first word segmentation feature information of the first domain keyword into the keyword attention layer, and output second attention feature information corresponding to the keyword attention layer; the second attention characteristic information is used for representing the correlation between the first domain keyword and the second participle;
a semantic feature fusion subunit 4033, configured to acquire first semantic feature information of the first sample data and second semantic feature information of the second sample data, perform semantic fusion on the first semantic feature information and the second semantic feature information, and obtain fused semantic feature information;
and a fusion vector output subunit 4034, configured to input the first attention feature information, the second attention feature information, and the fused semantic feature information into the fusion layer, output a fusion feature vector corresponding to the training sample pair, and output a training classification result of the training sample pair through the classification layer.
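For illustration only, the keyword attention layer of subunits 4031 and 4032 can be sketched as follows; the dot-product scoring, the softmax normalization, and all array shapes are assumptions, since the embodiment does not fix a concrete attention formula:

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def keyword_attention(token_feats, keyword_feat):
    """Weight each participle of one sample by its relevance to the *other*
    sample's domain keyword and return the attention feature information.

    token_feats  : (num_tokens, dim) word segmentation feature vectors
    keyword_feat : (dim,) feature vector of the other sample's domain keyword
    """
    scores = token_feats @ keyword_feat   # relevance of each participle to the keyword
    weights = softmax(scores)             # attention weights over the participles
    return weights @ token_feats          # (dim,) attention feature information

# first attention feature information: first-sample participles vs. second domain keyword
rng = np.random.default_rng(0)
first_tokens = rng.random((5, 8))
second_keyword = rng.random(8)
first_attention = keyword_attention(first_tokens, second_keyword)
print(first_attention.shape)
```

The second attention feature information would be produced symmetrically, from the second sample's participles and the first domain keyword.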
Optionally, the first feature obtaining subunit 4035 is configured to determine, in the first sample data, first sub-position information of M first sub-participles and second sub-position information of the first domain keyword based on the first participle feature information, and obtain, according to the M first sub-position information and the second sub-position information, first autocorrelation feature information of a first autocorrelation word formed by the M first sub-participles and the first domain keyword in the first sample data;
a second feature obtaining subunit 4036, configured to determine, in the second sample data, third sub-position information of the N second sub-participles and fourth sub-position information of the second domain keyword based on the second participle feature information, and obtain, according to the N third sub-position information and the fourth sub-position information, second autocorrelation feature information of a second autocorrelation word formed by the N second sub-participles and the second domain keyword in the second sample data;
an interactive feature obtaining subunit 4037, configured to obtain interactive feature information corresponding to the cross-related words between the first sample data and the second sample data;
and the pooling vector output subunit 4038 is configured to take the first autocorrelation feature information, the second autocorrelation feature information, and the interactive feature information as input features of the average pooling layer, output a pooling vector corresponding to the average pooling layer, and train the training sample pair according to the pooling vector, the full connection layer, and the classification layer to obtain a training classification result.
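The pooling step of subunit 4038 can likewise be sketched. The single linear head below is a stand-in for the full connection layer and classification layer, whose real shapes the embodiment does not specify:

```python
import numpy as np

def pool_and_classify(auto1_feats, auto2_feats, inter_feats, weights, bias):
    """Average-pool the first autocorrelation features, second autocorrelation
    features, and interactive features into one pooling vector, then apply a
    linear head standing in for the full connection + classification layers."""
    stacked = np.vstack([auto1_feats, auto2_feats, inter_feats])
    pooled = stacked.mean(axis=0)        # output of the average pooling layer
    logits = weights @ pooled + bias     # full connection layer -> class scores
    return pooled, logits

pooled, logits = pool_and_classify(
    np.ones((2, 4)), np.zeros((2, 4)), np.ones((2, 4)) * 2,
    weights=np.ones((2, 4)), bias=np.zeros(2))
print(pooled, logits)
```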
Wherein the pooled vector output subunit 4038 includes: a first combined word screening subunit 40381, a second combined word determining unit 40382, a feature vector acquiring subunit 40383, a first adjustment training unit 40384, and a second adjustment training unit 40385;
a first combined word screening subunit 40381, configured to screen a combined word including a first domain keyword and a second domain keyword from a first auto-related word corresponding to the first auto-related feature information, a second auto-related word corresponding to the second auto-related feature information, and a cross-related word corresponding to the interactive feature information, and use the screened combined word as a first classified combined word to obtain first combined feature information corresponding to the first classified combined word;
a second combined word determining unit 40382, configured to use a combined word, excluding the first domain keyword and the second domain keyword, in the first auto-related word, the second auto-related word, and the cross-related word as a second classified combined word, and obtain second combined feature information corresponding to the second classified combined word;
a feature vector obtaining subunit 40383, configured to obtain a first feature vector corresponding to the first combined feature information and a second feature vector corresponding to the second combined feature information;
a first adjustment training unit 40384, configured to adjust a vector value in the second feature vector, use the vector value of the first feature vector and the vector value of the adjusted second feature vector as a first model parameter of the text matching model in a second service scenario, input the first feature vector and the adjusted second feature vector into the average pooling layer corresponding to the first model parameter, output a first pooling vector corresponding to the average pooling layer, and train the training sample pair according to the first pooling vector, the full connection layer, and the classification layer, so as to obtain a training classification result corresponding to the first model parameter;
a second adjustment training unit 40385, configured to adjust a vector value in the first feature vector if the training classification result corresponding to the first model parameter indicates that the first model parameter does not satisfy the convergence condition, use the vector value of the adjusted first feature vector and the vector value of the adjusted second feature vector as a second model parameter of the text matching model in the second service scenario, input the adjusted first feature vector and the adjusted second feature vector into the average pooling layer corresponding to the second model parameter, output a second pooling vector corresponding to the average pooling layer, and train the training sample pair according to the second pooling vector, the full connection layer, and the classification layer, to obtain a training classification result corresponding to the second model parameter.
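The alternating adjustment performed by units 40384 and 40385 resembles a coordinate-descent loop: one feature vector is adjusted while the other is held as part of the current model parameter, and the roles swap when the convergence condition is not met. A toy sketch, in which the mean-pooling objective, learning rate, and tolerance are all invented for illustration:

```python
import numpy as np

def alternating_adjust(v1, v2, target, lr=0.5, max_rounds=50, tol=1e-3):
    """Alternately adjust v2 then v1 so their pooled (mean) vector approaches
    a target -- a toy stand-in for alternating between the first and second
    model parameters until the convergence condition is satisfied."""
    v1 = np.asarray(v1, dtype=float).copy()
    v2 = np.asarray(v2, dtype=float).copy()
    for _ in range(max_rounds):
        error = (v1 + v2) / 2 - target
        v2 -= lr * error                 # first model parameter: v1 fixed, v2 adjusted
        error = (v1 + v2) / 2 - target
        if np.linalg.norm(error) < tol:  # convergence condition reached
            break
        v1 -= lr * error                 # not converged: second model parameter adjusts v1
    return v1, v2
```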
For specific implementation manners of the first combined word screening subunit 40381, the second combined word determining unit 40382, the feature vector obtaining subunit 40383, the first adjustment training unit 40384, and the second adjustment training unit 40385, reference may be made to the description of the specific process for performing the alternating training on the model in the embodiment corresponding to fig. 3, which will not be described again here.
Optionally, the first related word determining subunit 4039 is configured to perform word segmentation combination on the M first sub-participles and the first domain keyword in the first sample data according to the M first sub-position information and the second sub-position information, and use a combined word obtained after the word segmentation combination as a first auto-related word of the first sample data;
a second related word determining subunit 4040, configured to perform word segmentation combination on the N second sub-participles and the second domain keyword in the second sample data according to the N third sub-position information and the fourth sub-position information, and use a combined word obtained after the word segmentation combination as a second auto-related word of the second sample data;
a cross-related word determining subunit 4041, configured to perform word segmentation combination on the M first sub-participles, the first domain keyword, the N second sub-participles, and the second domain keyword, and use a combined word obtained after the word segmentation combination as a cross-related word between the first sample data and the second sample data.
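The relationship between auto-related words and cross-related words can be shown with a toy sketch; pairing each sub-participle with a domain keyword is only one possible combination scheme, and every name below is hypothetical:

```python
def auto_related_words(sub_tokens, domain_keyword):
    """Auto-related words: combinations formed inside one sample between its
    sub-participles and its own domain keyword."""
    return [(tok, domain_keyword) for tok in sub_tokens]

def cross_related_words(subs1, kw1, subs2, kw2):
    """Cross-related words: combinations that span both samples of the pair,
    mixing each side's sub-participles and domain keyword."""
    left = subs1 + [kw1]
    right = subs2 + [kw2]
    return [(a, b) for a in left for b in right]

first = auto_related_words(["play", "online"], "game")
cross = cross_related_words(["play"], "game", ["watch"], "video")
print(len(first), len(cross))
```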
Optionally, the first feature identification subunit 4042 is configured to, if the text matching model in the second service scenario identifies that a cross-related word having the same content as the first auto-related word exists among the cross-related words, perform feature identification on the identified cross-related word to obtain a first identified participle; the interactive feature information corresponding to the first identified participle is different from the first autocorrelation feature information of the first auto-related word corresponding to the first identified participle;
a second feature identification subunit 4043, configured to, if the text matching model in the second service scenario identifies that a cross-related word having the same content as the second auto-related word exists among the cross-related words, perform feature identification on the identified cross-related word to obtain a second identified participle; the interactive feature information corresponding to the second identified participle is different from the second autocorrelation feature information of the second auto-related word corresponding to the second identified participle.
Specific implementation manners of the first attention output subunit 4031, the second attention output subunit 4032, the semantic feature fusion subunit 4033, and the fusion vector output subunit 4034 may refer to the description of the first matching model in the embodiment corresponding to fig. 3, and will not be described again here. Optionally, for specific implementation manners of the first feature obtaining subunit 4035, the second feature obtaining subunit 4036, the interactive feature obtaining subunit 4037, and the pooling vector output subunit 4038, reference may be made to the description of the second matching model in the embodiment corresponding to fig. 3, which will not be described again here. Optionally, for specific implementation manners of the first related word determining subunit 4039, the second related word determining subunit 4040, the cross-related word determining subunit 4041, the first feature identification subunit 4042, and the second feature identification subunit 4043, reference may be made to the description of setting the feature identification in the embodiment corresponding to fig. 3, which will not be described again here.
And a target model determining unit 404, configured to determine the trained initial text matching model as a target text matching model when it is detected that the training classification result satisfies the classification convergence condition.
For specific implementation manners of the domain word recognition unit 401, the word segmentation feature extraction unit 402, the model training unit 403, and the target model determination unit 404, reference may be made to the description of the target text matching model in the embodiment corresponding to fig. 3, and details will not be further described here.
Optionally, the initial sample data is text data in a sample labeling area, and the sample labeling area is an area in a text database having an association relationship with the initial sample data;
the associated text acquisition module 50 is configured to determine a domain to which the initial sample data belongs as a first domain in the sample labeling area, and acquire an associated text matched with a domain label of the first domain from the text database; the text database comprises a second domain except the first domain;
a domain dictionary constructing module 60 configured to, based on a keyword screening condition associated with the text database, screen and determine a domain keyword matched with the first domain from candidate words formed by the participles of the associated text, and construct a first domain dictionary corresponding to the first domain based on the domain keyword matched with the first domain;
wherein, the domain dictionary constructing module 60 includes: a word segmentation processing unit 601, a candidate word screening unit 602, an influence determination unit 603, and a domain dictionary construction unit 604;
a word segmentation processing unit 601, configured to perform word segmentation processing on the associated text to obtain a word segmentation set associated with words of the associated text, combine each word segmentation in the word segmentation set to obtain a candidate word associated with the associated text, and determine a cross-correlation degree between each word segmentation in the candidate words;
a candidate word screening unit 602, configured to obtain a cross-correlation threshold in the keyword screening conditions associated with the text database, screen candidate words with a cross-correlation degree greater than the cross-correlation threshold from the candidate words, and use the screened candidate words as character strings to be processed;
an influence determining unit 603, configured to determine an influence degree of the to-be-processed character string in the first domain, screen, from the to-be-processed character string, a to-be-processed character string whose influence degree meets the keyword screening condition, and use the screened to-be-processed character string as a domain keyword matched with the first domain; the influence degree is determined by the frequency of the character string to be processed appearing in the first field and the frequency of the character string to be processed appearing in the second field;
a domain dictionary constructing unit 604, configured to construct a first domain dictionary corresponding to the first domain based on the domain keyword matched with the first domain.
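Since unit 603 only states that the influence degree is determined by the candidate string's frequency in the first domain and its frequency in the second domain, a smoothed frequency ratio is one plausible scoring choice; the whitespace tokenizer and the 0.7 threshold below are assumptions, not part of the embodiment:

```python
def influence_degree(term, first_domain_texts, second_domain_texts):
    """Score how specific a candidate string is to the first domain, from its
    frequency in the first domain versus its frequency in the second domain.
    A toy whitespace tokenizer; multi-word candidates would need real
    word segmentation."""
    f1 = sum(text.split().count(term) for text in first_domain_texts)
    f2 = sum(text.split().count(term) for text in second_domain_texts)
    return f1 / (f1 + f2 + 1e-9)   # smoothed ratio in [0, 1)

def build_first_domain_dictionary(candidates, first_texts, second_texts, threshold=0.7):
    """Keep candidate strings whose influence degree reaches the keyword
    screening condition; these form the first domain dictionary."""
    return {c for c in candidates
            if influence_degree(c, first_texts, second_texts) >= threshold}

first = ["the game server lagged", "game patch notes"]
second = ["the bank raised the interest rates"]
print(build_first_domain_dictionary(["game", "the"], first, second))
```

A generic word like "the" scores low because it is frequent in both domains, while "game" scores high, mirroring the intent of the influence screening.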
For specific implementation manners of the word segmentation processing unit 601, the candidate word screening unit 602, the influence determining unit 603, and the domain dictionary constructing unit 604, reference may be made to the description of constructing the first domain dictionary in the embodiment corresponding to fig. 3, and details will not be further described here.
And the keyword library determining module 70 is configured to obtain a second domain dictionary corresponding to the second domain, and determine a keyword database associated with the sample labeling area based on the first domain dictionary and the second domain dictionary.
Optionally, the text entry module 80 is configured to obtain third sample data entered by the target user through a target application corresponding to the second service scenario;
the text screening module 90 is configured to screen fourth sample data having a same domain label as the third sample data from a text library corresponding to the target application, and use the third sample data and the fourth sample data as a prediction sample pair; the fourth sample data is text data in a text database corresponding to the keyword database;
the matching degree prediction module 100 is configured to input the prediction sample pair into a target text matching model, and predict a matching degree between third sample data and fourth sample data in the prediction sample pair;
and a matching text returning module 110, configured to return the matching text corresponding to the fourth sample data to the user terminal corresponding to the target user based on the matching degree.
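Modules 80 to 110 together describe the serving path: screen candidates that share the domain label, score each prediction sample pair, and return the best match. A hedged sketch, with a word-overlap score standing in for the trained target text matching model (all helper names are hypothetical):

```python
def predict_matching_text(query, text_library, domain_of, match_score, top_k=1):
    """Screen fourth-sample candidates that share the query's domain label,
    score each (query, candidate) prediction sample pair, and return the
    best-matching texts."""
    candidates = [t for t in text_library if domain_of(t) == domain_of(query)]
    ranked = sorted(candidates, key=lambda t: match_score(query, t), reverse=True)
    return ranked[:top_k]

# toy stand-ins for the domain label and the trained matching model
def domain_of(text):
    return "game" if "game" in text else "other"

def match_score(a, b):
    return len(set(a.split()) & set(b.split()))

library = ["game patch notes", "new game release date", "bank rates"]
print(predict_matching_text("game release", library, domain_of, match_score))
```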
For specific implementation manners of the keyword recognition module 10, the relevancy determination module 20, and the training pair determination module 30 and the target model determination module 40, reference may be made to the description of steps S101 to S104 in the embodiment corresponding to fig. 3, and details will not be further described here. In addition, for a specific implementation manner of the associated text acquiring module 50, the domain dictionary constructing module 60, and the keyword library determining module 70, reference may be made to the description of the specific process for constructing the keyword database in the embodiment corresponding to fig. 3, which will not be described herein again. The specific implementation manners of the text entry module 80, the text screening module 90, the matching degree prediction module 100, and the matching text return module 110 may refer to the description of the target text matching model in the embodiment corresponding to fig. 9, and will not be further described here.
Further, please refer to fig. 11, which is a schematic structural diagram of a computer device according to an embodiment of the present application. As shown in fig. 11, the computer device 1000 may be the computer device in the above-described text processing system. The computer device 1000 may include: a processor 1001, a network interface 1004, and a memory 1005; in addition, the computer device 1000 may further include: a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The user interface 1003 may include a display screen (Display) and a keyboard (Keyboard); optionally, the user interface 1003 may also include a standard wired interface and a standard wireless interface. The network interface 1004 may optionally include a standard wired interface or a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM memory or a non-volatile memory, for example, at least one disk memory. Optionally, the memory 1005 may also be at least one storage device located remotely from the processor 1001. As shown in fig. 11, the memory 1005, which is a computer storage medium, may include an operating system, a network communication module, a user interface module, and a device control application program.
In the computer device 1000 shown in fig. 11, the network interface 1004 may provide a network communication function; the user interface 1003 is mainly an interface for providing input for a user; and the processor 1001 may be used to invoke the device control application stored in the memory 1005 to implement:
acquiring initial sample data, determining a first keyword in the initial sample data through a domain keyword in a keyword database, and acquiring candidate text data corresponding to a second keyword which has an incidence relation with the first keyword;
determining the association degree between the initial sample data and the candidate text data, screening the candidate text data of which the association degree meets the sample screening condition from the candidate text data, and taking the screened candidate text data as the enhanced text data corresponding to the initial sample data;
determining a training sample pair having an incidence relation with the keyword database according to the enhanced text data and the initial sample data; each sample data in the training sample pair carries a keyword identification corresponding to a domain keyword in the keyword database;
training an initial text matching model used for capturing the keyword identification based on the training sample pair, and determining the trained initial text matching model as a target text matching model; the target text matching model is subsequently used for predicting the matching degree of the obtained prediction sample pair.
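The first two steps above, keyword lookup and association-degree screening, can be sketched end to end. Reading the association degree as a keyword coverage ratio follows the later claims; the Jaccard form and both thresholds below are assumptions chosen for illustration:

```python
def find_domain_keywords(text, keyword_db):
    """Identify which domain keywords from the keyword database occur in the text."""
    return {kw for kw in keyword_db if kw in text}

def association_degree(sample_keywords, candidate_keywords):
    """Coverage ratio between the initial sample's keywords and a candidate
    text's keywords -- one plausible reading of the association degree."""
    union = sample_keywords | candidate_keywords
    return len(sample_keywords & candidate_keywords) / len(union) if union else 0.0

def enhance(sample, candidates, keyword_db, lo=0.2, hi=0.9):
    """Keep candidate texts whose association degree lies strictly between the
    first and second association thresholds; these become the enhanced text
    data for the sample."""
    s = find_domain_keywords(sample, keyword_db)
    return [c for c in candidates
            if lo < association_degree(s, find_domain_keywords(c, keyword_db)) < hi]

db = {"game", "server", "patch"}
print(enhance("game server down", ["game patch released", "weather today"], db))
```

Excluding candidates above the upper threshold drops near-duplicates, and excluding those below the lower threshold drops unrelated texts, which is the intent of the two-sided screening condition.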
It should be understood that the computer device 1000 described in this embodiment of the application may perform the description of the computer device in the embodiment corresponding to fig. 3 or fig. 8, and may also perform the description of the text data processing apparatus 1 in the embodiment corresponding to fig. 10, which is not described herein again. In addition, the beneficial effects of the same method are not described in detail.
Further, here, it is to be noted that: an embodiment of the present application further provides a computer storage medium, where the computer storage medium stores the aforementioned computer program executed by the text data processing apparatus 1, and the computer program includes program instructions, and when the processor executes the program instructions, the description of the text data processing method in the embodiment corresponding to fig. 3 or fig. 8 can be performed, so that details are not repeated here. In addition, the beneficial effects of the same method are not described in detail. For technical details not disclosed in the embodiments of the computer storage medium referred to in the present application, reference is made to the description of the embodiments of the method of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The above disclosure is only a preferred embodiment of the present application and is not intended to limit the scope of the claims of the present application; therefore, equivalent variations made in accordance with the claims of the present application shall still fall within the scope of the present application.

Claims (15)

1. A text data processing method, comprising:
obtaining initial sample data, determining a first keyword in the initial sample data through a domain keyword in a keyword database, and obtaining candidate text data corresponding to a second keyword which has an incidence relation with the first keyword;
determining the association degree between the initial sample data and the candidate text data, screening the candidate text data of which the association degree meets a sample screening condition from the candidate text data, and taking the screened candidate text data as the enhanced text data corresponding to the initial sample data;
determining a training sample pair having an incidence relation with the keyword database according to the enhanced text data and the initial sample data; each sample data in the training sample pair carries a keyword identification corresponding to a domain keyword in the keyword database;
training an initial text matching model used for capturing the keyword identification based on the training sample pair, and determining the trained initial text matching model as a target text matching model; and the target text matching model is subsequently used for predicting the matching degree of the obtained prediction sample pair.
2. The method of claim 1, wherein the initial sample data is text data in a sample labeling area, and the sample labeling area is an area in a text database having an association relationship with the initial sample data;
the method further comprises the following steps:
determining the domain to which the initial sample data belongs as a first domain in the sample labeling area, and acquiring an associated text matched with a domain label of the first domain from the text database; the text database comprises a second domain except the first domain;
screening and determining a domain keyword matched with the first domain from candidate words formed by word segmentation of the associated text based on a keyword screening condition associated with the text database, and constructing a first domain dictionary corresponding to the first domain based on the domain keyword matched with the first domain;
and acquiring a second domain dictionary corresponding to the second domain, and determining a keyword database associated with the sample labeling area based on the first domain dictionary and the second domain dictionary.
3. The method according to claim 2, wherein the screening and determining a domain keyword matching the first domain from the candidate words formed by the word segmentation of the associated text based on the keyword screening condition associated with the text database, and the constructing a first domain dictionary corresponding to the first domain based on the domain keyword matching the first domain comprises:
performing word segmentation processing on the associated text to obtain a word segmentation set associated with the word segmentation of the associated text, combining each word segmentation in the word segmentation set to obtain a candidate word associated with the associated text, and determining the cross correlation degree between each word segmentation in the candidate words;
acquiring a cross-correlation threshold value in a keyword screening condition associated with the text database, screening candidate words with cross-correlation degrees larger than the cross-correlation threshold value from the candidate words, and taking the screened candidate words as character strings to be processed;
determining the influence degree of the character string to be processed in the first domain, screening, from the character strings to be processed, the character string to be processed whose influence degree reaches the keyword screening condition, and taking the screened character string to be processed as a domain keyword matched with the first domain; the influence degree is determined by the frequency of the character string to be processed appearing in the first domain and the frequency of the character string to be processed appearing in the second domain;
and constructing a first domain dictionary corresponding to the first domain based on the domain keywords matched with the first domain.
4. The method according to claim 2, wherein the obtaining initial sample data, determining a first keyword in the initial sample data by a domain keyword in a keyword database, and obtaining candidate text data corresponding to a second keyword having an association relationship with the first keyword comprises:
acquiring initial sample data from the sample labeling area, acquiring the first domain dictionary from a keyword database, and identifying domain keywords in the initial sample data based on the first domain dictionary;
taking the domain keyword identified in the initial sample data as a first keyword;
acquiring a target associated text containing the first keyword from the associated texts contained in the keyword database, and taking a domain keyword in the target associated text as a second keyword;
and taking the target associated text containing the second keyword as candidate text data corresponding to the second keyword having an association relation with the first keyword.
5. The method according to claim 4, wherein the determining a degree of association between the initial sample data and the candidate text data, screening candidate text data of which the degree of association satisfies a sample screening condition from the candidate text data, and using the screened candidate text data as the enhanced text data corresponding to the initial sample data comprises:
determining the association degree between the initial sample data and the candidate text data according to the coverage ratio between the first keyword in the initial sample data and the second keyword in the candidate text data;
sorting the association degrees in the candidate text data to obtain text data to be processed corresponding to the candidate text data;
screening the text data to be processed with the association degree larger than a first association threshold value and smaller than a second association threshold value from the sorted text data to be processed;
taking the screened text data to be processed as enhanced text data corresponding to the initial sample data; the first association threshold is less than the second association threshold, and the first association threshold and the second association threshold are both thresholds in the sample screening condition.
6. The method of claim 1, wherein the training sample pair comprises first sample data and second sample data; the first sample data comprises initial sample data carrying keyword identification; the second sample data comprises enhanced text data carrying keyword identification;
the training of the initial text matching model for capturing the keyword identification based on the training sample pair and the determination of the trained initial text matching model as a target text matching model comprise:
using the initial text matching model to take the domain key words corresponding to the key word identifications in the first sample data as first domain key words and take the domain key words corresponding to the key word identifications in the second sample data as second domain key words;
acquiring first word segmentation characteristic information of a first word segmentation in the first sample data and second word segmentation characteristic information of a second word segmentation in the second sample data;
training the training sample pair based on the first word segmentation feature information, the second word segmentation feature information, the first domain keyword, the second domain keyword and the initial text matching model to obtain a training classification result;
and when the training classification result is detected to meet the classification convergence condition, determining the trained initial text matching model as a target text matching model.
7. The method of claim 6, wherein the initial text matching model comprises a text matching model in a first business scenario; the text matching model in the first business scenario comprises a keyword attention layer, a fusion layer and a classification layer;
training the training sample pair based on the first word segmentation feature information, the second word segmentation feature information, the first domain keyword, the second domain keyword and the initial text matching model to obtain a training classification result, including:
inputting first word segmentation characteristic information of the first word segmentation and second word segmentation characteristic information of the second domain keyword into the keyword attention layer, and outputting first attention characteristic information corresponding to the keyword attention layer; the first attention characteristic information is used for representing the correlation between the second domain keyword and the first segmentation;
inputting second word segmentation characteristic information of the second word segmentation and first word segmentation characteristic information of the first domain keyword into the keyword attention layer, and outputting second attention characteristic information corresponding to the keyword attention layer; the second attention characteristic information is used for representing the correlation between the first domain keyword and the second participle;
acquiring first semantic feature information of the first sample data and second semantic feature information of the second sample data, and performing semantic fusion on the first semantic feature information and the second semantic feature information to obtain fused semantic feature information;
inputting the first attention feature information, the second attention feature information and the fusion feature information into the fusion layer, outputting fusion feature vectors corresponding to the training sample pairs, and outputting training classification results of the training sample pairs through the classification layer.
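The claim language above is abstract; as a non-authoritative sketch of what claim 7's keyword attention layer could compute (the function names, dimensions and random toy embeddings are illustrative assumptions, not the patent's implementation), one text's segment vectors can be attended over using the other text's domain-keyword vector as the query, and the two attention features concatenated with a fused semantic feature before classification:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def keyword_attention(segment_vecs, keyword_vec):
    """Attend over one text's segment vectors, using the other
    text's domain-keyword vector as the query."""
    scores = segment_vecs @ keyword_vec   # one score per segment
    weights = softmax(scores)             # attention distribution
    return weights @ segment_vecs         # attention feature vector

# Hypothetical 4-dimensional embeddings for illustration only.
rng = np.random.default_rng(0)
first_text_segs = rng.normal(size=(5, 4))   # segments of the first sample
second_text_segs = rng.normal(size=(6, 4))  # segments of the second sample
first_keyword = rng.normal(size=4)          # first sample's domain keyword
second_keyword = rng.normal(size=4)         # second sample's domain keyword

first_attn = keyword_attention(first_text_segs, second_keyword)
second_attn = keyword_attention(second_text_segs, first_keyword)

# Fusion layer input: both attention features plus a fused semantic
# feature (mean-pooled here for simplicity); a linear classifier
# would then produce the training classification result.
fused_semantic = first_text_segs.mean(axis=0) + second_text_segs.mean(axis=0)
fusion_input = np.concatenate([first_attn, second_attn, fused_semantic])
print(fusion_input.shape)  # (12,)
```

The cross-wiring (second keyword attends over first text, and vice versa) mirrors the claim's statement that each attention feature represents the correlation between one text's segments and the other text's domain keyword.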
8. The method of claim 6, wherein the initial text matching model comprises a text matching model in a second business scenario; the text matching model in the second business scenario comprises a feature combination layer, an average pooling layer, a fully connected layer and a classification layer; the first word segments comprise M first sub-segments other than the first domain keyword; the second word segments comprise N second sub-segments other than the second domain keyword; M and N are positive integers;
the training the training sample pair based on the first word segment feature information, the second word segment feature information, the first domain keyword, the second domain keyword and the initial text matching model to obtain a training classification result comprises:
determining first sub-position information of the M first sub-segments and second sub-position information of the first domain keyword in the first sample data based on the first word segment feature information, and acquiring, according to the M pieces of first sub-position information and the second sub-position information, first autocorrelation feature information of first autocorrelation words formed by the M first sub-segments and the first domain keyword in the first sample data;
determining third sub-position information of the N second sub-segments and fourth sub-position information of the second domain keyword in the second sample data based on the second word segment feature information, and acquiring, according to the N pieces of third sub-position information and the fourth sub-position information, second autocorrelation feature information of second autocorrelation words formed by the N second sub-segments and the second domain keyword in the second sample data;
acquiring interaction feature information corresponding to cross-correlation words between the first sample data and the second sample data;
and taking the first autocorrelation feature information, the second autocorrelation feature information and the interaction feature information as input features of the average pooling layer, outputting a pooling vector corresponding to the average pooling layer, and training the training sample pair according to the pooling vector, the fully connected layer and the classification layer to obtain the training classification result.
9. The method of claim 8, further comprising:
performing word segment combination on the M first sub-segments and the first domain keyword in the first sample data according to the M pieces of first sub-position information and the second sub-position information, and taking the combined words obtained after the combination as the first autocorrelation words of the first sample data;
performing word segment combination on the N second sub-segments and the second domain keyword in the second sample data according to the N pieces of third sub-position information and the fourth sub-position information, and taking the combined words obtained after the combination as the second autocorrelation words of the second sample data;
and performing word segment combination on the M first sub-segments, the first domain keyword, the N second sub-segments and the second domain keyword, and taking the combined words obtained after the combination as the cross-correlation words between the first sample data and the second sample data.
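As a hedged illustration of the word-combination step in claims 8 and 9 (the pairing scheme and all example tokens are assumptions; the patent does not fix a specific combination rule), autocorrelation words can be read as within-text pairings of sub-segments with that text's domain keyword, and cross-correlation words as pairings drawn across the two texts:

```python
def combine_with_keyword(sub_segments, keyword):
    """Autocorrelation words: pair each sub-segment of one text
    with that text's domain keyword, preserving position order."""
    return [(seg, keyword) for seg in sub_segments]

def cross_combine(first_segs, first_kw, second_segs, second_kw):
    """Cross-correlation words: every combination of one element
    from the first text with one element from the second text."""
    left = first_segs + [first_kw]
    right = second_segs + [second_kw]
    return [(a, b) for a in left for b in right]

# Toy example: M = 3 sub-segments, N = 2 sub-segments, shared keyword.
first_auto = combine_with_keyword(["how", "to", "refund"], "refund-policy")
second_auto = combine_with_keyword(["cancel", "order"], "refund-policy")
cross = cross_combine(["how", "to", "refund"], "refund-policy",
                      ["cancel", "order"], "refund-policy")
print(len(first_auto), len(second_auto), len(cross))  # 3 2 12
```

With M = 3 and N = 2, the cross-correlation set has (M + 1) × (N + 1) = 12 combined words, since each side contributes its sub-segments plus its domain keyword.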
10. The method of claim 9, further comprising:
if the text matching model in the second business scenario identifies that a cross-correlation word with the same content as a first autocorrelation word exists among the cross-correlation words, performing feature tagging on the identified cross-correlation word to obtain a first tagged word segment; the interaction feature information corresponding to the first tagged word segment is different from the first autocorrelation feature information of the first autocorrelation word corresponding to the first tagged word segment;
if the text matching model in the second business scenario identifies that a cross-correlation word with the same content as a second autocorrelation word exists among the cross-correlation words, performing feature tagging on the identified cross-correlation word to obtain a second tagged word segment; and the interaction feature information corresponding to the second tagged word segment is different from the second autocorrelation feature information of the second autocorrelation word corresponding to the second tagged word segment.
11. The method of claim 9, wherein the taking the first autocorrelation feature information, the second autocorrelation feature information and the interaction feature information as input features of the average pooling layer, outputting a pooling vector corresponding to the average pooling layer, and training the training sample pair according to the pooling vector, the fully connected layer and the classification layer to obtain a training classification result comprises:
selecting combined words containing the first domain keyword and the second domain keyword from the first autocorrelation words corresponding to the first autocorrelation feature information, the second autocorrelation words corresponding to the second autocorrelation feature information and the cross-correlation words corresponding to the interaction feature information, taking the selected combined words as first classification combined words, and acquiring first combination feature information corresponding to the first classification combined words;
taking the combined words other than those containing the first domain keyword and the second domain keyword among the first autocorrelation words, the second autocorrelation words and the cross-correlation words as second classification combined words, and acquiring second combination feature information corresponding to the second classification combined words;
acquiring a first feature vector corresponding to the first combination feature information and a second feature vector corresponding to the second combination feature information;
adjusting vector values in the second feature vector, taking the vector values of the first feature vector and the adjusted vector values of the second feature vector as first model parameters of the text matching model in the second business scenario, inputting the first feature vector and the adjusted second feature vector into the average pooling layer corresponding to the first model parameters, outputting a first pooling vector corresponding to the average pooling layer, and training the training sample pair according to the first pooling vector, the fully connected layer and the classification layer to obtain a training classification result corresponding to the first model parameters;
and if the training classification result corresponding to the first model parameters indicates that the first model parameters do not meet the convergence condition, adjusting vector values in the first feature vector, taking the adjusted vector values of the first feature vector and the adjusted vector values of the second feature vector as second model parameters of the text matching model in the second business scenario, inputting the adjusted first feature vector and the adjusted second feature vector into the average pooling layer corresponding to the second model parameters, outputting a second pooling vector corresponding to the average pooling layer, and training the training sample pair according to the second pooling vector, the fully connected layer and the classification layer to obtain a training classification result corresponding to the second model parameters.
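A minimal sketch of the staged adjustment in claim 11 might look as follows. This is an assumption-laden illustration (the scaling factors, dimensions and sigmoid classifier are invented for the example; the patent only specifies that keyword-bearing and non-keyword feature vectors are adjusted in separate rounds):

```python
import numpy as np

def avg_pool_classify(feature_vecs, w, b):
    """Average-pool a stack of combination-word feature vectors,
    then score with a linear (fully connected) layer and a
    sigmoid binary classification layer."""
    pooled = np.mean(feature_vecs, axis=0)   # average pooling layer
    logit = pooled @ w + b                   # fully connected layer
    return 1.0 / (1.0 + np.exp(-logit))      # classification layer

rng = np.random.default_rng(1)
# Hypothetical 8-d features: keyword-bearing combinations vs. the rest.
keyword_feats = rng.normal(size=(4, 8))   # first classification combined words
other_feats = rng.normal(size=(6, 8))     # second classification combined words
w, b = rng.normal(size=8), 0.0

# Round one: adjust only the non-keyword feature vectors
# ("adjusting vector values in the second feature vector").
scale_other = 0.5
feats = np.vstack([keyword_feats, other_feats * scale_other])
p1 = avg_pool_classify(feats, w, b)

# If round one does not converge, round two also rescales the
# keyword-bearing vectors before pooling again.
scale_keyword = 0.9
feats2 = np.vstack([keyword_feats * scale_keyword, other_feats * scale_other])
p2 = avg_pool_classify(feats2, w, b)
print(0.0 < p1 < 1.0 and 0.0 < p2 < 1.0)  # True
```

The point of the two-stage scheme, as the claim describes it, is that the weights given to keyword-bearing combinations are themselves trainable model parameters, tuned only after the non-keyword weights fail to produce convergence.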
12. The method of claim 11, further comprising:
acquiring third sample data input by a target user through a target application corresponding to the second business scenario;
screening, from a text library corresponding to the target application, fourth sample data having the same domain label as the third sample data, and taking the third sample data and the fourth sample data as a prediction sample pair; the fourth sample data is text data in the text library corresponding to the keyword database;
inputting the prediction sample pair into the target text matching model, and predicting the matching degree between the third sample data and the fourth sample data in the prediction sample pair;
and returning an answer text corresponding to the fourth sample data to a user terminal corresponding to the target user based on the matching degree.
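The prediction flow of claim 12 can be sketched end to end. As a hedged illustration only (the threshold, the toy token-overlap matcher standing in for the trained text matching model, and the library entries are all invented for the example):

```python
def answer_query(query, query_domain, text_library, match_fn, threshold=0.5):
    """Claim 12 flow: filter the library by domain label, score each
    candidate with the trained matcher, and return the answer text
    of the best match above a threshold."""
    candidates = [t for t in text_library if t["domain"] == query_domain]
    if not candidates:
        return None
    scored = [(match_fn(query, t["text"]), t) for t in candidates]
    best_score, best = max(scored, key=lambda s: s[0])
    return best["answer"] if best_score >= threshold else None

def toy_match(a, b):
    """Token-overlap ratio, standing in for the target text matching model."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(len(ta | tb), 1)

library = [
    {"domain": "refund", "text": "how do I get a refund",
     "answer": "See refund policy."},
    {"domain": "shipping", "text": "where is my parcel",
     "answer": "Track it online."},
]
print(answer_query("how to get a refund", "refund", library, toy_match))
# prints "See refund policy."
```

Filtering by domain label before scoring is what keeps the prediction sample pairs consistent with the keyword database that tagged the training pairs.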
13. A text data processing apparatus, applied to a text processing system, comprising:
a keyword identification module, configured to acquire initial sample data, determine a first keyword in the initial sample data through domain keywords in a keyword database, and acquire candidate text data corresponding to a second keyword having an association relationship with the first keyword;
an association degree determining module, configured to determine the association degree between the initial sample data and the candidate text data, screen, from the candidate text data, candidate text data whose association degree meets a sample screening condition, and take the screened candidate text data as enhanced text data corresponding to the initial sample data;
a training pair determining module, configured to determine, according to the enhanced text data and the initial sample data, a training sample pair having an association relationship with the keyword database; each sample data in the training sample pair carries a keyword tag corresponding to a domain keyword in the keyword database;
and a target model determining module, configured to train, based on the training sample pair, an initial text matching model used for capturing the keyword tag, and determine the trained initial text matching model as a target text matching model; the target text matching model is subsequently used for predicting the matching degree of an obtained prediction sample pair.
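The first three modules of claim 13 chain into a data-enhancement pipeline. A non-authoritative sketch (the substring keyword lookup, the token-overlap relevance score and its 0.3 cutoff, and all sample strings are assumptions standing in for the patent's unspecified scoring):

```python
def build_training_pairs(initial_samples, keyword_db, text_library,
                         min_relevance=0.3):
    """Keyword identification -> association degree screening ->
    keyword-tagged training pairs, per claim 13's modules."""
    pairs = []
    for sample in initial_samples:
        # Keyword identification module: find a domain keyword in the sample.
        first_kw = next((kw for kw in keyword_db if kw in sample), None)
        if first_kw is None:
            continue
        related = keyword_db[first_kw]  # keywords associated with first_kw
        candidates = [t for t in text_library
                      if any(kw in t for kw in related)]
        # Association degree determining module (toy overlap score).
        for cand in candidates:
            overlap = len(set(sample.split()) & set(cand.split()))
            relevance = overlap / max(len(set(sample.split())), 1)
            if relevance >= min_relevance:
                # Training pair determining module: each side carries
                # its keyword tag.
                pairs.append(((sample, first_kw), (cand, related[0])))
    return pairs

kw_db = {"refund": ["return"], "shipping": ["delivery"]}
samples = ["how to refund my order"]
library = ["how to return my order", "delivery time estimate"]
pairs = build_training_pairs(samples, kw_db, library)
print(len(pairs))  # 1
```

The target model determining module would then consume `pairs` as the training sample pairs for the initial text matching model.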
14. A computer device, comprising: a processor, a memory and a network interface;
wherein the processor is connected to the memory and the network interface, the network interface is configured to provide data communication functions, the memory is configured to store a computer program, and the processor is configured to call the computer program to perform the method according to any one of claims 1 to 12.
15. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, perform the method according to any one of claims 1 to 12.
CN202010239303.4A 2020-03-30 2020-03-30 Text data processing method, device, equipment and storage medium Active CN111444326B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010239303.4A CN111444326B (en) 2020-03-30 2020-03-30 Text data processing method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111444326A true CN111444326A (en) 2020-07-24
CN111444326B CN111444326B (en) 2023-10-20

Family

ID=71649232

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010239303.4A Active CN111444326B (en) 2020-03-30 2020-03-30 Text data processing method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111444326B (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112115267A (en) * 2020-09-28 2020-12-22 平安科技(深圳)有限公司 Training method, device and equipment of text classification model and storage medium
CN112149400A (en) * 2020-09-23 2020-12-29 腾讯科技(深圳)有限公司 Data processing method, device, equipment and storage medium
CN112784911A (en) * 2021-01-29 2021-05-11 北京百度网讯科技有限公司 Training sample generation method and device, electronic equipment and storage medium
CN113011126A (en) * 2021-03-11 2021-06-22 腾讯科技(深圳)有限公司 Text processing method and device, electronic equipment and computer readable storage medium
CN113111966A (en) * 2021-04-29 2021-07-13 北京九章云极科技有限公司 Image processing method and image processing system
CN113536788A (en) * 2021-07-28 2021-10-22 平安科技(深圳)有限公司 Information processing method, device, storage medium and equipment
CN113553431A (en) * 2021-07-27 2021-10-26 深圳平安综合金融服务有限公司 User label extraction method, device, equipment and medium
CN113610503A (en) * 2021-08-11 2021-11-05 中国平安人寿保险股份有限公司 Resume information processing method, device, equipment and medium
WO2022165634A1 (en) * 2021-02-02 2022-08-11 Huawei Technologies Co., Ltd. Apparatus and method for type matching of a text sample
CN115859975A (en) * 2023-02-07 2023-03-28 支付宝(杭州)信息技术有限公司 Data processing method, device and equipment
CN117235237A (en) * 2023-11-10 2023-12-15 腾讯科技(深圳)有限公司 Text generation method and related device

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102169495A (en) * 2011-04-11 2011-08-31 趣拿开曼群岛有限公司 Industry dictionary generating method and device
CN104375989A (en) * 2014-12-01 2015-02-25 国家电网公司 Natural language text keyword association network construction system
CN108536708A (en) * 2017-03-03 2018-09-14 腾讯科技(深圳)有限公司 A kind of automatic question answering processing method and automatically request-answering system
CN108804512A (en) * 2018-04-20 2018-11-13 平安科技(深圳)有限公司 Generating means, method and the computer readable storage medium of textual classification model
CN109614478A (en) * 2018-12-18 2019-04-12 北京中科闻歌科技股份有限公司 Construction method, key word matching method and the device of term vector model
CN109829430A (en) * 2019-01-31 2019-05-31 中科人工智能创新技术研究院(青岛)有限公司 Cross-module state pedestrian based on isomery stratification attention mechanism recognition methods and system again
CN110222707A (en) * 2019-04-28 2019-09-10 平安科技(深圳)有限公司 A kind of text data Enhancement Method and device, electronic equipment


Also Published As

Publication number Publication date
CN111444326B (en) 2023-10-20

Similar Documents

Publication Publication Date Title
CN111444326B (en) Text data processing method, device, equipment and storage medium
CN113255755B (en) Multi-modal emotion classification method based on heterogeneous fusion network
CN110069709B (en) Intention recognition method, device, computer readable medium and electronic equipment
CN108959270A (en) A kind of entity link method based on deep learning
CN111190997A (en) Question-answering system implementation method using neural network and machine learning sequencing algorithm
CN112329824A (en) Multi-model fusion training method, text classification method and device
CN111666400B (en) Message acquisition method, device, computer equipment and storage medium
CN113469298B (en) Model training method and resource recommendation method
CN111625715B (en) Information extraction method and device, electronic equipment and storage medium
CN113051380B (en) Information generation method, device, electronic equipment and storage medium
CN112015928A (en) Information extraction method and device of multimedia resource, electronic equipment and storage medium
CN111460783B (en) Data processing method and device, computer equipment and storage medium
CN114330966A (en) Risk prediction method, device, equipment and readable storage medium
CN115131698A (en) Video attribute determination method, device, equipment and storage medium
CN114691864A (en) Text classification model training method and device and text classification method and device
CN114339450A (en) Video comment generation method, system, device and storage medium
CN115455171A (en) Method, device, equipment and medium for mutual retrieval and model training of text videos
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN116977701A (en) Video classification model training method, video classification method and device
CN112464106B (en) Object recommendation method and device
Xu et al. Estimating similarity of rich internet pages using visual information
CN114595370A (en) Model training and sorting method and device, electronic equipment and storage medium
CN113569091A (en) Video data processing method and device
CN111782762A (en) Method and device for determining similar questions in question answering application and electronic equipment
CN113836345A (en) Information processing apparatus, information processing method, and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40025832

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant