CN112052678A

CN112052678A - Model training and corpus processing method and device and computer equipment

Info

Publication number: CN112052678A
Application number: CN202011019127.XA
Authority: CN
Inventors: 张文瑜
Original assignee: Volkswagen Mobvoi Beijing Information Technology Co Ltd
Current assignee: Volkswagen Mobvoi Beijing Information Technology Co Ltd
Priority date: 2020-09-24
Filing date: 2020-09-24
Publication date: 2020-12-08

Abstract

The embodiment of the invention discloses a model training method, a corpus processing method, corresponding devices, and computer equipment. The corpus processing method comprises the following steps: obtaining a corpus to be processed, and inputting the corpus to be processed into a dependency relationship analysis model; performing dependency relationship analysis on the corpus to be processed through the dependency relationship analysis model; and determining abnormal corpora from the corpus to be processed according to the analysis result of the dependency relationship analysis model, wherein the dependency relationship analysis model is obtained through model training. The technical scheme of the embodiment of the invention can improve the efficiency and accuracy of corpus processing.

Description

Model training and corpus processing method and device and computer equipment

Technical Field

The embodiment of the invention relates to the technical field of text processing, in particular to a method and a device for model training and corpus processing and computer equipment.

Background

Corpus processing is a key technique that is widely applied in the technical field of text processing, and the efficiency of corpus processing directly affects how long text processing takes.

It can be understood that collected corpora to be processed inevitably include many abnormal corpora, i.e. corpora having no practical meaning, which may also be referred to as dirty corpora. The syntactic relations of an abnormal corpus are disordered, and the dependency relationships among its words are wrong. When the collected corpus to be processed is processed, abnormal corpora with disordered syntactic relations often need to be screened out and deleted, so that all corpora remaining in the corpus to be processed are normal corpora.

Most existing methods for screening abnormal corpora are manual. They require a large investment of labor and time, and sentence-by-sentence manual screening makes corpus processing inefficient. Although the prior art already has models for judging the dependency relationships of corpora, such a model performs dependency relationship analysis on normal corpora and abnormal corpora alike, so the correctness of its analysis result cannot be determined and abnormal corpora cannot be screened out.

Disclosure of Invention

The embodiment of the invention provides a method, a device and a computer device for model training and corpus processing, which aim to improve the processing efficiency and accuracy of corpus processing.

In a first aspect, an embodiment of the present invention provides a model training method, including:

obtaining corpus sample data with dependency relationship among word segmentation samples;

inputting the corpus sample data into a preset machine learning model for dependency relationship analysis training to obtain a dependency relationship analysis model;

wherein the dependency relationship analysis model is used for performing exception screening processing on abnormal corpora.

In a second aspect, an embodiment of the present invention further provides a corpus processing method, including:

obtaining a corpus to be processed, and inputting the corpus to be processed into a dependency relationship analysis model;

performing dependency relationship analysis on the corpus to be processed through the dependency relationship analysis model;

determining abnormal corpora from the corpora to be processed according to the analysis result of the dependency relationship analysis model;

wherein, the dependency relationship analysis model is a model obtained by training the model training method according to the first aspect.

In a third aspect, an embodiment of the present invention further provides a model training apparatus, including:

the corpus sample data acquisition module is used for acquiring corpus sample data with dependency relationship among word segmentation samples;

the dependency relationship analysis training module is used for inputting the corpus sample data into a preset machine learning model for dependency relationship analysis training to obtain a dependency relationship analysis model;

In a fourth aspect, an embodiment of the present invention further provides a corpus processing apparatus, including:

the to-be-processed corpus obtaining module is used for obtaining a corpus to be processed and inputting the corpus to be processed into the dependency relationship analysis model;

the dependency relationship analysis module is used for performing dependency relationship analysis on the corpus to be processed through the dependency relationship analysis model;

the abnormal corpus determining module is used for determining abnormal corpuses from the corpuses to be processed according to the analysis result of the dependency relationship analysis model;

In a fifth aspect, an embodiment of the present invention further provides a computer device, where the computer device includes:

one or more processors;

storage means for storing one or more programs;

when the one or more programs are executed by the one or more processors, the one or more processors implement the model training method or the corpus processing method provided by any embodiment of the present invention.

In a sixth aspect, an embodiment of the present invention further provides a computer storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the model training method or the corpus processing method provided in any embodiment of the present invention.

The embodiment of the invention obtains the dependency relationship analysis model by carrying out dependency relationship analysis training on the corpus sample data, and utilizes the dependency relationship analysis model obtained by training to carry out dependency relationship analysis on the corpus to be processed so as to determine the abnormal corpus from the corpus to be processed according to the analysis result, thereby solving the problems of high labor and time cost in the conventional method for screening the abnormal corpus from the corpus to be processed, realizing automatic identification and abnormal screening processing on the abnormal corpus, and further improving the processing efficiency and accuracy of corpus processing.

Drawings

FIG. 1 is a flowchart of a model training method according to an embodiment of the present invention;

FIG. 2 is a diagram illustrating the labeling result of a dependency relationship according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating a corpus processing method according to a second embodiment of the present invention;

FIG. 4 is a schematic diagram illustrating an effect of corpus processing through a dependency analysis model according to a second embodiment of the present invention;

FIG. 5 is a schematic diagram illustrating an effect of corpus processing through a dependency analysis model according to a second embodiment of the present invention;

FIG. 6 is a schematic diagram illustrating an effect of corpus processing through a dependency analysis model according to a second embodiment of the present invention;

FIG. 7 is a schematic diagram illustrating an effect of corpus processing through a dependency analysis model according to a second embodiment of the present invention;

FIG. 8 is a flowchart of a corpus processing method based on dependency relationship analysis according to a second embodiment of the present invention;

FIG. 9 is a schematic diagram of a model training apparatus according to a third embodiment of the present invention;

FIG. 10 is a diagram illustrating a corpus processing apparatus according to a fourth embodiment of the present invention;

FIG. 11 is a schematic structural diagram of a computer device according to a fifth embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting of the invention.

It should be further noted that, for the convenience of description, only some but not all of the relevant aspects of the present invention are shown in the drawings. Before discussing exemplary embodiments in more detail, it should be noted that some exemplary embodiments are described as processes or methods depicted as flowcharts. Although a flowchart may describe the operations (or steps) as a sequential process, many of the operations can be performed in parallel, concurrently or simultaneously. In addition, the order of the operations may be re-arranged. The process may be terminated when its operations are completed, but may have additional steps not included in the figure. The processes may correspond to methods, functions, procedures, subroutines, and the like.

Example one

Fig. 1 is a flowchart of a model training method according to an embodiment of the present invention. This embodiment is applicable to the case where dependency relationship analysis training is performed on corpus sample data to obtain a dependency relationship analysis model. The method may be executed by a model training apparatus, which may be implemented by software and/or hardware and may generally be integrated in a computer device. As shown in fig. 1, the method includes the following operations:

S110, obtaining corpus sample data with dependency relationship among word segmentation samples.

The corpus sample data may be standard corpus data used as sample data for model training. A word segmentation sample may be word segmentation data obtained by performing word segmentation processing on a corpus sample in the corpus sample data. Normal and reasonable dependency relationships exist among the word segmentation samples in the corpus sample data. A dependency relationship may be used to represent the syntactic structure among the word segmentation samples in the corpus sample data. For example, the dependency relationships may include structural relationships such as subject-predicate, verb-object, and indirect-object relationships, and the embodiment of the present invention does not limit the specific relationship types of the dependency relationship.

In the embodiment of the invention, corpus sample data with dependency relationships among word segmentation samples can be obtained for training the dependency relationship analysis model. The word segmentation samples included in each corpus sample in the corpus sample data have correct dependency relationships. Optionally, two word segmentation samples having a dependency relationship may form a dependency pair, in which one word segmentation sample is the dominant word and the other is the modifier, which may also be referred to as the dependent word.
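As an illustration only, a dependency pair as described above could be held in a small record type. The class name `DependencyPair`, its field names, and the example words are assumptions made for this sketch and are not taken from the embodiment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DependencyPair:
    """Two word segmentation samples linked by one dependency relationship."""
    dominant: str   # the dominant word of the pair
    dependent: str  # the modifier, i.e. the dependent word
    relation: str   # relationship label; the label used below is illustrative


# Hypothetical example: an adjective ("huge") modifying a noun ("potential").
pair = DependencyPair(dominant="potential", dependent="huge", relation="ATT")
print(pair)
```

Making the record immutable (`frozen=True`) lets pairs be stored in sets or used as dictionary keys, which is convenient if pairs are later collected into a lookup structure.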

In an optional embodiment of the present invention, obtaining corpus sample data with dependency relationship between word segmentation samples may include: acquiring standard corpus sample data; performing word segmentation processing on standard corpus sample data to obtain word segmentation sample data; performing part-of-speech tagging processing on the word segmentation sample data to obtain part-of-speech tagging result sample data; and determining the dependency relationship among the word segmentation samples in the word segmentation sample data according to the part of speech tagging result sample data to obtain the corpus sample data.

The standard corpus sample data may be standard corpus data used for training a model in an existing corpus, for example, sample data composed of corpora such as a news broadcast corpus and/or a textbook corpus, and the content of the standard corpus sample data is not specifically limited in the embodiment of the present invention. The standard corpus sample data may include a plurality of standard corpus samples. The word segmentation sample data may be word segmentation data obtained by performing word segmentation processing on the standard corpus sample data, and may include a plurality of word segmentation samples. The part-of-speech tagging result sample data may be part-of-speech tagging data obtained by performing part-of-speech tagging processing on the word segmentation sample data. Correspondingly, the part-of-speech tagging result sample data may include a plurality of part-of-speech tagging result samples.

Specifically, when obtaining the corpus sample data, the standard corpus sample data may be obtained first, and the word segmentation sample data is obtained by performing word segmentation processing on the standard corpus sample data. And after word segmentation sample data is obtained, performing part-of-speech tagging processing on the word segmentation sample data to obtain corresponding part-of-speech tagging result sample data. Furthermore, the dependency relationship among the word segmentation samples is determined according to the part of speech tagging result sample data, and finally the corpus sample data can be obtained.

At present, the two mainstream specifications for part-of-speech tagging are the Penn Treebank specification of the University of Pennsylvania and the word segmentation and part-of-speech tagging specification of Peking University. In a specific example, an obtained news broadcast corpus is used as the standard corpus sample data and is segmented to obtain word segmentation sample data; part-of-speech tagging is performed on the word segmentation sample data according to the Peking University word segmentation and part-of-speech tagging specification to obtain part-of-speech tagging result sample data; and finally, the dependency relationships among the word segmentation samples are determined according to the part-of-speech tagging result sample data to obtain the corpus sample data. In order to improve the model training efficiency, the word segmentation processing and the part-of-speech tagging processing of the standard corpus samples can be completed automatically by computer software.

Table 1 is a word segmentation and part-of-speech tagging specification table provided in the first embodiment of the present invention. As shown in Table 1, the part-of-speech tagging specification of Peking University includes 26 part-of-speech tags. The word segmentation sample data corresponding to the standard corpus sample data can be tagged according to these 26 part-of-speech tags to obtain the part-of-speech tagging result sample data. For example, if a standard corpus sample in the standard corpus sample data is "the economic development of Xiong'an New Area has huge potential", performing part-of-speech tagging processing on the word segmentation samples of this standard corpus sample may yield the part-of-speech tagging result sample "Xiong'an/ns New Area/n economy/n development/v has/v huge/a de/u potential/n", where "de" stands for the structural particle of the original sentence.

TABLE 1
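The "word/tag" notation used for part-of-speech tagging result samples can be read back into pairs with a few lines of code. The following is a minimal sketch that assumes tokens are separated by whitespace and that the tag follows the last slash; the function name `parse_tagged` and the English glosses standing in for the original words are hypothetical.

```python
def parse_tagged(text):
    """Split a 'word/tag word/tag ...' string into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition splits on the LAST slash, so a word may itself contain '/'.
        word, sep, tag = token.rpartition("/")
        if not sep or not word or not tag:
            raise ValueError(f"token is not in word/tag form: {token!r}")
        pairs.append((word, tag))
    return pairs


# English glosses stand in for the original words (an assumption of this sketch).
sample = "Xiong'an/ns New-Area/n economy/n development/v has/v huge/a de/u potential/n"
tagged = parse_tagged(sample)
for word, tag in tagged:
    print(word, tag)
```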

In an optional embodiment of the present invention, determining a dependency relationship between the word segmentation samples in the word segmentation sample data according to the part of speech tagging result sample data may include: determining the front-back sequence relation and/or the modification relation of each word segmentation sample according to the part-of-speech tagging result sample data; and determining the dependency relationship among the word segmentation samples according to the front-back sequence relationship and/or the modification relationship of the word segmentation samples.

The front-back order relationship may be the positional relationship between word segmentation samples in the same corpus. The modification relationship may be a grammatical relationship in which one word segmentation sample modifies another. For example, an adjective modifies a noun, an adverb modifies an adjective, an adverb modifies a verb, and the like, and the embodiment of the present invention does not limit the specific type of the modification relationship.

Specifically, after the part-of-speech tagging processing is performed on the word segmentation sample data to obtain the part-of-speech tagging result sample data, the front-back order relationship and/or the modification relationship between the word segmentation samples can be determined according to the part-of-speech tagging result sample data, so that the dependency relationships between the word segmentation samples are determined according to these relationships.
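To make the idea concrete, the sketch below combines the front-back order relationship (adjacent position) with the modification relationship (adjective before noun, adverb before adjective or verb) to propose dependency relationships from tagged words. The rule table, the tag "d" for adverbs, and the mapping of rules to the labels ATT and ADV are assumptions of this sketch, not the rules of the embodiment.

```python
# Hypothetical modifier rules: (modifier tag, head tag) -> relationship label.
MODIFIER_RULES = {
    ("a", "n"): "ATT",  # adjective modifies the noun that follows it
    ("d", "a"): "ADV",  # adverb modifies the adjective that follows it
    ("d", "v"): "ADV",  # adverb modifies the verb that follows it
}


def adjacent_modifications(tagged):
    """Return (modifier, head, label) for adjacent pairs that match a rule."""
    found = []
    # zip(tagged, tagged[1:]) walks the words in their front-back order.
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        label = MODIFIER_RULES.get((t1, t2))
        if label is not None:
            found.append((w1, w2, label))
    return found


tagged = [("very", "d"), ("good", "a"), ("idea", "n")]
relations = adjacent_modifications(tagged)
print(relations)
```

A real dependency analysis would also need to handle non-adjacent words; this sketch only covers the adjacent case.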

Table 2 is a dependency relationship tag table provided in an embodiment of the present invention. In a specific example, the dependency relationship tag table shown in Table 2, which includes 30 dependency relationship tags, may be used to determine, from the part-of-speech tagging result sample data, the dependency relationships between the word segmentation samples in the word segmentation sample data. The specific rules for determining the dependency relationships may follow the existing Penn Treebank dependency annotation system.

TABLE 2

Fig. 2 is a schematic diagram of a labeling result of a dependency relationship according to an embodiment of the present invention. In a specific example, as shown in fig. 2, assume that a standard corpus sample is "the economic development of Xiong'an New Area has huge potential". Word segmentation processing is performed on the standard corpus sample to obtain word segmentation samples, and the dependency relationships among the word segmentation samples are then labeled to obtain the dependency relationship labeling result of fig. 2. That is, there is an ATT dependency between "Xiong'an" and "New Area", an SBV dependency between "development" and "has", a VOB dependency between "has" and "potential", and a RAD dependency between "huge" and the structural particle that follows it. A dependency relationship may be represented by a directed arc, which may also be referred to as a dependency arc, and the dependency arc may point from the dependent word to the dominant word. For example, in the dependency pair formed by "Xiong'an" and "New Area", the dependent word is "New Area" and the dominant word is "Xiong'an", so the dependency arc points from "New Area" to "Xiong'an".

In an optional embodiment of the present invention, after determining the dependency relationship between the word segmentation samples according to the context sequence relationship and/or the modification relationship of the word segmentation samples, the method may further include: and establishing a dependency relationship database according to the dependency relationship among the word segmentation samples.

The dependency relationship database may be a data set that stores the word segmentation samples and the dependency relationships among them.

Correspondingly, after the dependency relationships among the word segmentation samples are determined according to their front-back order relationships and/or modification relationships, the dependency relationships can be stored to establish a dependency relationship database. The dependency relationship database can serve as the matching standard against which the dependency relationship analysis model determines abnormal corpora.
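One simple way to picture such a database is an in-memory mapping from word pairs to the relationship labels observed for them, which can later be queried as a matching standard. The class `DependencyDatabase`, its methods, and the stored example are hypothetical; the embodiment does not prescribe a storage format or a matching rule.

```python
from collections import defaultdict


class DependencyDatabase:
    """In-memory store of dependency relationships between word pairs."""

    def __init__(self):
        # (dominant word, dependent word) -> set of relationship labels
        self._relations = defaultdict(set)

    def add(self, dominant, dependent, relation):
        """Record one dependency relationship seen in the sample data."""
        self._relations[(dominant, dependent)].add(relation)

    def contains(self, dominant, dependent, relation):
        """Return True if this exact relationship was stored before."""
        # .get avoids creating empty entries for pairs that were never stored.
        return relation in self._relations.get((dominant, dependent), set())


db = DependencyDatabase()
db.add("potential", "huge", "ATT")          # hypothetical stored relationship
print(db.contains("potential", "huge", "ATT"))   # stored earlier
print(db.contains("potential", "huge", "SBV"))   # never stored
```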

S120, inputting the corpus sample data into a preset machine learning model for dependency relationship analysis training to obtain a dependency relationship analysis model; and the dependency relationship analysis model is used for carrying out exception screening processing on the exception corpora.

The preset machine learning model may be any type of machine learning model as long as dependency relationship analysis training can be performed on it, and the specific type of the preset machine learning model is not limited in the embodiment of the present invention. It should be noted that the preset machine learning model may be an original model to be trained, an existing mature model, or a model obtained by converting a phrase structure treebank model or another treebank model. The dependency relationship analysis model may be the mature model obtained after the preset machine learning model undergoes dependency relationship analysis training. The training may use, for example, hold-out cross-validation, K-fold cross-validation, or leave-one-out cross-validation. Exception screening processing may be a type of processing used for corpus classification. For example, the exception screening processing may screen out abnormal corpora that have no actual meaning.
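The three validation schemes named above differ only in how sample indices are divided into training and validation parts. The following is a minimal sketch in plain Python; the function names and the default validation fraction are assumptions, and no shuffling is performed, which a practical setup would usually add.

```python
def hold_out_split(n_samples, validation_fraction=0.2):
    """Hold-out: reserve the last part of the samples for validation."""
    cut = n_samples - int(n_samples * validation_fraction)
    return list(range(cut)), list(range(cut, n_samples))


def k_fold_splits(n_samples, k):
    """K-fold: yield (train_indices, validation_indices) for each fold."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


def leave_one_out_splits(n_samples):
    """Leave-one-out is K-fold with one sample per fold (k = n_samples)."""
    return k_fold_splits(n_samples, n_samples)


folds = list(k_fold_splits(10, 5))
loo = list(leave_one_out_splits(4))
train, validation = hold_out_split(10)
print(len(folds), len(loo), len(train), len(validation))
```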

Specifically, after the corpus sample data is obtained, the corpus sample data can be input into a preset machine learning model for dependency analysis training, and the dependency analysis model is obtained after the dependency analysis training is completed. The dependency analysis model can be used to screen the abnormal corpus according to the analyzed dependency.

For example, the corpus sample data used for dependency relationship training may be on the following scale: 8301 newscast corpora, in which the average sentence length is 31 words, and 10754 textbook corpora, in which the average sentence length is 14 words.
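A rough sense of the overall training volume can be worked out from these figures. The sketch below assumes that each counted corpus item is a single sentence and that the stated averages are exact, neither of which the text confirms, so the totals are estimates only.

```python
# Figures quoted in the text above.
newscast_items, newscast_avg_words = 8301, 31
textbook_items, textbook_avg_words = 10754, 14

# Assumption: one corpus item corresponds to one sentence.
total_items = newscast_items + textbook_items
approx_total_words = (newscast_items * newscast_avg_words
                      + textbook_items * textbook_avg_words)
overall_avg_words = approx_total_words / total_items

print(total_items)                  # 19055
print(approx_total_words)           # 407887
print(round(overall_avg_words, 1))  # 21.4
```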

In an optional embodiment of the present invention, before inputting the corpus sample data into the preset machine learning model for performing the dependency analysis training, the method may further include: inputting the word segmentation sample data into a preset machine learning model for word segmentation training; and inputting the sample data of the part-of-speech tagging result into a preset machine learning model for part-of-speech tagging training.

The word segmentation training can be to train a preset machine learning model according to a word segmentation sample, and the preset machine learning model has the word segmentation processing capability on the corpus after training. The part-of-speech tagging training can be to train a preset machine learning model according to part-of-speech tagging result sample data, and the preset machine learning model has the processing capability of performing part-of-speech tagging on participles in the corpus after training.

In the embodiment of the invention, before dependency relationship analysis training is performed, the model first needs to be trained to acquire word segmentation capability and part-of-speech tagging capability. Specifically, the word segmentation sample data can be input into the preset machine learning model for word segmentation training, and after the preset machine learning model has acquired mature word segmentation capability, the part-of-speech tagging result sample data is input into the preset machine learning model for part-of-speech tagging training. At this point, the preset machine learning model has both word segmentation capability and part-of-speech tagging capability. On this basis, dependency relationship training is performed on the preset machine learning model, and the dependency relationship analysis model finally obtained has word segmentation capability, part-of-speech tagging capability, and dependency relationship analysis capability at the same time.
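The ordering constraint described above (word segmentation first, then part-of-speech tagging, then dependency relationship analysis) can be sketched as a small guard around the training calls. The class `StagedTrainer` and the stage names are hypothetical, and the actual parameter fitting is deliberately left as a placeholder.

```python
STAGES = ("word_segmentation", "pos_tagging", "dependency_analysis")


class StagedTrainer:
    """Enforces that training stages run in the order given by STAGES."""

    def __init__(self):
        self.completed = []

    def train(self, stage, samples):
        if len(self.completed) == len(STAGES):
            raise RuntimeError("all stages are already trained")
        expected = STAGES[len(self.completed)]
        if stage != expected:
            raise ValueError(f"expected stage {expected!r}, got {stage!r}")
        # Placeholder: a real implementation would fit the model on `samples`.
        self.completed.append(stage)


trainer = StagedTrainer()
trainer.train("word_segmentation", samples=[])
trainer.train("pos_tagging", samples=[])
trainer.train("dependency_analysis", samples=[])
print(trainer.completed)
```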

At present, existing tools for analyzing the dependency relationships of corpora can produce dependency relationship judgments, but those judgments may be wrong: their accuracy does not reach 90%, which leaves an error margin of about 10%. Meanwhile, an existing dependency parser processes an abnormal corpus in the same way as a normal corpus and outputs dependency relationships for it. That is, the conventional dependency parser cannot perform exception screening processing on abnormal corpora. The dependency relationship analysis model of the embodiment of the invention can be used for screening abnormal corpora, which greatly reduces the labor cost and the time cost of manually screening abnormal corpora, avoids performing ordinary dependency relationship analysis on abnormal corpora, and effectively improves both the accuracy and the efficiency of corpus dependency relationship analysis.

According to this technical scheme, a dependency relationship analysis model is obtained by performing dependency relationship analysis training on the word segmentation sample data in the corpus sample data, and the resulting model can be used for exception screening processing of abnormal corpora. This solves the problem of high labor and time cost in existing methods for screening abnormal corpora from a corpus to be processed. Screening abnormal corpora through the trained dependency relationship analysis model effectively improves both the accuracy and the efficiency of corpus dependency relationship analysis.

Example two

Fig. 3 is a flowchart of a corpus processing method according to a second embodiment of the present invention. The method is applicable to the case where abnormal corpora are screened from a corpus to be processed. The method may be executed by a corpus processing apparatus, which may be implemented by software and/or hardware and may generally be integrated in a computer device. As shown in fig. 3, the method includes the following operations:

S210, obtaining a corpus to be processed, and inputting the corpus to be processed into the dependency relationship analysis model.

The corpus to be processed may be original corpus data from which abnormal corpora need to be screened out, and the corpus data may be voice data acquired in real time. It is understood that the corpus to be processed may include a plurality of sentences.

In the embodiment of the present invention, the obtained corpus to be processed may be input into the dependency relationship analysis model, so that abnormal corpora are automatically screened from the corpus to be processed through the dependency relationship analysis model.

S220, performing dependency relationship analysis on the corpus to be processed through the dependency relationship analysis model.

Correspondingly, after the obtained corpus to be processed is input into the dependency relationship analysis model, the dependency relationship analysis model can perform dependency relationship analysis on the corpus to be processed.

In an optional embodiment of the present invention, performing dependency relationship analysis on the corpus to be processed through the dependency relationship analysis model may include: performing word segmentation processing on the corpus to be processed through the dependency relationship analysis model to obtain corpus word segments; performing part-of-speech tagging on the corpus word segments to obtain a corpus part-of-speech tagging result; and determining the dependency relationships among the corpus word segments according to the corpus part-of-speech tagging result.

The corpus word segments may be word segmentation data obtained by performing word segmentation processing on the corpus to be processed. The corpus part-of-speech tagging result may be part-of-speech tagging data obtained by performing part-of-speech tagging on the corpus word segments.

Because the dependency relationship analysis model has word segmentation capability, part-of-speech tagging capability, and dependency relationship analysis capability at the same time, when it performs dependency relationship analysis it can first segment the corpus to be processed to obtain the corpus word segments, then perform part-of-speech tagging on the corpus word segments to obtain the corpus part-of-speech tagging result, and finally determine the dependency relationships among the corpus word segments according to the corpus part-of-speech tagging result.
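The three processing steps above form a fixed pipeline in which each step consumes the output of the previous one. The sketch below shows only that composition; the function `analyze`, the toy stand-ins for each step, the tags "r" and "d", and the example sentence are all assumptions and do not represent the trained model.

```python
def analyze(sentence, segment, tag, parse):
    """Run word segmentation, then tagging, then dependency parsing."""
    words = segment(sentence)
    tagged = tag(words)
    return parse(tagged)


# Toy stand-ins (assumptions): whitespace segmentation, dictionary tagging,
# and one rule linking an adverb to the adjective that directly follows it.
TOY_TAGS = {"he": "r", "very": "d", "good": "a"}


def toy_segment(sentence):
    return sentence.split()


def toy_tag(words):
    return [(word, TOY_TAGS.get(word, "n")) for word in words]


def toy_parse(tagged):
    arcs = []
    for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
        if (t1, t2) == ("d", "a"):
            arcs.append((w1, w2, "ADV"))
    return arcs


result = analyze("he very good", toy_segment, toy_tag, toy_parse)
print(result)
```

Passing the three steps in as callables keeps the pipeline order explicit while leaving each step replaceable, for example by components of a trained model.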

In order to show the results of the word segmentation processing, the part-of-speech tagging processing, and the dependency relationship analysis among corpus word segments performed by the dependency relationship analysis model, the dependency relationship analysis model may be used to analyze a plurality of sentences in the corpus to be processed, yielding the following examples of processing results:

Fig. 4 is a schematic diagram illustrating an effect of performing corpus processing through a dependency analysis model according to a second embodiment of the present invention. In a specific example, as shown in fig. 4, assume that one of the to-be-processed sentences in the corpus to be processed is "there is a great potential for the development of the person", and the dependency relationship analysis model is used to perform dependency relationship analysis on the sentence. The model first segments the sentence to obtain the corpus word segments, then performs part-of-speech tagging on them to obtain the corpus part-of-speech tagging result, determines the dependency relationships among the corpus word segments according to that result, and labels those relationships, finally obtaining the dependency relationship labeling result shown in fig. 4. That is, there is an ATT dependency between "this" and "person", an SBV dependency between "develop" and "have", a RAD dependency between "giant" and "person", and a RAD dependency between "person" and "person".

Fig. 5 is a schematic diagram illustrating an effect of performing corpus processing through a dependency analysis model according to a second embodiment of the present invention. In a specific example, as shown in fig. 5, assume that one of the to-be-processed sentences in the corpus to be processed is "the economic development of Xiong'an New Area is very good", and the dependency relationship analysis model is used to perform dependency relationship analysis on the sentence. The model first segments the sentence to obtain the corpus word segments, then performs part-of-speech tagging on them to obtain the corpus part-of-speech tagging result, determines the dependency relationships among the corpus word segments according to that result, and labels those relationships, finally obtaining the dependency relationship labeling result of fig. 5. That is, there is an ADV dependency between "very" and "good", an SBV dependency between "develop" and "good", and an ATT dependency between "New Area" and "develop".

Fig. 6 is a schematic diagram illustrating an effect of performing corpus processing through a dependency analysis model according to a second embodiment of the present invention. In a specific example, as shown in fig. 6, assuming that one of the to-be-processed sentences in the to-be-processed corpus is "the person is good for him", the dependency analysis model is used to perform dependency analysis on the to-be-processed sentence. The dependency relationship analysis model firstly performs word segmentation on the to-be-processed sentence to obtain each corpus participle, then performs part-of-speech tagging on the corpus participle to obtain a corpus part-of-speech tagging result, further determines the dependency relationship between the corpus participles according to the corpus part-of-speech tagging result, further performs tagging of the corpus participle dependency relationship according to the dependency relationship between the corpus participles, and finally obtains the dependency relationship tagging result of fig. 6. That is, there is ATT dependency between "he" and "person," ADV dependency between "very" and "good", and SBV dependency between "person" and "good".

Fig. 7 is a schematic diagram illustrating an effect of performing corpus processing through a dependency analysis model according to a second embodiment of the present invention. In a specific example, as shown in fig. 7, assuming that one of the sentences to be processed in the corpus to be processed is "how Beijing people like to eat roast ducks", the dependency relationship analysis model is used to perform dependency relationship analysis on the sentences to be processed. The dependency relationship analysis model firstly performs word segmentation on the to-be-processed sentence to obtain each corpus participle, then performs part-of-speech tagging on the corpus participle to obtain a corpus part-of-speech tagging result, further determines the dependency relationship between the corpus participles according to the corpus part-of-speech tagging result, further performs tagging of the corpus participle dependency relationship according to the dependency relationship between the corpus participles, and finally obtains the dependency relationship tagging result of fig. 7. Namely, ATT dependency relationship exists between 'Beijing' and 'person', VOB dependency relationship exists between 'eat' and 'roast duck', and SBV dependency relationship exists between 'person' and 'like'.
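The tagging results described for figs. 4 to 7 can be viewed as a list of labeled arcs, each joining two corpus participles under one relation label. The following is a minimal sketch that records the three arcs stated for fig. 7. The class name `DependencyArc`, its field names, and the helper `arcs_with_label` are hypothetical. The text does not say which participle of each pair is the head, so the fields simply follow the order in which each pair is named.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DependencyArc:
    """One labeled relation between two corpus participles (hypothetical type)."""
    first: str   # participle named first in the description
    second: str  # participle named second in the description
    label: str   # relation label, e.g. ATT, SBV, VOB, ADV, RAD


# Arcs stated in the text for the fig. 7 sentence.
FIG7_ARCS: List[DependencyArc] = [
    DependencyArc("Beijing", "person", "ATT"),
    DependencyArc("eat", "roast duck", "VOB"),
    DependencyArc("person", "like", "SBV"),
]


def arcs_with_label(arcs: List[DependencyArc], label: str) -> List[DependencyArc]:
    """Return the arcs that carry the given relation label."""
    return [arc for arc in arcs if arc.label == label]


if __name__ == "__main__":
    for arc in FIG7_ARCS:
        print(f"{arc.label}: {arc.first} - {arc.second}")
```

Keeping each arc as an immutable record makes it straightforward to compare an analysis result against stored relations in the later matching step.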

S230, determining abnormal corpora from the corpora to be processed according to the analysis result of the dependency relationship analysis model.

The dependency relationship analysis model is a model obtained by training the model training method of any embodiment.

The abnormal corpus may be a corpus without actual meaning in the to-be-processed corpus.

Specifically, after the dependency relationship analysis model determines the dependency relationship between the corpus participles of each corpus to be processed, the abnormal corpus may be determined from the corpus to be processed according to the analysis result of the dependency relationship.

In an optional embodiment of the present invention, determining a dependency relationship between the corpus participles according to the corpus part-of-speech tagging result may include: determining the front-back sequence relation and/or the modification relation of each corpus participle according to the corpus part-of-speech tagging result; determining abnormal corpora from the corpora to be processed according to the analysis result of the dependency relationship analysis model, including: matching the front-back sequence relation and/or the modification relation of each corpus participle with the dependency relation stored in the dependency relation database; if the matching is determined to be successful, marking the linguistic data to be processed as normal linguistic data; otherwise, marking the linguistic data to be processed as abnormal linguistic data.

The normal corpus can indicate that the corpus to be processed is a corpus with correct grammar, clear semantics or actual significance. The abnormal corpus may indicate that the corpus to be processed is a corpus that may have problems of grammatical errors, unclear semantics, or no practical meaning.

Correspondingly, the dependency relationship analysis model may first determine the front-back order relationship and/or the modification relationship of the corpus participles according to the corpus part-of-speech tagging result, and then match the front-back order relationship and/or the modification relationship of each corpus participle against the dependency relationships stored in the dependency relationship database. If the matching succeeds, the to-be-processed corpus is marked as a normal corpus; for example, it is marked as 1, indicating that the to-be-processed corpus is a normal corpus. Otherwise, that is, if the matching fails, the to-be-processed corpus is marked as an abnormal corpus; for example, it is marked as 0, indicating that the to-be-processed corpus is an abnormal corpus. A matching failure may mean that the front-back order relationship and/or the modification relationship of the corpus participles partially or completely fails to match the dependency relationships stored in the dependency relationship database.
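The marking rule can be sketched as follows. The layout of the dependency relationship database is not specified in the text, so a set of (first participle, second participle, relation label) triples is assumed here, and `mark_corpus`, `NORMAL` and `ABNORMAL` are hypothetical names. Treating a corpus with no relations at all as a failed match is also an assumption.

```python
from typing import Iterable, Set, Tuple

Relation = Tuple[str, str, str]  # (first participle, second participle, label)

NORMAL = 1    # mark used in the text for a normal corpus
ABNORMAL = 0  # mark used in the text for an abnormal corpus


def mark_corpus(relations: Iterable[Relation], database: Set[Relation]) -> int:
    """Mark a corpus 1 only if every relation is found in the database.

    A partial or complete mismatch counts as a matching failure.
    An empty relation list is treated as a failure (assumption).
    """
    relations = list(relations)
    if not relations:
        return ABNORMAL
    return NORMAL if all(rel in database for rel in relations) else ABNORMAL


if __name__ == "__main__":
    # Toy database built from the relations stated for fig. 6.
    db = {("he", "person", "ATT"), ("very", "good", "ADV"), ("person", "good", "SBV")}
    print(mark_corpus([("very", "good", "ADV"), ("person", "good", "SBV")], db))
    print(mark_corpus([("very", "good", "ADV"), ("person", "duck", "SBV")], db))
```

Requiring every relation to match reflects the statement that a partial mismatch already counts as a failure; a more tolerant threshold would be a different design choice.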

For example, assuming that the corpus to be processed is "human new-region roast duck", the relationships between "human" and "new region", between "human" and "roast duck", and between "new region" and "roast duck" cannot be matched to any dependency relationship stored in the dependency relationship database. That is, the corpus participles of the corpus to be processed have no mutual modification relationship and no dependency relationship, and the syntax is disordered, so the corpus to be processed can be marked as an abnormal corpus.
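The "human new-region roast duck" case can be illustrated by enumerating every pair of corpus participles and looking each pair up. In this sketch the database is assumed to map an ordered pair of participles to a relation label, and its contents are toy entries taken from the descriptions of figs. 5 and 7; `unmatched_pairs` is a hypothetical helper.

```python
from itertools import combinations
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]

# Toy database (assumption): ordered participle pair -> relation label.
TOY_DATABASE: Dict[Pair, str] = {
    ("Beijing", "person"): "ATT",
    ("eat", "roast duck"): "VOB",
    ("person", "like"): "SBV",
    ("new zone", "develop"): "ATT",
}


def unmatched_pairs(participles: Sequence[str], database: Dict[Pair, str]) -> List[Pair]:
    """Return the participle pairs, in sentence order, with no stored relation."""
    return [pair for pair in combinations(participles, 2) if pair not in database]


if __name__ == "__main__":
    sentence = ["human", "new region", "roast duck"]
    missing = unmatched_pairs(sentence, TOY_DATABASE)
    # None of the three pairs has a stored relation, which corresponds
    # to the situation described for this example.
    print(missing)
```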

Fig. 8 is a flowchart of a corpus processing method based on dependency relationship analysis according to a second embodiment of the present invention. In a specific example, as shown in fig. 8, in the training stage of the dependency relationship analysis model, word segmentation processing is performed on standard corpus sample data to obtain word segmentation sample data, part-of-speech tagging is then performed on the word segmentation sample data to obtain part-of-speech tagging result sample data, and the dependency relationship among the word segmentation samples is tagged according to the part-of-speech tagging result sample data. The word segmentation sample data, the part-of-speech tagging result sample data, and the dependency relationship among the word segmentation samples are then input in sequence into a preset machine learning model for training to obtain the dependency relationship analysis model. In the corpus processing stage, the corpus to be processed may be directly input into the dependency relationship analysis model for dependency relationship analysis, and it is determined whether the dependency relationship between any pair of corpus participles in the corpus to be processed fails to match the dependency relationships stored in the dependency relationship database. If the front-back sequence relation and/or the modification relation of each corpus participle is successfully matched with the dependency relationships stored in the dependency relationship database, the corpus to be processed is marked as a normal corpus; otherwise, it is marked as an abnormal corpus. Considering that the dependency relationship analysis model cannot achieve one hundred percent accuracy, a normal corpus may be marked as an abnormal corpus. Therefore, after all the corpora to be processed have been screened using the dependency relationship analysis model, the corpora marked as abnormal can be manually checked to ensure that normal corpora are not mistakenly marked as abnormal corpora.
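The screening step followed by a manual check can be sketched as below. The function names `screen_corpora` and `apply_manual_review`, and the stand-in marking function used in the demo, are assumptions for illustration; a real system would call the trained dependency relationship analysis model to produce the 1/0 mark.

```python
from typing import Callable, Iterable, List, Tuple


def screen_corpora(
    corpora: Iterable[str],
    mark: Callable[[str], int],
) -> Tuple[List[str], List[str]]:
    """Split corpora into (normal, review_queue) using a 1/0 marking function.

    Corpora marked 0 are not deleted immediately; they are queued for manual
    checking, since the model may mislabel a normal corpus as abnormal.
    """
    normal: List[str] = []
    review_queue: List[str] = []
    for corpus in corpora:
        (normal if mark(corpus) == 1 else review_queue).append(corpus)
    return normal, review_queue


def apply_manual_review(
    normal: List[str],
    review_queue: List[str],
    confirmed_normal: Iterable[str],
) -> Tuple[List[str], List[str]]:
    """Move corpora that a reviewer confirmed as normal back to the normal list."""
    confirmed = set(confirmed_normal)
    restored = [c for c in review_queue if c in confirmed]
    abnormal = [c for c in review_queue if c not in confirmed]
    return normal + restored, abnormal


if __name__ == "__main__":
    # Stand-in marker (assumption): flags one known abnormal string.
    toy_mark = lambda text: 0 if "new-region roast duck" in text else 1
    ok, queue = screen_corpora(
        ["Beijing people like to eat roast ducks", "human new-region roast duck"],
        toy_mark,
    )
    print(ok, queue)
```

Only the corpora marked 0 reach the reviewer, which is where the saving in manual effort comes from.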

Therefore, when the front-back order relationship and/or the modification relationship of the corpus participles is matched against the dependency relationships stored in the dependency relationship database, as long as the standard corpus sample data used for training is regular, grammatically correct corpus, the dependency relationship analysis model cannot give a dependency relationship analysis result for an abnormal corpus. The front-back order relationship and/or the modification relationship of the corpus participles of the abnormal corpus cannot be matched with the dependency relationships stored in the dependency relationship database, so the dependency relationship analysis model cannot give a specific dependency relationship type between those corpus participles.

In the embodiment of the invention, the linguistic data to be processed is input into the dependency relationship analysis model for dependency relationship analysis, and the abnormal linguistic data in the linguistic data to be processed is found out according to the analysis result of the dependency relationship analysis model, so that the problem that the existing dependency relationship analyzer cannot effectively identify the abnormal linguistic data is solved, the purpose of automatically screening the abnormal linguistic data is achieved, the labor cost and the time cost of linguistic data processing are reduced, and the accuracy and the analysis efficiency of the linguistic data dependency relationship analysis are improved.

Example three

Fig. 9 is a schematic diagram of a model training apparatus according to a third embodiment of the present invention, and as shown in fig. 9, the apparatus includes: a corpus sample data obtaining module 310 and a dependency relationship analysis training module 320, wherein:

a corpus sample data obtaining module 310, configured to obtain corpus sample data with dependency relationship among the word segmentation samples;

the dependency relationship analysis training module 320 is configured to input the corpus sample data into a preset machine learning model for dependency relationship analysis training to obtain a dependency relationship analysis model, where the dependency relationship analysis model is used for screening out abnormal corpora.

Optionally, the corpus sample data obtaining module includes: a sample data acquisition unit, used for acquiring standard corpus sample data; a sample data word segmentation processing unit, used for performing word segmentation processing on the standard corpus sample data to obtain word segmentation samples; a part-of-speech tagging unit, used for performing part-of-speech tagging processing on the word segmentation samples to obtain part-of-speech tagging result sample data; and a word segmentation sample dependency relationship determining unit, used for determining the dependency relationship among the word segmentation samples according to the part-of-speech tagging result sample data to obtain the corpus sample data.

Optionally, the word segmentation sample dependency relationship determining unit may be specifically configured to: determining the front-back sequence relation and/or the modification relation of each word segmentation sample according to the part of speech tagging result sample data; and determining the dependency relationship among the word segmentation samples according to the front-back sequence relationship and/or the modification relationship of the word segmentation samples.

Optionally, the apparatus further comprises: and the dependency relationship database establishing module is used for establishing a dependency relationship database according to the dependency relationship among the word segmentation samples.
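One way to picture the dependency relationship database establishing module is as a step that collects every labeled relation found in the standard corpus samples. The sketch below assumes each annotated sample is a list of (first sample, second sample, relation label) triples and that the database is a plain set; the text does not specify either, and `build_dependency_database` is a hypothetical name.

```python
from typing import Iterable, List, Set, Tuple

Relation = Tuple[str, str, str]  # (first sample, second sample, label)


def build_dependency_database(
    annotated_corpora: Iterable[List[Relation]],
) -> Set[Relation]:
    """Collect every labeled relation seen in the standard corpus samples."""
    database: Set[Relation] = set()
    for relations in annotated_corpora:
        database.update(relations)
    return database


if __name__ == "__main__":
    # Relations taken from the descriptions of figs. 7 and 6.
    samples = [
        [("Beijing", "person", "ATT"), ("eat", "roast duck", "VOB"), ("person", "like", "SBV")],
        [("he", "person", "ATT"), ("very", "good", "ADV"), ("person", "good", "SBV")],
    ]
    db = build_dependency_database(samples)
    print(len(db))
```

A set removes duplicate relations automatically and gives constant-time membership checks during the later matching step.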

Optionally, the dependency relationship analysis training module 320 is specifically configured to: input the word segmentation sample data into the preset machine learning model for word segmentation training; and input the part-of-speech tagging result sample data into the preset machine learning model for part-of-speech tagging training.

The model training device can execute the model training method provided by any embodiment of the invention, and has corresponding functional modules and beneficial effects of the execution method. For details of the technique not described in detail in this embodiment, reference may be made to the model training method provided in any embodiment of the present invention.

Since the above-described model training device is a device capable of executing the model training method in the embodiment of the present invention, based on the model training method described in the embodiment of the present invention, a person skilled in the art can understand the specific implementation of the model training device in the embodiment and various variations thereof, and therefore, how the model training device implements the model training method in the embodiment of the present invention is not described in detail herein. The scope of the present application is intended to cover any apparatus used by those skilled in the art to implement the method for training models in the embodiments of the present invention.

Example four

Fig. 10 is a schematic diagram of a corpus processing apparatus according to a fourth embodiment of the present invention, and as shown in fig. 10, the apparatus includes: a to-be-processed corpus obtaining module 410, a dependency relationship analyzing module 420, and an abnormal corpus determining module 430, wherein:

a to-be-processed corpus obtaining module 410, configured to obtain a to-be-processed corpus, and input the to-be-processed corpus into a dependency relationship analysis model;

a dependency relationship analysis module 420, configured to perform dependency relationship analysis on the corpus to be processed through the dependency relationship analysis model;

an abnormal corpus determining module 430, configured to determine an abnormal corpus from the to-be-processed corpus according to an analysis result of the dependency relationship analysis model;

the dependency relationship analysis model is a model obtained by training the model training method in any embodiment.
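The division of work among modules 410, 420 and 430 can be sketched as three pluggable callables wired into one object. The class `CorpusProcessingApparatus` and its field names are hypothetical, and the stand-in functions in the demo replace the trained model purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Relation = Tuple[str, str, str]  # (first participle, second participle, label)


@dataclass
class CorpusProcessingApparatus:
    """Wires together stand-ins for modules 410, 420 and 430 (hypothetical)."""
    obtain: Callable[[], List[str]]                # module 410: obtain corpora
    analyze: Callable[[str], List[Relation]]       # module 420: dependency analysis
    is_abnormal: Callable[[List[Relation]], bool]  # module 430: abnormal decision

    def run(self) -> List[str]:
        """Return the corpora that module 430 determines to be abnormal."""
        return [c for c in self.obtain() if self.is_abnormal(self.analyze(c))]


if __name__ == "__main__":
    # Stand-in analysis results (assumption): only one sentence is known.
    known = {"Beijing people like to eat roast ducks": [("person", "like", "SBV")]}
    apparatus = CorpusProcessingApparatus(
        obtain=lambda: list(known) + ["human new-region roast duck"],
        analyze=lambda text: known.get(text, []),
        is_abnormal=lambda relations: not relations,
    )
    print(apparatus.run())
```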

Optionally, the dependency relationship analysis module includes: a corpus participle processing unit, used for performing word segmentation processing on the corpus to be processed through the dependency relationship analysis model to obtain corpus participles; a part-of-speech tagging processing unit, used for performing part-of-speech tagging processing on the corpus participles to obtain a corpus part-of-speech tagging result; and a dependency relationship determining unit, used for determining the dependency relationship among the corpus participles according to the corpus part-of-speech tagging result.

Optionally, the dependency relationship determining unit is specifically configured to: determine the front-back sequence relation and/or the modification relation of each corpus participle according to the corpus part-of-speech tagging result. The abnormal corpus determining module 430 is specifically configured to: match the front-back sequence relation and/or the modification relation of each corpus participle with the dependency relationships stored in a dependency relationship database; if the matching is determined to be successful, mark the corpus to be processed as a normal corpus; otherwise, mark the corpus to be processed as an abnormal corpus.

The corpus processing device can execute the corpus processing method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the execution method. For technical details that are not described in detail in this embodiment, reference may be made to the corpus processing method provided in any embodiment of the present invention.

Since the above-described corpus processing apparatus is an apparatus capable of executing the corpus processing method in the embodiment of the present invention, based on the corpus processing method described in the embodiment of the present invention, those skilled in the art can understand the specific implementation manner of the corpus processing apparatus in the embodiment and various variations thereof, and therefore, how the corpus processing apparatus implements the corpus processing method in the embodiment of the present invention is not described in detail herein. The apparatus used by those skilled in the art to implement the corpus processing method in the embodiments of the present invention all fall within the scope of the present application.

Example five

Fig. 11 is a schematic structural diagram of a computer device according to a fifth embodiment of the present invention. FIG. 11 illustrates a block diagram of a computer device 512 suitable for use in implementing embodiments of the present invention. The computer device 512 shown in FIG. 11 is only an example and should not bring any limitations to the functionality or scope of use of embodiments of the present invention.

As shown in FIG. 11, computer device 512 is in the form of a general purpose computing device. Components of computer device 512 may include, but are not limited to: one or more processors 516, a storage device 528, and a bus 518 that couples the various system components including the storage device 528 and the processors 516.

Bus 518 represents one or more of any of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, such architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an enhanced ISA bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.

Computer device 512 typically includes a variety of computer system readable media. Such media can be any available media that is accessible by computer device 512 and includes both volatile and nonvolatile media, removable and non-removable media.

Storage device 528 may include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 530 and/or cache memory 532. The computer device 512 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 534 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 11, and commonly referred to as a "hard drive"). Although not shown in FIG. 11, a magnetic disk drive for reading from and writing to a removable, nonvolatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from or writing to a removable, nonvolatile optical disk (e.g., a Compact Disc Read-Only Memory (CD-ROM), a Digital Video Disc Read-Only Memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 518 through one or more data media interfaces. Storage device 528 may include at least one program product having a set (e.g., at least one) of program modules that are configured to carry out the functions of embodiments of the invention.

Program 536 having a set (at least one) of program modules 526 may be stored, for example, in storage 528, such program modules 526 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each of which examples or some combination may include an implementation of a network environment. Program modules 526 generally perform the functions and/or methodologies of the described embodiments of the invention.

Computer device 512 may also communicate with one or more external devices 514 (e.g., keyboard, pointing device, camera, display 524, etc.), with one or more devices that enable a user to interact with computer device 512, and/or with any devices (e.g., network card, modem, etc.) that enable computer device 512 to communicate with one or more other computing devices. Such communication may be through an Input/Output (I/O) interface 522. Further, computer device 512 may also communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet) via network adapter 520. As shown, the network adapter 520 communicates with the other modules of the computer device 512 via the bus 518. It should be appreciated that although not shown, other hardware and/or software modules may be used in conjunction with the computer device 512, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, Redundant Arrays of Independent Disks (RAID) systems, tape drives, and data backup storage systems, to name a few.

The processor 516 executes various functional applications and data processing by running programs stored in the storage device 528, for example, implementing the model training method provided by the above-described embodiment of the present invention: obtaining corpus sample data with dependency relationship among word segmentation samples; inputting the corpus sample data into a preset machine learning model for dependency relationship analysis training to obtain a dependency relationship analysis model, where the dependency relationship analysis model is used for screening out abnormal corpora. Or, implementing the corpus processing method provided by the above embodiment of the present invention: obtaining a corpus to be processed, and inputting the corpus to be processed into a dependency relationship analysis model; performing dependency relationship analysis on the corpus to be processed through the dependency relationship analysis model; and determining abnormal corpora from the corpora to be processed according to the analysis result of the dependency relationship analysis model, where the dependency relationship analysis model is a model obtained by training with the model training method in any embodiment.

Example six

An embodiment of the present invention further provides a computer storage medium storing a computer program which, when executed by a computer processor, is configured to perform the model training method according to any one of the above embodiments of the present invention: obtaining corpus sample data with dependency relationship among word segmentation samples; inputting the corpus sample data into a preset machine learning model for dependency relationship analysis training to obtain a dependency relationship analysis model, where the dependency relationship analysis model is used for screening out abnormal corpora. Or, to perform the corpus processing method according to any one of the above embodiments of the present invention: obtaining a corpus to be processed, and inputting the corpus to be processed into a dependency relationship analysis model; performing dependency relationship analysis on the corpus to be processed through the dependency relationship analysis model; and determining abnormal corpora from the corpora to be processed according to the analysis result of the dependency relationship analysis model, where the dependency relationship analysis model is obtained by training with the model training method according to any embodiment of the invention.

Computer storage media for embodiments of the invention may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of this document, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, Radio Frequency (RF), etc., or any suitable combination of the foregoing. Computer program code for carrying out operations for aspects of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A method of model training, comprising:

2. The method according to claim 1, wherein the obtaining corpus sample data with dependency relationship among word segmentation samples comprises:

acquiring standard corpus sample data;

performing word segmentation processing on the standard corpus sample data to obtain word segmentation sample data;

performing part-of-speech tagging processing on the word segmentation sample data to obtain part-of-speech tagging result sample data;

and determining the dependency relationship among the word segmentation samples in the word segmentation sample data according to the part of speech tagging result sample data to obtain the corpus sample data.

3. The method according to claim 2, wherein the determining the dependency relationship between the word segmentation samples in the word segmentation sample data according to the part of speech tagging result sample data includes:

determining the front-back sequence relation and/or the modification relation of each word segmentation sample according to the part of speech tagging result sample data;

and determining the dependency relationship among the word segmentation samples according to the front-back sequence relationship and/or the modification relationship of the word segmentation samples.

4. The method according to claim 3, wherein after determining the dependency relationship among the word segmentation samples according to the front-back sequence relation and/or the modification relation of the word segmentation samples, the method further comprises:

and establishing a dependency relationship database according to the dependency relationship among the word segmentation samples.

5. The method according to any one of claims 2-4, wherein before the inputting the corpus sample data into a preset machine learning model for dependency analysis training, the method further comprises:

inputting the word segmentation sample data into the preset machine learning model for word segmentation training;

and inputting the part-of-speech tagging result sample data into the preset machine learning model for part-of-speech tagging training.

6. A corpus processing method, comprising:

wherein, the dependency analysis model is a model obtained by training according to the model training method of any one of claims 1 to 5.

7. The method according to claim 6, wherein the performing dependency analysis on the corpus to be processed by the dependency analysis model includes:

performing word segmentation processing on the corpus to be processed through the dependency relationship analysis model to obtain corpus word segmentation;

performing part-of-speech tagging processing on the corpus participles to obtain a corpus part-of-speech tagging result;

and determining the dependency relationship among the corpus participles according to the corpus part-of-speech tagging result.

8. The method according to claim 7, wherein the determining the dependency relationship between the corpus participles according to the corpus part-of-speech tagging result comprises:

determining the front-back sequence relation and/or the modification relation of each corpus participle according to the corpus part-of-speech tagging result;

determining abnormal corpora from the corpora to be processed according to the analysis result of the dependency relationship analysis model, including:

matching the front-back sequence relation and/or the modification relation of each corpus participle with the dependency relation stored in a dependency relation database;

if the matching is determined to be successful, marking the linguistic data to be processed as normal linguistic data;

otherwise, marking the linguistic data to be processed as abnormal linguistic data.

9. A model training apparatus, comprising:

10. The apparatus according to claim 9, wherein the corpus sample data obtaining module comprises:

the sample data acquisition unit is used for acquiring standard corpus sample data;

the sample data word segmentation processing unit is used for carrying out word segmentation processing on the corpus sample data to obtain word segmentation samples;

the word segmentation sample data part-of-speech tagging unit is used for carrying out part-of-speech tagging processing on the word segmentation samples to obtain part-of-speech tagging result sample data;

and the word segmentation sample dependency relationship determining unit is used for determining the dependency relationship among the word segmentation samples according to the part of speech tagging result sample data to obtain the corpus sample data.

11. The apparatus of claim 10, wherein the word segmentation sample dependency relationship determining unit is configured to:

12. The apparatus of claim 11, further comprising:

and the dependency relationship database establishing module is used for establishing a dependency relationship database according to the dependency relationship among the word segmentation samples.
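
One way such a dependency relationship database might be established is sketched below. It assumes each word segmentation sample is already a list of (word, tag) pairs and stores only adjacent tag pairs as front-back sequence relations; the function name `build_dependency_database` and the tag names are hypothetical, and the claims do not prescribe a storage format.

```python
from typing import Iterable, List, Set, Tuple

# One tagged sample: a list of (word, part-of-speech tag) pairs (assumed format).
TaggedSample = List[Tuple[str, str]]


def build_dependency_database(samples: Iterable[TaggedSample]) -> Set[Tuple[str, str]]:
    """Collect every adjacent (left_tag, right_tag) pair seen in the samples."""
    database: Set[Tuple[str, str]] = set()
    for sample in samples:
        tags = [tag for _, tag in sample]
        # zip pairs each tag with its successor; single-word samples add nothing.
        database.update(zip(tags, tags[1:]))
    return database


samples = [
    [("nearest", "ADJ"), ("station", "NOUN")],
    [("go", "VERB"), ("home", "NOUN")],
]
database = build_dependency_database(samples)
```

Because the samples are standard corpora, any tag pair absent from the resulting set is a candidate signal of a disordered syntactic relation in a corpus to be processed.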

13. The apparatus according to any of claims 10-12, wherein the dependency analysis training module is further configured to:

14. A corpus processing apparatus, comprising:

15. The apparatus of claim 14, wherein the dependency analysis module comprises:

the corpus participle processing unit is used for carrying out participle processing on the corpus to be processed through the dependency relationship analysis model to obtain corpus participles;

the part-of-speech tagging processing unit is used for performing part-of-speech tagging processing on the corpus participles to obtain a corpus part-of-speech tagging result;

and the dependency relationship determining unit is used for determining the dependency relationship among the corpus participles according to the corpus part-of-speech tagging result.

16. The apparatus of claim 15, wherein the dependency relationship determining unit is further configured to:

17. A computer device, characterized in that the computer device comprises:

one or more processors;

storage means for storing one or more programs;

wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the model training method as claimed in any one of claims 1-5, or to implement the corpus processing method as claimed in any one of claims 6-8.