CN107451295B

CN107451295B - Method for obtaining deep learning training data based on grammar network

Info

Publication number: CN107451295B
Application number: CN201710708706.7A
Authority: CN
Inventors: 张超; 周红; 刘楚雄
Original assignee: Sichuan Changhong Electric Co Ltd
Current assignee: Sichuan Changhong Electric Co Ltd
Priority date: 2017-08-17
Filing date: 2017-08-17
Publication date: 2020-06-30
Anticipated expiration: 2037-08-17
Also published as: CN107451295A

Abstract

The invention discloses a method for obtaining deep learning training data based on a grammar network. The method generates a large amount of language data by using a grammar network in reverse together with data crawled by a web crawler. First, a vertical-field crawler crawls and stores data that meets the requirements. Grammar network rule statements are then written as required, and the language data and the corresponding label data are obtained from these rule statements. A large amount of language data can be generated by extending the grammar network rule statements or by combining them with the crawled data, and the generated language data and its corresponding label data serve as the training input and output of a deep learning model, respectively. By using grammar network rules in reverse, the invention obtains a large amount of data that can be used directly for deep learning model training; the generated language data is fluent and plentiful, and the label statement of each sentence is obtained at the same time, which makes the method well suited to deep learning model training.

Description

Method for obtaining deep learning training data based on grammar network

Technical Field

The invention relates to the technical field of natural language processing, in particular to a method for acquiring deep learning training data based on a grammar network.

Background

With the rise of artificial intelligence, natural language processing has become an important direction in the field; it mainly studies theories and methods that allow people and computers to communicate through natural language. A neural network is a mathematical model that simulates the function and structure of the human nervous system, and it has made breakthrough progress in image recognition and speech recognition. Deep learning, which derives from research on artificial neural networks, is a machine learning method for learning representations of data and is also an important method in natural language processing. In recent years, deep learning has achieved many breakthroughs in processing English natural language, and neural networks based on deep learning are a main means of solving such problems. Deep learning needs a large amount of effective data from the required field to train models, so how to obtain accurate and effective data quickly has become a key to improving system performance and efficiency.

At present, deep learning is constrained by its training data. The training data of deep learning consists of two parts: input sentences and output label sentences. Both the quantity of training data and how to obtain the label sentences are difficult problems. Conventionally, input sentences and label sentences are either simply spliced together or written by hand; as a result, either the sentences are not fluent or there are too few of them, which restricts the popularization and application of deep learning. Grammar networks, as rules of conventional language processing, are normally used in the forward direction to do some simple language processing work so that machines can understand human language.

Disclosure of Invention

In view of the defects of the prior art, the invention aims to provide a method for obtaining deep learning training data based on a grammar network, which obtains a large amount of data that can be used directly for deep learning model training by using grammar network rules in reverse.

The purpose of the invention is realized by the following technical scheme:

A method for obtaining deep learning training data based on a grammar network comprises the following steps:

A. directionally crawling basic data of the required field by means of a web crawler, the basic data being obtained with a vertical-field distributed crawler;

B. writing grammar network rule statements for the basic data of step A;

C. combining the crawled basic data with the grammar network rule statements of step B, and generating output language data through a reverse grammar network program;

D. generating the label statement corresponding to each output statement according to the sub-rule names of the grammar network rule statement;

E. generating a large amount of language data by combining the grammar network rule statements with the crawled basic data, wherein the language data generated in step C and the corresponding label data of step D are used as the training input data and output data of the deep learning model, respectively.

Based on a web crawler and a reverse grammar network, the invention obtains a large number of input sentences and output sentences for deep learning training, as follows:

a. First, basic data of the required field is acquired with web crawler technology and used as a sub-rule of the reverse grammar rule. Taking the film and television field as an example:

1) A web crawler first acquires and stores film- and television-related information such as movie names and stars. The star data is defined as y_celebrity and used as a sub-rule of the grammar statement: y_celebrity is the name of the sub-rule, and the content of the sub-rule is a specific star name, such as "Liu De Hua".

2) Data crawled by the web crawler must be cleaned before use, because special symbols contained in the data would interfere with its use.
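The cleaning step in 2) could look like the following minimal Python sketch. The function names `clean_entry` and `clean_entries`, the choice of which characters count as special symbols, and the sample strings are assumptions made for illustration; the text does not specify which symbols are removed.

```python
import re
import unicodedata

# Assumption: anything that is not a word character or whitespace is a "special symbol".
_DISALLOWED = re.compile(r"[^\w\s]", re.UNICODE)


def clean_entry(raw: str) -> str:
    """Normalize one crawled string and strip special symbols."""
    text = unicodedata.normalize("NFKC", raw)  # fold full-width and compatibility forms
    text = _DISALLOWED.sub(" ", text)          # replace symbols with a space
    text = text.replace("_", " ")              # \w keeps "_", so treat it as a symbol too
    return " ".join(text.split())              # collapse runs of whitespace


def clean_entries(raw_entries):
    """Clean a list of crawled strings, dropping empties and duplicates (order kept)."""
    seen, cleaned = set(), []
    for raw in raw_entries:
        entry = clean_entry(raw)
        if entry and entry not in seen:
            seen.add(entry)
            cleaned.append(entry)
    return cleaned
```

Cleaned entries can then be used directly as the content of a sub-rule such as y_celebrity.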

b. Rule statements of the reverse grammar network are written as required. A rule statement of the reverse grammar network is made up of sub-rules, and each sub-rule consists of two parts: a rule name and rule content. Taking the film and television field as an example, a rule statement of the reverse grammar network is written as:

input = (n_prop)(n_input)(y_v)(y_celebrity)(n_d)(y_movie)

Here input is the rule name of the statement, and the items to the right of "=" are the sub-rules that make up the statement. The sub-rules are defined as follows, with the name of each sub-rule to the left of "=" and its content to the right:

n_prop = "I"

n_input = "want"

y_v = "see"

y_celebrity = "Liu De Hua"

n_d =

y_movie = "movie"

The content of the sub-rule y_celebrity is a star name grabbed by the web crawler, here "Liu De Hua". Running the reverse grammar network program outputs the statement "I want to see a Liu De Hua movie". At the same time, the program extracts the rule name of each sub-rule in the rule statement, and the label statement of the output statement is obtained from these sub-rule names; for example, the label of the current statement can be represented as "n v celebrity n movie". One piece of language data and its corresponding label data have thus been obtained through the reverse grammar network, and they can be used as the input and the output of deep learning, respectively.
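As an illustration of how such a rule statement might be processed, the following Python sketch parses the statement into its rule name and ordered sub-rule names and then concatenates the sub-rule contents. It is a sketch under assumptions, not the patented program: the function names `parse_rule` and `generate_sentence` are hypothetical, the contents are English glosses joined with spaces (a language written without spaces would concatenate directly), and n_d is left empty because its content is not given in the text.

```python
import re

RULE_PATTERN = re.compile(r"^\s*(\w+)\s*=\s*(.+)$")
SUB_RULE_PATTERN = re.compile(r"\((\w+)\)")


def parse_rule(statement: str):
    """Split 'name = (a)(b)...' into the rule name and the ordered sub-rule names."""
    match = RULE_PATTERN.match(statement)
    if match is None:
        raise ValueError(f"not a rule statement: {statement!r}")
    rule_name, body = match.groups()
    return rule_name, SUB_RULE_PATTERN.findall(body)


def generate_sentence(sub_rule_names, contents):
    """Join the content of each sub-rule in order, skipping empty contents."""
    words = [contents[name] for name in sub_rule_names]
    return " ".join(word for word in words if word)


rule = "input = (n_prop)(n_input)(y_v)(y_celebrity)(n_d)(y_movie)"
contents = {
    "n_prop": "I",
    "n_input": "want",
    "y_v": "see",
    "y_celebrity": "Liu De Hua",  # filled from crawled star data
    "n_d": "",                    # assumption: content not given in the text, left empty
    "y_movie": "movie",
}

rule_name, names = parse_rule(rule)
sentence = generate_sentence(names, contents)
print(rule_name, names)
print(sentence)
```

Keeping the sub-rule names as an ordered list is a deliberate choice here: the same list can later be reused to build the label statement.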

c. Combining the written grammar network rule statement with a single star name generates only one piece of data, but a large amount of data can be generated by extending the grammar rules. Take the grammar rule:

input = (n_prop)(n_input)(y_v)(y_celebrity)(n_d)(y_movie)

The content of y_celebrity can be replaced with other star data crawled by the web crawler to generate different language data, and the grammar rule statement itself can be extended to generate different sentences, which addresses the problem of insufficient deep learning data.

Compared with the prior art, the invention has the following advantages and beneficial effects:

By using grammar network rules in reverse, the invention obtains a large amount of data that can be used directly for deep learning model training. The generated language data is fluent and plentiful, and the label statement of each sentence is obtained at the same time, which makes the method well suited to deep learning model training.

Drawings

FIG. 1 is a schematic flow chart of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the following examples:

Examples

As shown in fig. 1, a method for obtaining deep learning training data based on a grammar network includes the following steps:

E. generating a large amount of language data by combining the grammar network rule statements with the crawled basic data, wherein the language data generated in step C and the corresponding label data of step D are used as the training input data and output data of the deep learning model, respectively (that is, the reverse grammar network generates the training data).

The method generates a large amount of language data from a reverse grammar network and crawled data. A vertical-field web crawler grabs and stores data that meets the requirements; grammar network rule statements are then written as required; the language data and the corresponding label data are obtained from the rule statements; and a large amount of language data is generated by extending the rule statements or by combining them with the crawled data. The generated language data and the corresponding label data serve as the training input and output of the deep learning model, respectively.

Generating training data with the reverse grammar network comprises the following steps: the web crawler crawls the required data; grammar network rule statements are written; the label statements of the rule statements are obtained; and a large amount of language data is generated by extending the rule statements or by combining them with the crawled data. The specific workflow is as follows:

a) Data of the required field is directionally crawled by a web crawler and stored. A web crawler is a program or script written according to certain rules that captures network information as required. Web crawlers are generally divided into vertical-field crawlers and horizontal-field crawlers; this application uses a vertical-field crawler to acquire data. A vertical-field distributed web crawler crawls network information on a given theme, so the crawled data matches the required theme with high accuracy and can be acquired quickly and in large quantities.
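The extraction part of such a crawler can be sketched in Python with the standard library. This covers only pulling on-theme entries out of an already fetched page; fetching, scheduling, and distribution are omitted. The page layout (list items with class "celebrity"), the class name `CelebrityNameParser`, the function `extract_names`, and the sample page are all hypothetical.

```python
from html.parser import HTMLParser


class CelebrityNameParser(HTMLParser):
    """Collect text inside <li class="celebrity"> elements (hypothetical page layout)."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.names = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs
        if tag == "li" and ("class", "celebrity") in attrs:
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "li":
            self._inside = False

    def handle_data(self, data):
        if self._inside and data.strip():
            self.names.append(data.strip())


def extract_names(html: str):
    """Return the on-theme entries found in one page."""
    parser = CelebrityNameParser()
    parser.feed(html)
    parser.close()
    return parser.names


# Stand-in for a page fetched by the crawler; no network access is performed here.
SAMPLE_PAGE = """
<ul>
  <li class="celebrity">Liu De Hua</li>
  <li class="ad">Sponsored link</li>
  <li class="celebrity">Example Star</li>
</ul>
"""

print(extract_names(SAMPLE_PAGE))
```

Filtering on a theme-specific marker is what keeps off-theme items, such as the advertisement entry in the sample, out of the stored data.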

b) Grammar network rule statements are written as required, for example:

input = ["check"]["see"]"train ticket"

The rule then generates statements such as "check the train ticket" and "see the train ticket". Different grammar network rules can thus be written for different requirements, and the language data corresponding to each rule is produced by using the rule statement in reverse.

c) The corresponding label statement is obtained from the grammar network rule statement as follows:

input = [check][view]database

The grammar network rule is composed of several sub-rules. Here check, view, and database are the names of sub-rules, whose contents are as follows:

check = "inquiry"

view = "view"

database = "train ticket"

Running the reverse grammar network program then yields the output statement "check the train ticket". At the same time, the program extracts the names of the sub-rules to generate the corresponding label statement "check view database". The output statement and the label statement can then be used as the input and the output of deep learning, respectively, which addresses the problems of inaccurate deep learning data and of corresponding output data being hard to obtain.
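The pairing of an output statement with its label statement can be sketched as follows. This is a minimal Python illustration, assuming the rule has already been split into an ordered list of sub-rule names; the function name `generate_pair` is hypothetical, and the sub-rule contents are joined literally with spaces, so the printed statement is a word-for-word rendering rather than the fluent sentence quoted above.

```python
def generate_pair(sub_rule_names, contents):
    """Return (output statement, label statement) for one rule.

    The output statement joins the sub-rule contents; the label statement
    joins the sub-rule names, so both have one entry per sub-rule.
    """
    sentence = " ".join(contents[name] for name in sub_rule_names)
    label = " ".join(sub_rule_names)
    return sentence, label


# Rule: input = [check][view]database
sub_rule_names = ["check", "view", "database"]
contents = {"check": "inquiry", "view": "view", "database": "train ticket"}

sentence, label = generate_pair(sub_rule_names, contents)
print(sentence)  # model input
print(label)     # model output (label statement)
```

Because the label is derived from the same ordered name list as the sentence, the two stay aligned without any manual annotation.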

d) A large amount of language data can be generated by extending the grammar network rules or by combining them with the crawled data, as follows:

Take the grammar rule input = ["look"]database, where database can be combined with the crawled data, for example database = "train ticket", database = "bus ticket", database = "airplane ticket", and so on. This generates a large amount of the required data, such as "look at the train ticket", "look at the bus ticket", and "look at the airplane ticket", which addresses the problem of insufficient deep learning data.
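The expansion described in d) amounts to enumerating every combination of sub-rule contents. The Python sketch below does this with `itertools.product`; the function name `expand_rule` is hypothetical, and treating the quoted literal "look" as a sub-rule named look (so that it contributes a label entry) is an assumption, since the text does not say how literals are labeled.

```python
from itertools import product


def expand_rule(sub_rule_names, candidates):
    """Yield (sentence, label) for every combination of sub-rule contents.

    `candidates` maps each sub-rule name to a list of possible contents,
    e.g. fixed words written by hand or entries collected by the crawler.
    """
    label = " ".join(sub_rule_names)
    options = [candidates[name] for name in sub_rule_names]
    for combo in product(*options):
        yield " ".join(combo), label


# Rule: input = ["look"]database, with database filled from crawled data.
sub_rule_names = ["look", "database"]
candidates = {
    "look": ["look at"],
    "database": ["train ticket", "bus ticket", "airplane ticket"],
}

pairs = list(expand_rule(sub_rule_names, candidates))
for sentence, label in pairs:
    print(f"{sentence}\t{label}")
```

The number of generated pairs is the product of the candidate counts per sub-rule, so adding crawled entries to one sub-rule, or alternatives to several, multiplies the amount of training data.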

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims

1. A method for obtaining deep learning training data based on a grammar network, characterized in that the method comprises the following steps:

C. combining the crawled basic data with the grammar network rule statements of step B, and generating output language data through a reverse grammar network program, wherein the rule statements of the reverse grammar network are composed of sub-rules, and each sub-rule comprises two parts, a rule name and rule content;