CN107451295B - Method for obtaining deep learning training data based on grammar network - Google Patents

Method for obtaining deep learning training data based on grammar network

Info

Publication number
CN107451295B
Authority
CN
China
Prior art keywords
data
rule
grammar network
deep learning
grammar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710708706.7A
Other languages
Chinese (zh)
Other versions
CN107451295A (en)
Inventor
张超
周红
刘楚雄
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Changhong Electric Co Ltd
Original Assignee
Sichuan Changhong Electric Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Changhong Electric Co Ltd
Priority to CN201710708706.7A
Publication of CN107451295A
Application granted
Publication of CN107451295B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method for obtaining deep learning training data based on a grammar network. The method generates a large amount of language data by combining a reverse grammar network with data gathered by a crawler. First, a vertical-domain crawler crawls and stores data that meets the requirements. Grammar network rule statements are then written as required, and language data together with its corresponding label data is obtained from those rule statements. A large amount of language data can be generated by expanding the grammar network rule statements or by combining them with the crawled data, and the generated language data and its corresponding label data serve as the training input and output of a deep learning model, respectively. By using grammar network rules in reverse, the invention obtains a large amount of data that can be used directly for deep learning model training. The generated language data is fluent and plentiful, and the label sentence of each sentence is obtained at the same time, which makes the data well suited to deep learning model training.

Description

Method for obtaining deep learning training data based on grammar network
Technical Field
The invention relates to the technical field of natural language processing, in particular to a method for acquiring deep learning training data based on a grammar network.
Background
With the rise of artificial intelligence, natural language processing has become an important direction in the field. It mainly studies theories and methods that let people and computers communicate through natural language. A neural network is a mathematical model that simulates the function and structure of human neurons, and it has brought breakthroughs in image recognition and speech recognition. Deep learning grew out of artificial neural network research; it is a machine learning method for learning representations of data and is also an important method in natural language processing. In recent years deep learning has produced many breakthroughs in processing English natural language, and neural networks based on deep learning are a main means of solving such problems. Deep learning needs a large amount of valid data from the required field to train models, so obtaining accurate and valid data quickly has become key to improving system performance and efficiency.
At present, deep learning is constrained by its training data, which is a major limitation. Deep learning training data consists of two parts: input sentences and output label sentences. The amount of training data and the way label sentences are obtained are both difficult problems. Conventionally, input sentences and label sentences are either simply spliced together or written by hand, so the sentences are either not fluent or too few in number, which restricts the wider application of deep learning. A grammar network, as a set of rules from conventional language processing, has so far been used only in the forward direction, to let machines understand human language by performing simple language processing tasks.
Disclosure of Invention
Aiming at the defects in the prior art, the invention aims to provide a method for acquiring deep learning training data based on a grammar network, which is used for acquiring a large amount of data which can be directly used for deep learning model training by reversely using grammar network rules.
The purpose of the invention is realized by the following technical scheme:
A method for obtaining deep learning training data based on a grammar network comprises the following steps:
A. directionally crawling basic data of the required field by means of a web crawler, wherein the basic data is obtained with a vertical-domain distributed crawler;
B. writing grammar network rule statements for the basic data of step A;
C. combining the crawled basic data with the grammar network rule statements of step B, and generating output language data through a reverse grammar network program;
D. generating the label sentence corresponding to each output sentence according to the sub-rule names of the grammar network rule statement;
E. generating a large amount of language data by combining grammar network rule statements with the crawled basic data, wherein the language data generated in step C and the corresponding label data of step D are used as the training input data and output data of the deep learning model, respectively.
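Step E pairs each generated sentence with its label sentence as model input and output. The patent does not say how these pairs are stored; the sketch below assumes a JSON Lines layout with hypothetical field names `input` and `output`, purely for illustration.

```python
import json
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]  # (generated sentence, label sentence)


def to_jsonl(pairs: Iterable[Pair]) -> str:
    """Serialize pairs as JSON Lines: one training example per line.

    The field names "input" and "output" are an assumption of this sketch.
    """
    return "\n".join(
        json.dumps({"input": sentence, "output": label}, ensure_ascii=False)
        for sentence, label in pairs
    )


def from_jsonl(text: str) -> List[Pair]:
    """Read the pairs back in their original order."""
    return [
        (record["input"], record["output"])
        for record in map(json.loads, text.splitlines())
    ]
```

Keeping `ensure_ascii=False` leaves Chinese sentences readable in the stored file rather than escaping them.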
Based on a web crawler and a reverse grammar network, the invention obtains a large number of input sentences and output sentences for deep learning training, as follows:
a. First, basic data of the required field is acquired with web crawler technology and used as a sub-rule of the reverse grammar rules. Taking the film and television field as an example:
1) A web crawler first acquires and stores film and television information such as movie names and stars. The star data is defined as y_celebrity and used as a sub-rule of a grammar statement: y_celebrity is the name of the sub-rule, and its content is a specific star name, such as "Liu De Hua".
2) Data crawled by a web crawler must be cleaned before use, because special symbols in the data can interfere with its use.
b. Rule statements of the reverse grammar network are written as required. A rule statement of the reverse grammar network is composed of sub-rules, and each sub-rule has two parts: a rule name and rule content. Taking the film and television field as an example:
Consider the following rule statement of the reverse grammar network: input = (n_prop) (n_input) (y_v) (y_celebrity) (n_d) (y_movie). Here input is the rule name of the statement, and the items to the right of "=" are the sub-rules that make up the statement. In each sub-rule below, the name is to the left of "=" and the content is to the right:
n_prop = I
n_input = want
y_v = see
y_celebrity = Liu De Hua
n_d =
y_movie = movie
The content of the y_celebrity sub-rule is a star name grabbed by the web crawler, here "Liu De Hua". Running the reverse grammar network program outputs a sentence meaning "I want to see a Liu De Hua movie". At the same time, the program extracts the rule name of each sub-rule in the rule statement, and the label sentence of the output sentence is obtained from these sub-rule names; for example, the label of the current sentence can be represented as "n vcelebty n movie". One piece of language data and its corresponding label data have thus been obtained through the reverse grammar network, and they can be used as the input and the output of deep learning, respectively.
c. A written grammar network rule statement combined with a single movie star yields only one piece of data, but a large amount of data can be generated by expanding the grammar rules. Take the grammar rule statement:
input = (n_prop) (n_input) (y_v) (y_celebrity) (n_d) (y_movie). Its y_celebrity content can be replaced with other star data crawled by the web crawler to generate different language data, and the grammar rule statement itself can be expanded to generate different sentences, which solves the problem of insufficient deep learning data.
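The film and television example can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the patent's reverse grammar network program: the rule is modeled as an ordered list of sub-rule names, English glosses stand in for the original words, the label sentence is taken to be the sub-rule names themselves, and sub-rules with empty content are skipped so that words and label tokens stay aligned. The helper names `generate` and `expand`, and the placeholder star names in the usage note, are hypothetical.

```python
from typing import Dict, List, Tuple

# Rule statement: the ordered sub-rule names of "input".
INPUT_RULE: List[str] = ["n_prop", "n_input", "y_v", "y_celebrity", "n_d", "y_movie"]

# Sub-rule contents (English glosses; the content of n_d is not given in the text).
SUB_RULES: Dict[str, str] = {
    "n_prop": "I",
    "n_input": "want",
    "y_v": "see",
    "y_celebrity": "Liu De Hua",
    "n_d": "",
    "y_movie": "movie",
}


def generate(rule: List[str], sub_rules: Dict[str, str]) -> Tuple[str, str]:
    """Run the rule in reverse: emit one sentence and its label sentence."""
    used = [(name, sub_rules[name]) for name in rule if sub_rules[name]]
    sentence = " ".join(content for _, content in used)
    label = " ".join(name for name, _ in used)  # assumption: label = sub-rule names
    return sentence, label


def expand(rule: List[str], sub_rules: Dict[str, str],
           slot: str, values: List[str]) -> List[Tuple[str, str]]:
    """Produce one (sentence, label) pair per crawled value placed in `slot`."""
    return [generate(rule, {**sub_rules, slot: value}) for value in values]
```

Calling `expand(INPUT_RULE, SUB_RULES, "y_celebrity", ["Star A", "Star B"])` yields one pair per name while the label stays the same, which is how a crawled list of stars multiplies the training data.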
Compared with the prior art, the invention has the following advantages and beneficial effects:
By using grammar network rules in reverse, the invention obtains a large amount of data that can be used directly for deep learning model training. The language data is fluent and plentiful, and the label sentences of the sentences are obtained at the same time, which makes the data well suited to deep learning model training.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the following examples:
Examples
As shown in FIG. 1, a method for obtaining deep learning training data based on a grammar network includes the following steps:
A. directionally crawling basic data of the required field by means of a web crawler, wherein the basic data is obtained with a vertical-domain distributed crawler;
B. writing grammar network rule statements for the basic data of step A;
C. combining the crawled basic data with the grammar network rule statements of step B, and generating output language data through a reverse grammar network program;
D. generating the label sentence corresponding to each output sentence according to the sub-rule names of the grammar network rule statement;
E. generating a large amount of language data by combining grammar network rule statements with the crawled basic data, wherein the language data generated in step C and the corresponding label data of step D are used as the training input data and output data of the deep learning model, respectively (that is, the reverse grammar network generates the training data).
The method generates a large amount of language data by combining a reverse grammar network with crawled data. A vertical-domain web crawler first grabs and stores data that meets the requirements. Grammar network rule statements are then written as required, and language data and its corresponding label data are obtained from them. A large amount of language data can be generated by expanding the grammar network rule statements or by combining them with the crawled data, and the generated language data and its label data serve as the training input and output of the deep learning model, respectively.
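Before crawled data is combined with rule statements it has to be cleaned, since special symbols can interfere with its use. The patent does not define which symbols count as special, so the sketch below makes an assumption: anything that is neither a word character nor whitespace is treated as special, and duplicate entries are dropped. The function names are hypothetical.

```python
import re
import unicodedata
from typing import Iterable, List

# Assumption: "special symbols" means anything that is not a word character or whitespace.
_SPECIAL = re.compile(r"[^\w\s]")


def clean_entry(raw: str) -> str:
    """Normalize one crawled string and strip special symbols."""
    text = unicodedata.normalize("NFKC", raw)  # fold full-width forms and similar variants
    text = _SPECIAL.sub(" ", text)             # replace special symbols with a space
    return " ".join(text.split())              # collapse runs of whitespace


def clean_entries(raws: Iterable[str]) -> List[str]:
    """Clean every entry, dropping empty results and duplicates (first occurrence wins)."""
    seen = set()
    cleaned = []
    for raw in raws:
        entry = clean_entry(raw)
        if entry and entry not in seen:
            seen.add(entry)
            cleaned.append(entry)
    return cleaned
```

Because `\w` matches Unicode word characters for Python `str` patterns, Chinese names pass through unchanged while surrounding punctuation is removed.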
Generating training data with the reverse grammar network comprises the following steps: the web crawler crawls the required data; grammar network rule statements are written; the label sentences of the grammar rule statements are obtained; and a large amount of language data is generated by expanding the grammar rule statements or combining them with the crawled data. The specific workflow is as follows:
a) Directionally crawl data of the required field by means of a web crawler and store it. A web crawler is a program or script written according to certain rules that captures network information as required. Web crawlers are generally divided into vertical-domain crawlers and horizontal-domain crawlers, and this application uses a vertical-domain crawler to acquire data. A vertical-domain distributed web crawler crawls network information on a given topic, so the crawled data matches the required topic with high accuracy and can be acquired quickly and in large quantities.
b) Write grammar network rule statements as required, for example:
input = ["check"] ["see"] "train ticket"
The rule then generates four sentences, among them "check the train ticket" and "see the train ticket". Different grammar network rules can thus be written for different requirements, and the language data corresponding to each rule can be generated from its reverse grammar network rule statement.
c) Obtain the corresponding label sentence from the grammar network rule statement, for example:
input = [check] [view] database. The grammar network rule is composed of several sub-rules: check, view and database are the names of sub-rules, whose contents are as follows:
check = "check"
view = "view"
database = "train ticket"
Running the reverse grammar network program then yields the output sentence "check the train ticket". At the same time, the program extracts the names of the sub-rules to generate the corresponding label sentence "check view database". The output sentence and the label sentence can then be used as the input and the output of deep learning, respectively, which solves the problems that deep learning data is inaccurate and that the corresponding output data is hard to obtain.
d) Generate a large amount of language data by expanding the grammar network rules or combining them with crawled data, as follows:
Take the grammar rule input = ["look"] database, where database can be combined with crawled data, for example database = "train ticket", database = "bus ticket", database = "airplane ticket", and so on. This generates a large amount of the required data, such as "look at the train ticket", "look at the bus ticket" and "look at the airplane ticket", which solves the problem of insufficient deep learning data.
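Steps b) to d) can be combined in one sketch. It assumes that square brackets mark optional items, as in common grammar notations (the text does not state this outright), that each sub-rule may hold several alternative contents such as a crawled list, and that the label sentence is the sequence of sub-rule names actually used. The function name `expand_rule` and the data layout are hypothetical.

```python
from itertools import product
from typing import Dict, List, Optional, Sequence, Tuple

# A rule is a sequence of (sub_rule_name, is_optional) items.
Rule = Sequence[Tuple[str, bool]]


def expand_rule(rule: Rule, contents: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Enumerate every (sentence, label) pair the rule can produce."""
    choices: List[List[Optional[Tuple[str, str]]]] = []
    for name, optional in rule:
        options: List[Optional[Tuple[str, str]]] = [(name, c) for c in contents[name]]
        if optional:
            options.append(None)  # assumption: a bracketed item may be left out
        choices.append(options)

    pairs = []
    for combo in product(*choices):
        used = [item for item in combo if item is not None]
        sentence = " ".join(content for _, content in used)
        label = " ".join(name for name, _ in used)
        pairs.append((sentence, label))
    return pairs


# input = [check] [view] database, with database filled from crawled data.
RULE: Rule = [("check", True), ("view", True), ("database", False)]
CONTENTS = {
    "check": ["check"],
    "view": ["view"],
    "database": ["train ticket", "bus ticket", "airplane ticket"],
}
```

With two optional items and three crawled values for `database`, `expand_rule(RULE, CONTENTS)` returns 2 × 2 × 3 = 12 pairs, so each additional crawled value adds four more training examples for this rule.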
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.

Claims (1)

1. A method for obtaining deep learning training data based on a grammar network is characterized in that: the method comprises the following steps:
A. directionally crawling basic data of the required field by means of a web crawler, wherein the basic data is obtained with a vertical-domain distributed crawler;
B. writing grammar network rule statements for the basic data of step A;
C. combining the crawled basic data with the grammar network rule statements of step B, and generating output language data through a reverse grammar network program, wherein the rule statements of the reverse grammar network are composed of sub-rules, and each sub-rule comprises two parts: a rule name and rule content;
D. generating the label sentence corresponding to each output sentence according to the sub-rule names of the grammar network rule statement;
E. generating a large amount of language data by combining grammar network rule statements with the crawled basic data, wherein the language data generated in step C and the corresponding label data of step D are used as the training input data and output data of the deep learning model, respectively.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710708706.7A CN107451295B (en) 2017-08-17 2017-08-17 Method for obtaining deep learning training data based on grammar network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710708706.7A CN107451295B (en) 2017-08-17 2017-08-17 Method for obtaining deep learning training data based on grammar network

Publications (2)

Publication Number Publication Date
CN107451295A CN107451295A (en) 2017-12-08
CN107451295B CN107451295B (en) 2020-06-30

Family

ID=60492382

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710708706.7A Active CN107451295B (en) 2017-08-17 2017-08-17 Method for obtaining deep learning training data based on grammar network

Country Status (1)

Country Link
CN (1) CN107451295B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109766540B (en) * 2018-12-10 2022-05-03 平安科技(深圳)有限公司 General text information extraction method and device, computer equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105512285A (en) * 2015-12-07 2016-04-20 南京大学 Self-adaption web crawler method based on machine learning
CN105786782A (en) * 2016-03-25 2016-07-20 北京搜狗科技发展有限公司 Word vector training method and device
CN105894088A (en) * 2016-03-25 2016-08-24 苏州赫博特医疗信息科技有限公司 Medical information extraction system and method based on depth learning and distributed semantic features
CN106156056A (en) * 2015-03-27 2016-11-23 联想(北京)有限公司 Text pattern learning method and electronic equipment
CN106502979A (en) * 2016-09-20 2017-03-15 海信集团有限公司 Data processing method and device for natural language information

Also Published As

Publication number Publication date
CN107451295A (en) 2017-12-08

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant