CN115757727A

CN115757727A - Grammar-point-based retrieval method and device, and text-center retrieval platform

Info

Publication number: CN115757727A
Application number: CN202211439319.5A
Authority: CN
Inventors: 杨麟儿; 朱君辉; 朱琳; 刘鑫; 杨尔弘
Original assignee: BEIJING LANGUAGE AND CULTURE UNIVERSITY
Current assignee: BEIJING LANGUAGE AND CULTURE UNIVERSITY
Priority date: 2022-11-17
Filing date: 2022-11-17
Publication date: 2023-03-07

Abstract

The invention relates to the technical field of language processing, and in particular to a method and a device for retrieval based on grammar points and a text-center retrieval platform. The method comprises the following steps: acquiring an original corpus file, preprocessing the original corpus file, and labeling difficulty levels of the preprocessed original corpus file to obtain a labeled corpus; uploading the labeled corpus to the text-center retrieval platform and creating a corresponding index; determining the rules of a retrieval language and an initial retrieval formula; establishing grammar points, and determining, according to the rules, the retrieval formulas of the different retrieval types corresponding to the grammar points; and sending a retrieval request to the text-center retrieval platform according to the retrieval formula corresponding to a grammar point, and determining the result corresponding to the grammar point. The invention can improve retrieval accuracy.

Description

Grammar-point-based retrieval method and device, and text-center retrieval platform

Technical Field

The invention relates to the technical field of language processing, and in particular to a method and a device for retrieval based on grammar points and a text-center retrieval platform.

Background

A corpus is a comprehensive language resource that collects various types of language data, and it plays an important role in research on the language itself and in applied fields (such as language teaching, textbook compilation and lexicography). As language data accumulate and corpus technology advances, corpora of various types and scales have been built at home and abroad for different research purposes, together with various corpus retrieval platforms and tools, which makes larger-scale retrieval and systematic language analysis possible for linguistics-related research.

Corpus construction is the core foundation. A corpus system provides detailed linguistic evidence for linguistic research, while the way the corpus is processed and the functionality of the system's retrieval tools limit how the corpus can be applied in research. As the saying goes, a craftsman who wishes to do his work well must first sharpen his tools. Building the corpus well and designing the corpus retrieval mode well are prerequisites for carrying out corpus-based research.

A survey of current domestic Chinese corpus resources shows the following shortcomings: retrieval generally stays at the surface form of a sentence and constrains results by matching keywords, words and parts of speech, with little attention to the deep syntactic structure of the sentence, so it struggles with relatively complex retrieval requirements involving syntactic components, dependency collocations and the like; and the retrieval mode is single, so that comprehensiveness and user-friendliness of the retrieval function are hard to achieve at the same time. Overall, existing Chinese corpora do not match retrieval requirements that are becoming more refined, intelligent and simplified, which hinders corpus-based language research and the deepening of related work.

Disclosure of Invention

The embodiment of the invention provides a method and a device for retrieval based on grammar points and a text-center retrieval platform. The technical scheme is as follows:

In one aspect, a method for retrieval based on grammar points is provided, where the method is implemented by an electronic terminal, and the method includes:

S1, obtaining an original corpus file, preprocessing the original corpus file, and labeling difficulty levels of the preprocessed original corpus file to obtain a labeled corpus;

S2, uploading the labeled corpus to a text-center retrieval platform, and creating a corresponding index;

S3, acquiring a retrieval language, and determining an initial retrieval formula corresponding to the retrieval language according to rules of the retrieval language;

S4, acquiring pre-established grammar points, and determining, according to the rules, specific retrieval formulas of different retrieval types corresponding to the grammar points;

and S5, sending a retrieval request to the text-center retrieval platform according to the specific retrieval formula corresponding to a grammar point, and determining a result corresponding to the grammar point.

Optionally, the preprocessing the original corpus file includes:

performing word segmentation, part-of-speech tagging, named entity recognition and dependency syntactic analysis on the original corpus file.

Optionally, the construction modules of the initial retrieval formula comprise: character items, part-of-speech tag items, named entity items, dependency items, word difficulty items and complex items.

Optionally, the different retrieval types include ordinary retrieval and pattern retrieval.

Optionally, the ordinary retrieval includes basic retrieval, dependency retrieval and capture.

In another aspect, a text-center retrieval platform is provided, which comprises a VUE front-end module, a Tornado back-end module, a corpus labeling module and an Odinson back-end module; wherein:

the VUE front-end module is used for user interaction;

the Tornado back-end module is used for receiving a front-end user request, processing the request, and sending a retrieval request to the Odinson back-end module to obtain a retrieval result;

the corpus labeling module is used for labeling corpora;

the Odinson back-end module is used for providing the retrieval service;

the Odinson back-end module comprises an index construction submodule, a retrieval field setting submodule, a parent query submodule and a retrieval service submodule, wherein:

the index construction submodule is used for running retrieval backend services;

the retrieval field setting submodule is used for setting fields including raw, word, tag, lemma, entity and dependencies;

the parent query submodule is used for retrieving corpora of a specified category;

and the retrieval service submodule is used for providing the retrieval service to the Tornado back-end module.

In another aspect, an apparatus for retrieval based on grammar points is provided, the apparatus is applied to the method for retrieval based on grammar points, and the apparatus includes:

a labeling module, used for acquiring an original corpus file, preprocessing the original corpus file, and labeling difficulty levels of the preprocessed original corpus file to obtain a labeled corpus;

a creating module, used for uploading the labeled corpus to a text-center retrieval platform and creating a corresponding index;

a determining module, used for acquiring a retrieval language and determining an initial retrieval formula corresponding to the retrieval language according to rules of the retrieval language;

an establishing module, used for obtaining pre-established grammar points and determining, according to the rules, specific retrieval formulas of different retrieval types corresponding to the grammar points;

and a retrieval module, used for sending a retrieval request to the text-center retrieval platform according to the specific retrieval formula corresponding to a grammar point and determining a result corresponding to the grammar point.

Optionally, the preprocessing the original corpus file includes:

performing word segmentation, part-of-speech tagging, named entity recognition and dependency syntactic analysis on the original corpus file.

Optionally, the construction modules of the initial retrieval formula comprise: character items, part-of-speech tag items, named entity items, dependency items, word difficulty items and complex items.

Optionally, the different retrieval types include ordinary retrieval and pattern retrieval, and the ordinary retrieval includes basic retrieval, dependency retrieval and capture.

In another aspect, an electronic device is provided, which includes a processor and a memory, where the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the above method for retrieval based on grammar points.

In another aspect, a computer-readable storage medium is provided, in which at least one instruction is stored, and the at least one instruction is loaded and executed by a processor to implement the above method for retrieval based on grammar points.

The technical scheme provided by the embodiment of the invention has at least the following beneficial effects:

Because retrieval is based on dependency syntax, relatively accurate results can be obtained even for relatively complex grammar points or grammar points whose syntactic components are far apart; the vocabulary difficulty level can be constrained during retrieval, so that teachers can take students' Chinese proficiency fully into account and retrieve example sentences suited to different students, which makes retrieval more targeted; and the capture function makes it easy for teachers to examine the variable components of a sentence and to inspect collocations and clusters among words. These functions greatly help teachers search for example sentences, improve the quality and efficiency of lesson preparation, and supply rich examples for compiling teaching materials or test papers.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic structural diagram of a text-center retrieval platform according to an embodiment of the present invention;

Fig. 2 is a flow chart of a method for retrieval based on grammar points according to an embodiment of the present invention;

Fig. 3 is a block diagram of an apparatus for retrieval based on grammar points according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of an electronic terminal according to an embodiment of the present invention.

Detailed Description

In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments.

An embodiment of the present invention provides a text-center retrieval platform. As shown in Fig. 1, the platform may include a VUE front-end module, a Tornado back-end module, a corpus labeling module and an Odinson back-end module. Each module is described below:

1. VUE front end module

In a feasible implementation, VUE is a progressive JavaScript framework for building user interfaces. In the embodiment of the invention, the VUE front-end module provides an interactive interface through which the user can conveniently enter the retrieval language and view the retrieval results.

2. Tornado back end module

In a feasible implementation, Tornado is an open-source Web server and Web application framework written in Python. It can provide WebSocket services, long-connection services, short-lived HTTP connection services, UDP services and the like, and is well suited to long polling, WebSocket and other applications that need to maintain a persistent connection with each user. In the embodiment of the invention, the Tornado back-end module is used for receiving a front-end user request, processing the request, and sending a retrieval request to the Odinson back-end module to obtain a retrieval result.

3. Corpus labeling module

In a feasible implementation, the corpus labeling module labels the corpus with the Stanford CoreNLP toolkit, which can perform word segmentation, sentence splitting, part-of-speech tagging and dependency syntax labeling on the corpus. The labeled corpus is organized in JSON format, with fields such as raw, word, tag, lemma, entity and dependencies.
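The labeled record for one sentence might be organized roughly as in the following Python sketch. Only the six field names come from the text; the flat per-token lists, the (head, dependent, label) edge layout, the tag values and the helper name `make_record` are assumptions made for illustration, not the format actually produced by the labeling module.

```python
import json


def make_record(raw, words, tags, lemmas, entities, dependencies):
    """Bundle the annotations of one sentence into a JSON-ready dict.

    Assumed layout: one list per token-level field, and dependency edges
    given as (head_index, dependent_index, label) triples.
    """
    n = len(words)
    if not all(len(layer) == n for layer in (tags, lemmas, entities)):
        raise ValueError("every token-level layer needs one value per word")
    for head, dep, _label in dependencies:
        if not (0 <= head < n and 0 <= dep < n):
            raise ValueError("dependency edge points outside the sentence")
    return {
        "raw": raw,
        "word": words,
        "tag": tags,
        "lemma": lemmas,
        "entity": entities,
        "dependencies": dependencies,
    }


# Toy sentence with hand-written annotations (not real tool output).
record = make_record(
    raw="我喜欢北京",
    words=["我", "喜欢", "北京"],
    tags=["PN", "VV", "NR"],
    lemmas=["我", "喜欢", "北京"],
    entities=["O", "O", "GPE"],
    dependencies=[(1, 0, "nsubj"), (1, 2, "dobj")],
)
print(json.dumps(record, ensure_ascii=False))
```

Keeping every layer aligned by token position means a retrieval field such as tag or entity can be looked up with the same index as the word it describes.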

4. Odinson back-end module

In a feasible implementation, the Odinson back-end module is a search engine written in the Scala language and is used for providing the retrieval service. In addition to basic full-text search, it has the advantage of supporting graph search over dependency graphs. Its submodules include:

the index construction submodule, which is used for constructing, from the labeled corpus, an inverted index that can be searched by the retrieval engine;

the retrieval field setting submodule, in which the set fields comprise raw, word, tag, lemma, entity and dependencies;

the parent query submodule, which is used for retrieving corpora of a specified category.
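The idea of an inverted index over labeled fields can be illustrated with the minimal Python sketch below. It is a toy stand-in and not Odinson's actual index structure; the function names `build_inverted_index` and `lookup` and the two-sentence corpus are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(sentences, fields=("word", "tag")):
    """Map (field, value) to a list of (sentence_id, token_position) postings."""
    index = defaultdict(list)
    for sent_id, sent in enumerate(sentences):
        for field in fields:
            for pos, value in enumerate(sent[field]):
                index[(field, value)].append((sent_id, pos))
    return dict(index)


def lookup(index, field, value):
    """Return the postings for one field value, or an empty list."""
    return index.get((field, value), [])


# Toy labeled corpus with hand-written annotations.
corpus = [
    {"word": ["我", "喜欢", "北京"], "tag": ["PN", "VV", "NR"]},
    {"word": ["他", "喜欢", "书"], "tag": ["PN", "VV", "NN"]},
]
index = build_inverted_index(corpus)
print(lookup(index, "word", "喜欢"))
print(lookup(index, "tag", "NN"))
```

Because each posting records both the sentence and the token position, a query on one field (for example tag) can be intersected with a query on another field (for example word) at the same position.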

When the text-center retrieval platform is used, the corpus to be retrieved is first labeled by the corpus labeling module, and the Odinson back end constructs an inverted index from the labeled corpus. When a user uses the platform at a terminal, the user enters a retrieval expression composed in the retrieval language; the VUE front-end module receives the input and sends the user's retrieval statement to the Tornado back end. If the request is an ordinary retrieval, the Tornado back end forwards the retrieval statement directly to the Odinson back end. If the request is an advanced retrieval, the Tornado back end performs dependency analysis on the retrieval statement, extracts the dependency path that the user wants to retrieve, constructs a retrieval expression for Odinson, and then sends that expression to the Odinson back end. The Odinson back end executes the retrieval expression and returns the result to the Tornado back end, which converts it into a data format convenient for display by VUE and returns it to the VUE front end for presentation to the user.
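The routing between ordinary and advanced retrieval described above can be sketched in plain Python as follows. The sketch deliberately avoids real Tornado and Odinson calls: `handle_query`, the mode names and the two stand-in functions are hypothetical, and the back end and the dependency analysis are passed in as callables.

```python
def handle_query(query, mode, search_backend, build_dependency_query):
    """Route one user query and shape the result for the front end.

    mode == "ordinary": the query is forwarded unchanged.
    mode == "advanced": the query is first converted into a retrieval
    expression by the supplied dependency-analysis function.
    """
    if mode == "ordinary":
        expression = query
    elif mode == "advanced":
        expression = build_dependency_query(query)
    else:
        raise ValueError(f"unknown retrieval mode: {mode!r}")
    hits = search_backend(expression)
    return {"expression": expression, "total": len(hits), "sentences": hits}


# Stand-ins for the real components, used only to exercise the routing.
def fake_build_dependency_query(example_sentence):
    return f"<expression derived from: {example_sentence}>"


def fake_search_backend(expression):
    return [f"sentence matching {expression}"]


ordinary = handle_query("[tag=VV]", "ordinary",
                        fake_search_backend, fake_build_dependency_query)
advanced = handle_query("我喜欢北京", "advanced",
                        fake_search_backend, fake_build_dependency_query)
print(ordinary["expression"])
print(advanced["expression"])
```

Injecting the back end as a callable keeps the routing logic testable without a running search service.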

The embodiment of the invention provides a method for searching based on grammar points, which can be realized by an electronic terminal, wherein the electronic terminal can be a terminal or a server. As shown in fig. 2, the processing flow of the method for retrieving based on grammar points may include the following steps:

S1, obtaining an original corpus file, preprocessing the original corpus file, and labeling difficulty levels of the preprocessed original corpus file to obtain a labeled corpus.

The original corpus file may be a file in a plain text format, and the preprocessing operation may include, but is not limited to, performing operations such as word segmentation, part of speech tagging, named entity recognition, dependency parsing, and the like on the original corpus file.

In one possible embodiment, the original corpus files fall into two major categories: a newspaper and periodical corpus and a second-language textbook corpus. Their sizes are shown in Table 1 below.

TABLE 1

Corpus type	Number of characters	Number of sentences
Newspaper and periodical corpus	1,600,000,000	39,500,000
Second-language textbook corpus	5,376,330	243,071

After sentence splitting of the original corpus is completed, the sentences can be preprocessed with natural language processing tools. Word segmentation and part-of-speech tagging may follow the tagging scheme of the Penn Chinese Treebank (CTB) (Xue et al., 2005), named entity recognition may use Stanford NER (Finkel et al., 2005), and syntactic labeling may follow the Stanford dependency annotation specification (De Marneffe et al., 2010).

After preprocessing, difficulty levels are labeled on the preprocessed original corpus files. The labeling can be done manually or with an existing labeling model, which is not limited in the embodiment of the invention.
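One simple way to attach difficulty levels, assuming a graded vocabulary list is available, is a dictionary lookup per word. The Python sketch below shows that idea only; the lexicon contents, the level numbers and the rule that a sentence takes the highest level among its known words are all assumptions for illustration, and the embodiment leaves the actual labeling method open.

```python
def label_difficulty(words, lexicon):
    """Return (per-word levels, sentence level).

    Words missing from the lexicon get None. The sentence level is the
    highest known word level, or None if no word is in the lexicon.
    """
    levels = [lexicon.get(word) for word in words]
    known = [level for level in levels if level is not None]
    sentence_level = max(known) if known else None
    return levels, sentence_level


# Made-up graded vocabulary; the level numbers are illustrative only.
toy_lexicon = {"我": 1, "喜欢": 1, "北京": 1, "研究": 3}

levels, sentence_level = label_difficulty(
    ["我", "喜欢", "研究", "语料库"], toy_lexicon
)
print(levels, sentence_level)
```

Storing the per-word levels alongside the other token-level fields would let a retrieval formula constrain word difficulty in the same way it constrains part of speech.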

S2, uploading the labeled corpus to the retrieval platform, and creating a corresponding index.

In a possible implementation, the retrieval platform in the embodiment of the present invention is the pre-constructed text-center retrieval platform, and the design concept of the platform may be as follows:

(1) The system serves language teaching and research. On one hand, it can provide teachers with example sentences for reference, which addresses the difficulty of selecting example sentences; on the other hand, it can provide Chinese learners with example sentences for extended learning, and through the capture function learners can see the contexts, pragmatic uses, collocations and the like in which Chinese words most frequently appear.

(2) Dependency syntax information is fully utilized. The user can not only retrieve characters and words, but also constrain their parts of speech, named entity types and dependency relationships.

(3) A powerful retrieval function with a simple retrieval language. The platform defines a user-friendly yet powerful corpus retrieval language, with the aim of providing highly accurate retrieval results while keeping the retrieval language simple.

The text-center retrieval platform can specifically comprise the following retrieval modes:

(1) Ordinary retrieval, implemented through the platform's corpus retrieval formula, with support for regular expressions. It includes basic retrieval, dependency syntax retrieval and capture.

(2) Pattern retrieval, which does not require the user to know the details of the underlying grammatical representation: the query is made by providing an example sentence with simple marks.

It should be noted that the index may be created in a manner commonly used in the prior art, which is not described in detail in the embodiment of the present invention.

S3, acquiring the retrieval language, and determining the rules and the initial retrieval formula of the retrieval language.

In a possible implementation, the retrieval language may be a retrieval formula entered by the user. The input may include units such as the words to be retrieved, operators and quantifiers, which together form the retrieval language.

The initial retrieval formula may be built from six kinds of items: character items, part-of-speech tag items, named entity items, dependency items, word difficulty items and complex items. These items serve as additional retrieval conditions: the user selects one or more of them as required and combines them with the characters to be retrieved to form the initial retrieval formula.

Specifically, the initial retrieval formula is divided into two types, basic items and complex items, and its basic constituent units comprise character strings, part-of-speech tags, dependency tags, word difficulty tags, named entity names, operators and quantifiers. The rules of the retrieval language, that is, the construction forms of the retrieval formula together with examples, are listed in Table 2.

TABLE 2

To aid in understanding the rules, that is, the meaning and usage of the symbols in the above retrieval formulas, Table 3 explains the specific meaning of each symbol with retrieval examples.

TABLE 3

It should be noted that the initial retrieval formula can be constructed in either of two ways. The rules may be entered into the electronic device in advance, so that the user enters the text to be retrieved and the retrieval conditions and the electronic device automatically generates the initial retrieval formula; alternatively, a user who has mastered the rules may construct the initial retrieval formula from the rules, the characters to be retrieved and the retrieval conditions, and enter the constructed formula into the electronic device.
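The automatic generation of an initial retrieval formula from the user's text and conditions might look like the Python sketch below. The bracketed `field=value` form joined with `&` is an assumption modeled on Odinson-style token constraints; the rules actually defined in Table 2 are not reproduced in this text, and the field name `level` for word difficulty and the helper names are hypothetical.

```python
def token_constraint(word=None, tag=None, entity=None, level=None):
    """Build one bracketed token constraint from the selected conditions.

    With no condition selected the result is "[]", standing for any token.
    """
    conditions = (("word", word), ("tag", tag),
                  ("entity", entity), ("level", level))
    parts = [f"{field}={value}" for field, value in conditions
             if value is not None]
    return "[" + " & ".join(parts) + "]"


def sequence(*constraints):
    """Join token constraints into a formula matching adjacent tokens."""
    return " ".join(constraints)


# The character 把, then a level-1 noun, then any verb (toy conditions).
formula = sequence(
    token_constraint(word="把"),
    token_constraint(tag="NN", level="1"),
    token_constraint(tag="VV"),
)
print(formula)
```

A complex item would then correspond to combining several conditions inside one constraint, as in the second token above.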

S4, acquiring the pre-established grammar points, and determining, according to the rules, specific retrieval formulas of different retrieval types corresponding to the grammar points.

The different retrieval types comprise ordinary retrieval and pattern retrieval, and ordinary retrieval comprises basic retrieval, dependency retrieval and capture. In ordinary retrieval, the retrieval formula is composed of regular expressions, logical operators, dependency relationship operators and the like; in pattern retrieval, the retrieval formula is an example sentence in which the anchor points and the content to be captured are marked with symbols.

In a feasible implementation, the grammar points are Chinese teaching grammar points for international Chinese education, summarized from the grammar level outline of the Chinese Proficiency Grading Standards for International Chinese Language Education (2021). The grammar level outline is restructured to obtain grammar points that reflect the grammar outline system and are convenient for automatic extraction; in general, more than 500 grammar points can be constructed. Specific retrieval formulas for the Chinese grammar points are then written one by one according to the rules. The retrieval formula types and specific examples are listed in Table 4:
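For pattern retrieval, one plausible way to turn a marked example sentence into a retrieval expression is to find the dependency path between the anchor token and the token to be captured. The Python sketch below does this with a breadth-first search. It assumes the sentence has already been parsed into (head, dependent, label) edges, and it emits an Odinson-style expression in which `>label` follows an edge from head to dependent, `<label` follows it in reverse, and `(?<name> ...)` names a capture; this output syntax and all function names are assumptions, not the platform's documented behavior.

```python
from collections import deque


def dependency_path(edges, start, goal):
    """Shortest path between two tokens as a list of '>label' / '<label' steps."""
    neighbours = {}
    for head, dep, label in edges:
        neighbours.setdefault(head, []).append((dep, ">" + label))
        neighbours.setdefault(dep, []).append((head, "<" + label))
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, steps = queue.popleft()
        if node == goal:
            return steps
        for nxt, step in neighbours.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, steps + [step]))
    return None  # the two tokens are not connected


def pattern_from_example(words, edges, anchor, capture, capture_name="target"):
    """Build an expression: anchor word, dependency path, named capture."""
    steps = dependency_path(edges, anchor, capture)
    if steps is None:
        raise ValueError("anchor and capture tokens are not connected")
    parts = [f"[word={words[anchor]}]", *steps, f"(?<{capture_name}> [])"]
    return " ".join(parts)


# Toy parse of 我喜欢北京 with hand-written edges.
words = ["我", "喜欢", "北京"]
edges = [(1, 0, "nsubj"), (1, 2, "dobj")]
print(pattern_from_example(words, edges, anchor=1, capture=2))
print(pattern_from_example(words, edges, anchor=0, capture=2))
```

The capture slot is left as an unconstrained token, so the same expression matches other sentences in which a different word fills that dependency position.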

TABLE 4

S5, sending a retrieval request to the retrieval platform according to the specific retrieval formula corresponding to a grammar point, and determining a result corresponding to the grammar point.

In a feasible implementation, the pre-constructed grammar point retrieval formulas are read in, retrieval is invoked as a cloud service, and the retrieval requests corresponding to the grammar point retrieval formulas are sent to the retrieval platform one by one; the retrieval platform performs the retrieval and returns all matching results containing the corresponding grammar points.

Preferably, the matching results are output and displayed sentence by sentence, in units of natural sentences from the original corpus file (plain text format). In the query results, the retrieved item is displayed in bold; in pattern retrieval, the anchor word is displayed in bold. Each sentence is followed by source information such as the chapter name and date. In addition, a result download button is provided at the upper right corner of the result page: the user can specify the number of results to download (50 by default) and the file name, and clicking the button saves the query results to the local computer as a text file (txt).

It should be noted that, when the text-center retrieval platform is used for retrieval, the user can select the corpus type to be retrieved, and the retrieval results corresponding to a grammar point are then automatically extracted from that corpus according to the pre-constructed grammar points; if what the user wants to retrieve falls outside the pre-constructed grammar points, the user can construct a retrieval formula and then retrieve on the text-center retrieval platform, which is not limited by the invention.
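Sending the grammar point retrieval formulas one by one can be sketched as a simple loop, shown below in Python. The retrieval platform is replaced by a stand-in function that treats each formula as a plain substring, which is far simpler than the real retrieval language; the grammar point names, the toy corpus and the function names are hypothetical.

```python
def run_grammar_points(grammar_points, search):
    """Send each grammar point's formula to the platform, one request each.

    grammar_points maps a grammar point name to its retrieval formula;
    search is any callable that takes a formula and returns matches.
    """
    results = {}
    for name, formula in grammar_points.items():
        results[name] = search(formula)
    return results


toy_corpus = ["我把书放在桌子上", "他被老师批评了", "我喜欢北京"]


def toy_search(formula):
    # Stand-in for the retrieval platform: plain substring matching.
    return [sentence for sentence in toy_corpus if formula in sentence]


grammar_points = {"把-sentence": "把", "被-sentence": "被"}
results = run_grammar_points(grammar_points, toy_search)
for name, sentences in results.items():
    print(name, sentences)
```

In a deployment, `search` would wrap the request to the retrieval service, so the loop itself does not change when the back end does.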
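The display and download behavior can be illustrated with the Python sketch below, which marks the matched span, appends the source information and keeps at most 50 results by default. The `**` markers standing in for bold text, the tab-separated source suffix, the hit dictionary keys and the sample source string are all assumptions for illustration.

```python
def format_hit(sentence, start, end, source):
    """Mark the matched span with ** and append the source information."""
    marked = f"{sentence[:start]}**{sentence[start:end]}**{sentence[end:]}"
    return f"{marked}\t[{source}]"


def export_results(hits, limit=50):
    """Return the text-file content for the first `limit` hits, one per line."""
    return "\n".join(format_hit(**hit) for hit in hits[:limit])


# Toy hit; the source string is a made-up placeholder.
hits = [
    {"sentence": "我喜欢北京", "start": 1, "end": 3,
     "source": "sample chapter, 2020-01-01"},
]
text = export_results(hits)
print(text)
```

Writing `text` to a file named by the user would complete the download step; that part is omitted here to keep the sketch free of file-system side effects.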

In the embodiment of the invention, because retrieval is based on dependency syntax, relatively accurate results can be obtained even for relatively complex grammar points or grammar points whose syntactic components are far apart; the vocabulary difficulty level can be constrained during retrieval, so that teachers can take students' Chinese proficiency fully into account and retrieve example sentences suited to different students, which makes retrieval more targeted; and the capture function makes it easy for teachers to examine the variable components of a sentence and to inspect collocations and clusters among words. These functions greatly help teachers search for example sentences, improve the quality and efficiency of lesson preparation, and supply rich examples for compiling teaching materials or test papers.

Fig. 3 is a block diagram illustrating an apparatus for retrieval based on grammar points in accordance with an exemplary embodiment. Referring to Fig. 3, the apparatus includes:

a labeling module 310, configured to obtain an original corpus file, preprocess the original corpus file, and label difficulty levels of the preprocessed original corpus file to obtain a labeled corpus;

a creating module 320, configured to upload the labeled corpus to a text-center retrieval platform, and create a corresponding index;

a determining module 330, configured to obtain a retrieval language, and determine an initial retrieval formula corresponding to the retrieval language according to rules of the retrieval language;

an establishing module 340, configured to obtain pre-established grammar points, and determine, according to the rules, specific retrieval formulas of different retrieval types corresponding to the grammar points;

and a retrieval module 350, configured to send a retrieval request to the text-center retrieval platform according to the specific retrieval formula corresponding to a grammar point, and determine a result corresponding to the grammar point.

Optionally, the labeling module 310 is configured to:

In the embodiment of the invention, because retrieval is based on dependency syntax, relatively accurate results can be obtained even for relatively complex grammar points or grammar points whose syntactic components are far apart; the vocabulary difficulty level can be constrained during retrieval, so that teachers can take students' Chinese proficiency fully into account and retrieve example sentences suited to different students, which makes retrieval more targeted; and the capture function makes it easy for teachers to examine the variable components of a sentence and to inspect collocations and clusters among words. These functions greatly help teachers search for example sentences, improve the quality and efficiency of lesson preparation, and supply rich examples for compiling teaching materials or test papers.

Fig. 4 is a schematic structural diagram of an electronic terminal 400 according to an embodiment of the present invention. The electronic terminal 400 may vary considerably depending on its configuration or performance, and may include one or more processors (CPUs) 401 and one or more memories 402, where at least one instruction is stored in the memory 402, and the at least one instruction is loaded and executed by the processor 401 to implement the steps of the above method for retrieval based on grammar points.

In an exemplary embodiment, a computer-readable storage medium, such as a memory including instructions, is also provided, and the instructions are executable by a processor in a terminal to perform the above method for retrieval based on grammar points. For example, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.

It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims

1. A method for retrieval based on grammar points, the method comprising:

S4, obtaining pre-established grammar points, and determining, according to the rule, specific retrieval formulas of different retrieval types corresponding to the grammar points;

2. The method according to claim 1, wherein the preprocessing the original corpus file comprises:

3. The method of claim 1, wherein the construction modules of the initial retrieval formula comprise: character items, part-of-speech tag items, named entity items, dependency items, word difficulty items and complex items.

4. The method of claim 1, wherein the different retrieval types comprise ordinary retrieval and pattern retrieval.

5. The method of claim 4, wherein the ordinary retrieval includes basic retrieval, dependency retrieval and capture.

6. A text-center retrieval platform, characterized by comprising a VUE front-end module, a Tornado back-end module, a corpus labeling module and an Odinson back-end module; wherein:

the VUE front-end module is used for user interaction;

the corpus labeling module is used for labeling corpora;

the Odinson back-end module is used for providing the retrieval service,

7. An apparatus for retrieval based on grammar points, the apparatus comprising:

8. The apparatus according to claim 7, wherein the preprocessing the raw corpus file comprises:

9. The apparatus of claim 7, wherein the construction modules of the initial retrieval formula comprise: character items, part-of-speech tag items, named entity items, dependency items, word difficulty items and complex items.

10. The apparatus of claim 7, wherein the different retrieval types comprise ordinary retrieval and pattern retrieval, and wherein the ordinary retrieval comprises basic retrieval, dependency retrieval and capture.