CN111325001B - Thesis identification method, thesis identification model training method, thesis identification device, thesis identification model training device, equipment and storage medium - Google Patents


Info

Publication number
CN111325001B
Authority
CN
China
Prior art keywords
training
thesis
preset
paragraph
determining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811528227.8A
Other languages
Chinese (zh)
Other versions
CN111325001A (en
Inventor
王怡然
陈巍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New Founder Holdings Development Co ltd
Beijing Founder Electronics Co Ltd
Original Assignee
Pku Founder Information Industry Group Co ltd
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Pku Founder Information Industry Group Co ltd, Peking University Founder Group Co Ltd, Beijing Founder Electronics Co Ltd filed Critical Pku Founder Information Industry Group Co ltd
Priority to CN201811528227.8A
Publication of CN111325001A
Application granted
Publication of CN111325001B
Active
Anticipated expiration

Landscapes

  • Character Discrimination (AREA)

Abstract

The disclosure provides a paper identification method, a training method for an identification model, corresponding devices and equipment, and a storage medium. The identification method comprises: obtaining a paper to be identified; and determining the paragraph identifiers corresponding to the paper to be identified according to a preset recognition model, wherein the preset recognition model is trained in advance on a paper training set. In this scheme, a model is trained on a paper training set whose paragraphs carry preset identifiers, so that the resulting preset recognition model can identify the identifier corresponding to each paragraph of a paper, which solves the prior-art problem that the format of a paper must be edited manually.

Description

Thesis identification method, thesis identification model training method, thesis identification device, thesis identification model training device, equipment and storage medium
Technical Field
The present disclosure relates to text processing technologies, and in particular, to a method, an apparatus, a device, and a storage medium for thesis identification and identification model training.
Background
Currently, many papers are published in an online or offline format, thereby enabling more users to view the contents of the papers. Before a paper is published, the paper needs to be compiled, formatted and published.
In the prior art, in order to unify the format of a paper and facilitate the identification of the content corresponding to each paragraph, an author writing the paper is required to write the paper in a specified format. However, this approach does not guarantee that all authors write a paper in this format, which requires editing the format of the paper before it is published.
Therefore, the prior-art publication process cannot be fully automated, and the format of the paper still has to be edited manually.
Disclosure of Invention
The disclosure provides a thesis identification method, a thesis identification model training method, a thesis identification device, an identification model training device, equipment and a storage medium, and aims to solve the problem that the format of a thesis needs to be edited manually before the thesis is published in the prior art.
A first aspect of the present disclosure is to provide a paper identification method, including:
acquiring a paper to be identified;
determining paragraph marks corresponding to the thesis to be recognized according to a preset recognition model;
wherein, the preset recognition model is obtained by training according to a thesis training set in advance.
A second aspect of the present disclosure provides a training method for a preset recognition model, including:
acquiring a thesis training set;
training a model according to the thesis training set to obtain a preset recognition model;
the paper training set comprises a plurality of training papers, and the paragraphs of the training papers are preset with marks.
A third aspect of the present disclosure is to provide a paper identification apparatus, including:
the acquisition module is used for acquiring the thesis to be identified;
the determining module is used for determining paragraph identifiers corresponding to the papers to be recognized according to a preset recognition model;
wherein, the preset recognition model is obtained by training according to a thesis training set in advance.
A fourth aspect of the present disclosure is to provide a training apparatus for presetting a recognition model, including:
the acquisition module is used for acquiring a thesis training set;
the training module is used for training a model according to the thesis training set to obtain a preset recognition model;
the paper training set comprises a plurality of training papers, and the paragraphs of the training papers are preset with marks.
A fifth aspect of the present disclosure is to provide a paper identification apparatus, including:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the paper identification method as described in the above first aspect.
A sixth aspect of the present disclosure is to provide a training apparatus for presetting a recognition model, including:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of training a pre-set recognition model as described in the second aspect above.
A seventh aspect of the present disclosure is to provide a computer-readable storage medium having stored thereon a computer program for execution by a processor to implement the paper identification method as described in the first aspect above.
An eighth aspect of the present disclosure is to provide a computer-readable storage medium, on which a computer program is stored, the computer program being executed by a processor to implement the training method of the preset recognition model as described in the above second aspect.
The paper identification method, identification model training method, devices, equipment, and storage medium provided by the disclosure have the following technical effects:
The identification method comprises obtaining a paper to be identified, and determining the paragraph identifiers corresponding to the paper to be identified according to a preset recognition model, wherein the preset recognition model is trained in advance on a paper training set. In this scheme, a model is trained on a paper training set whose paragraphs carry preset identifiers, so that the resulting preset recognition model can identify the identifier corresponding to each paragraph of a paper, which solves the prior-art problem that the format of a paper must be edited manually.
Drawings
FIG. 1 is a flow diagram illustrating a paper identification method in accordance with an exemplary embodiment of the present invention;
FIG. 2 is a flowchart illustrating a paper identification method according to another exemplary embodiment of the present invention;
FIG. 3 is a flowchart illustrating a method for training a predetermined recognition model according to an exemplary embodiment of the present invention;
FIG. 4 is a flowchart illustrating a method for training a predetermined recognition model according to another exemplary embodiment of the present invention;
FIG. 5 is a block diagram illustrating a paper identification apparatus in accordance with an exemplary embodiment of the present invention;
FIG. 6 is a block diagram illustrating a paper identification apparatus according to another exemplary embodiment of the present invention;
FIG. 7 is a block diagram illustrating a training apparatus for presetting a recognition model according to an exemplary embodiment of the present invention;
FIG. 8 is a block diagram of a training apparatus for pre-setting a recognition model according to another exemplary embodiment of the present invention;
FIG. 9 is a block diagram illustrating a paper identification device in accordance with an exemplary embodiment of the present invention;
FIG. 10 is a block diagram illustrating a training apparatus for presetting a recognition model according to an exemplary embodiment of the present invention.
Detailed Description
An author may write a paper in a predetermined format, for example first writing "title: XXX", then "author: XXX", and then "author unit: XXX". If the author writes the identifiers "title:", "author:", and "author unit:", these identifiers can be recognized directly in the paper, which determines the content corresponding to each paragraph and provides a basis for automatically editing and publishing the paper. However, some authors do not follow the preset format exactly when writing a paper, e.g., they do not write "title:" but write the content of the title directly, in which case the title cannot be accurately identified in the paper.
Based on this, embodiments of the present invention provide a paper identification method and a training method for a preset recognition model. The preset recognition model is trained on a paper training set and can identify the identifier corresponding to each paragraph of a paper from the content of that paragraph. Even if the author writes no identifiers in the paper, the identifier corresponding to each paragraph can still be identified by the model, which solves the prior-art problem that the format of a paper must be edited manually.
Fig. 1 is a flowchart illustrating a paper identification method according to an exemplary embodiment of the present invention.
As shown in fig. 1, the paper identification method provided in this embodiment includes:
Step 101, obtaining a paper to be identified.
The method provided by the embodiment can be executed by an electronic device with a computing function, such as a computer. The electronic device is used for identifying the identification of each paragraph in the paper.
Specifically, the preset recognition model is stored in the electronic device. The model may be trained by another device or by the electronic device that executes the method provided in this embodiment; this embodiment does not limit this.
Further, a database for storing the papers may be provided, and the database may be provided in the electronic device executing the method provided by the embodiment, or may be provided in other devices. For example, the database may be provided in a server connected to an electronic device that performs the method provided by the present embodiment. The user can upload the thesis through the user side, and the server can receive the thesis uploaded by the user and store the thesis in the database. The electronic device can access the database, and the database can also actively push the paper to the electronic device, so that the electronic device can acquire the paper to be identified. When the electronic device obtains the papers to be identified, the first-in first-out principle can be adopted, and the earliest paper uploaded by the user can be processed first.
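The first-in, first-out handling described above can be sketched as follows. This is only an illustration under stated assumptions: the class name `PaperQueue` and its methods are hypothetical, and an in-memory queue stands in for the database, which the text does not specify further.

```python
from collections import deque


class PaperQueue:
    """Hypothetical in-memory stand-in for the paper database."""

    def __init__(self):
        self._pending = deque()

    def upload(self, paper_text):
        # New uploads join the back of the queue.
        self._pending.append(paper_text)

    def next_paper(self):
        # The earliest upload is returned first; None when nothing is waiting.
        return self._pending.popleft() if self._pending else None


queue = PaperQueue()
queue.upload("paper uploaded first")
queue.upload("paper uploaded second")
first = queue.next_paper()
```

A `deque` is used because removing from the front of a plain list costs time proportional to the queue length, whereas `popleft` does not.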
In practical application, when a user uploads a paper through a user terminal, the paper can be directly received by the electronic device executing the method provided by the embodiment, so that the electronic device can acquire the paper to be identified and directly identify the uploaded paper.
Step 102, determining paragraph identifiers corresponding to the paper to be identified according to a preset recognition model.
The preset recognition model is obtained by training according to a thesis training set in advance.
Specifically, the electronic device executing the method provided by the embodiment stores a preset recognition model. The paper to be recognized can be input into a preset recognition model, and then the preset recognition model outputs a recognition result.
Furthermore, before training the preset recognition model, a large number of papers may be collected as a paper training set, and these papers may or may not have an identifier labeled by the author during the writing process. The labels corresponding to the paragraphs in these papers can be predetermined, and the contents of the papers and the labels corresponding to the paragraphs in the papers are input into the model, so as to train the weight values inside the model.
In practical application, the trained recognition model can be directly applied to output the identification of the paper to be recognized. Specifically, the identifier corresponding to each paragraph in the thesis to be recognized may be output, for example, the first paragraph is a title, the second paragraph is an author, and the third paragraph is an author unit.
If the author writes specific identifiers in the process of writing, the model can determine identifiers corresponding to the paragraphs according to the existing identifiers, and if the paragraphs have no identifiers, the identifiers corresponding to the paragraphs can be calculated according to weights in the model.
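The two cases above (an identifier written by the author versus a paragraph with no identifier) could be handled roughly as in the sketch below. It assumes that explicit identifiers appear as a prefix ending in a colon and places that check outside the model for clarity, whereas the text describes the model itself handling both cases. `KNOWN_IDENTIFIERS`, `explicit_identifier`, `identify_paragraphs`, and the `model_predict` callable are all hypothetical names.

```python
KNOWN_IDENTIFIERS = ("title", "author", "author unit")  # illustrative set


def explicit_identifier(paragraph):
    """Return the identifier written at the start of the paragraph, if any."""
    head, sep, _ = paragraph.partition(":")
    label = head.strip().lower()
    return label if sep and label in KNOWN_IDENTIFIERS else None


def identify_paragraphs(paragraphs, model_predict):
    """Label each paragraph, preferring an explicit identifier over the model."""
    labels = []
    for paragraph in paragraphs:
        label = explicit_identifier(paragraph)
        if label is None:
            # No identifier was written, so fall back to the trained model.
            label = model_predict(paragraph)
        labels.append(label)
    return labels


# A trivial placeholder stands in for the trained model here.
labels = identify_paragraphs(
    ["Title: A study", "This work examines paragraph labels."],
    lambda paragraph: "paper content",
)
```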
The method provided by this embodiment is used to identify a paper and is executed by a device on which the method is deployed; the device is generally implemented in hardware and/or software.
The paper identification method provided by the embodiment comprises the steps of obtaining a paper to be identified; determining paragraph marks corresponding to the thesis to be recognized according to a preset recognition model; the preset recognition model is obtained by training according to a thesis training set in advance. The method provided by the embodiment can identify the identifier corresponding to each paragraph in the thesis based on the preset identification model, so that the problem that the format of the thesis needs to be edited manually in the prior art is solved.
Fig. 2 is a flowchart illustrating a paper identification method according to another exemplary embodiment of the present invention.
As shown in fig. 2, the method for identifying a thesis provided in this embodiment includes:
Step 201, obtaining a paper to be identified.
The specific principle and implementation of step 201 are similar to those of step 101, and are not described herein again.
Step 202, performing word segmentation processing on the paragraphs included in the paper to be identified to obtain the word segmentation included in each paragraph.
In the method provided by this embodiment, the paper to be identified is not input directly into the preset recognition model. Instead, word segmentation is performed on the paper first, and the preset recognition model then determines the identifier corresponding to each paragraph based on the resulting segmented words.
A word segmentation algorithm can be preset, and each paragraph of the paper to be identified is segmented according to that algorithm. Chinese word segmentation refers to splitting a sequence of Chinese characters into individual words; a word segmentation algorithm recombines a continuous character sequence into a word sequence according to a given specification. A segmentation dictionary can also be set, so that the paragraphs of the paper are segmented according to the dictionary.
Specifically, the paper may include a plurality of paragraphs, and each paragraph is segmented to obtain its segmented words. During segmentation, modal particles and other words without actual meaning, such as "o", can be removed, so that paragraphs are identified only from words that carry meaning, which reduces the amount of computation during paragraph identification.
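As one possible illustration of dictionary-based segmentation followed by removal of meaningless words, the sketch below uses forward maximum matching. The text does not name a specific algorithm, so this choice is an assumption, as are the function name `segment`, the tiny dictionary, and the stop-word set.

```python
def segment(text, dictionary, stop_words=frozenset(), max_len=4):
    """Split text by forward maximum matching, then drop stop words."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return [word for word in words if word not in stop_words]


dictionary = {"论文", "识别", "方法"}  # illustrative segmentation dictionary
stop_words = {"的", "哦"}              # illustrative words without actual meaning
tokens = segment("论文的识别方法哦", dictionary, stop_words)
```

Forward maximum matching is greedy and can mis-segment ambiguous text; a production system would more likely rely on an established segmentation library.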
Step 203, determining the identifier corresponding to the paragraph according to the preset recognition model and the segmented words.
Further, in the method provided by this embodiment, the preset recognition model may determine the identifier corresponding to a paragraph based on the segmented words in that paragraph. For example, the segmented words of each paragraph may be input into the preset recognition model as one word combination, and the model determines the paragraph identifier corresponding to that combination, that is, the identifier of the paragraph to which the combination belongs.
In another embodiment, preset lexicons may also be provided. Each preset lexicon contains a plurality of words, and the words within one preset lexicon are regarded as belonging to the same class. For example, synonyms or near-synonyms can be grouped to form a preset lexicon.
The preset lexicon to which each segmented word belongs can be determined, and the frequency with which the segmented words of each paragraph fall into each preset lexicon can then be determined. A paragraph yields a plurality of segmented words, each with a corresponding preset lexicon, so the frequency with which the paragraph's segmented words belong to each preset lexicon can be counted. For example, if a paragraph includes 5 segmented words that belong to lexicons A, A, B, C, and C respectively, the preset lexicon frequencies for the paragraph are A: 0.4, B: 0.2, and C: 0.4.
Specifically, the feature vector corresponding to each paragraph is determined according to these frequencies. The frequencies can be used directly as the feature vector, such as (0.4, 0.2, 0.4). In general, the words of all preset lexicons will not all appear in the same paragraph. Therefore, to express the meaning of the paragraph more accurately through the feature vector, the frequency of a lexicon that does not appear may be set to 0. For example, if the preset lexicons also include a lexicon D but none of the paragraph's segmented words falls into lexicon D, the feature vector may be (0.4, 0.2, 0.4, 0).
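The frequency calculation in the example above can be reproduced with a short sketch. The function name `lexicon_frequency_vector` and the placeholder words `w1` to `w5` are hypothetical; the sketch also assumes that words belonging to no lexicon are simply ignored, a case the text does not address.

```python
from collections import Counter


def lexicon_frequency_vector(tokens, word_to_lexicon, lexicon_order):
    """Return the share of tokens that fall into each preset lexicon."""
    hits = Counter(word_to_lexicon[t] for t in tokens if t in word_to_lexicon)
    total = sum(hits.values())
    if total == 0:
        return [0.0] * len(lexicon_order)
    # A lexicon that never appears gets 0, as with lexicon D in the text.
    return [hits[name] / total for name in lexicon_order]


# Five words belonging to lexicons A, A, B, C, C; lexicon D is unused.
word_to_lexicon = {"w1": "A", "w2": "A", "w3": "B", "w4": "C", "w5": "C"}
vector = lexicon_frequency_vector(
    ["w1", "w2", "w3", "w4", "w5"], word_to_lexicon, ["A", "B", "C", "D"]
)
# vector == [0.4, 0.2, 0.4, 0.0]
```

Fixing the order of the lexicons keeps every paragraph's vector the same length, which the recognition model needs in order to apply one set of weights to all paragraphs.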
Further, using feature vectors in place of the full paragraphs of the paper reduces the amount of data the recognition model has to process.
In practical application, the identifier corresponding to a paragraph can be determined from its feature vector based on the preset recognition model. The feature vector is input into the preset recognition model, which performs a calculation on the feature vector using its internal weight values and outputs the identifier of the paragraph that the feature vector represents. For example, the output paragraph identifier may be "paper content".
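The text does not state what form the model takes. As a minimal sketch, assuming a linear model with one weight row per identifier, the calculation could be a dot product followed by picking the highest score. The function name `predict_identifier` and the weight values are made up for illustration.

```python
def predict_identifier(vector, weights):
    """Return the identifier whose weight row scores highest for the vector."""
    scores = {
        label: sum(w * x for w, x in zip(row, vector))
        for label, row in weights.items()
    }
    return max(scores, key=scores.get)


# Illustrative weights; a trained model would supply these values.
weights = {
    "title": [1.0, 0.0, 0.0, 0.0],
    "author": [0.0, 1.0, 0.0, 0.0],
    "paper content": [0.2, 0.1, 1.0, 0.5],
}
label = predict_identifier([0.4, 0.2, 0.4, 0.0], weights)
# label == "paper content"
```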
The method provided by the embodiment can perform word segmentation on the paper to be recognized, determine the feature vector corresponding to each paragraph based on the obtained word segmentation, and input the feature vector into the preset recognition model, so that the identifier corresponding to the paragraph is determined according to the preset recognition model, and the calculation amount of the recognition model can be reduced.
Fig. 3 is a flowchart illustrating a training method of a preset recognition model according to an exemplary embodiment of the present invention.
As shown in fig. 3, the training method of the preset recognition model provided in this embodiment includes:
Step 301, obtaining a paper training set.
The method provided by the embodiment can be executed by an electronic device with a computing function, such as a computer. The electronic equipment is used for training a preset recognition model.
Specifically, the electronic device may be only used for training a preset recognition model, and may also be used for recognizing a paper to be recognized, that is, the electronic device used for training may be the same as or different from the electronic device used for identifying the paper. The present embodiment does not limit this.
Further, a database for storing the training set of papers may be provided, and the database may be provided in the electronic device executing the method provided by the embodiment, or may be provided in other devices. For example, the database may be provided in a server connected to an electronic device that executes the method provided in the present embodiment. These papers can be obtained in advance, and paragraph identifiers corresponding to respective paragraphs in the papers can be determined.
Step 302, training the model according to the thesis training set to obtain a preset recognition model.
The paper training set comprises a plurality of training papers, and the paragraphs of the training papers are preset with marks.
Specifically, the model may be trained based on a paper set prepared in advance to obtain a preset recognition model.
Further, the weight values inside the model may be initialized. The papers in the paper set are then input into the model, which determines the identifier corresponding to each paragraph according to its current weights. The determined result is compared with the preset paragraph identifiers, and the weight values in the model are adjusted according to the comparison result.
In practical application, the papers can be input into the model one by one, with the weights in the model corrected once for each paper input. After many such adjustments, an accurate model for identifying papers is obtained.
A part of the papers in the paper training set can be used as a test set. After the model is trained, the papers in the test set are input into the model, the paragraph identifiers of all their paragraphs are determined, and these are compared with the predetermined identifiers. If the accuracy is higher than a preset threshold, the model is considered accurate and can be used to identify papers provided by users.
Specifically, the papers in the test set may also be used for training the model, or may be used for testing only. The preset threshold can be set as required.
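The test-set check can be sketched as below, assuming each paper is represented as a list of (paragraph, preset identifier) pairs and that the test papers are held out from training, which is only one of the two options the text allows. The function names, the 20% split, and the 0.9 threshold are illustrative assumptions.

```python
import random


def split_papers(papers, test_fraction=0.2, seed=0):
    """Shuffle labeled papers and hold out a fraction as the test set."""
    shuffled = list(papers)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def paragraph_accuracy(predict, test_papers):
    """Fraction of test paragraphs whose predicted identifier matches the preset one."""
    total = correct = 0
    for paper in test_papers:
        for paragraph, expected in paper:
            total += 1
            correct += predict(paragraph) == expected
    return correct / total if total else 0.0


def is_accurate_enough(predict, test_papers, threshold=0.9):
    # The model is accepted only when accuracy is higher than the threshold.
    return paragraph_accuracy(predict, test_papers) > threshold


papers = [[("A study", "title"), ("Body text", "paper content")] for _ in range(10)]
train_papers, test_papers = split_papers(papers)
```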
The method provided by this embodiment is used to train the preset recognition model and is executed by a device on which the method is deployed; the device is generally implemented in hardware and/or software.
The training method of the preset recognition model provided by this embodiment comprises obtaining a paper training set, and training a model according to the paper training set to obtain a preset recognition model; the paper training set comprises a plurality of training papers, and the paragraphs of the training papers carry preset identifiers. According to the method provided by this embodiment, a model is trained on a paper training set whose paragraphs carry preset identifiers, so that the resulting preset recognition model can identify the identifier corresponding to each paragraph of a paper, which solves the prior-art problem that the format of a paper must be edited manually.
Fig. 4 is a flowchart illustrating a training method of a predetermined recognition model according to another exemplary embodiment of the present invention.
As shown in fig. 4, the training method of the preset recognition model provided in this embodiment includes:
Step 401, obtaining a paper training set.
The specific principle and implementation of step 401 are similar to those of step 301, and are not described herein again.
Step 402, performing word segmentation processing on paragraphs included in the training paper to obtain training words included in each paragraph.
In the method provided in this embodiment, the training papers are not input directly into the model during training. Instead, word segmentation is performed on the training papers first, and the model is then trained on the resulting training segmented words.
A word segmentation algorithm can be preset, and each paragraph of the training paper is segmented according to that algorithm. Chinese word segmentation refers to splitting a sequence of Chinese characters into individual words; a word segmentation algorithm recombines a continuous character sequence into a word sequence according to a given specification. A segmentation dictionary can also be set, so that the paragraphs of the paper are segmented according to the dictionary.
Specifically, the training paper may include a plurality of paragraphs, and each paragraph is segmented to obtain its training segmented words. During segmentation, modal particles and other words without actual meaning, such as "o", can be removed, so that paragraphs are identified only from words that carry meaning, which reduces the amount of computation during paragraph identification.
Step 403, training the model according to the training segmented words to obtain a preset recognition model.
Further, in the method provided by this embodiment, the model may be trained based on the training segmented words in the paragraphs. For example, the weight values in the model may be initialized, and the training segmented words of each paragraph are then input into the model as one word combination. The model determines the paragraph identifier corresponding to the combination, that is, the identifier of the paragraph to which the combination belongs, according to its current weight values. The determined identifier is compared with the preset identifier of the paragraph, and the weight values in the model are adjusted according to the comparison result. After multiple rounds of adjustment, an accurate preset recognition model is obtained.
In another embodiment, preset lexicons may also be provided. Each preset lexicon contains a plurality of words, and the words within one preset lexicon are regarded as belonging to the same class. For example, synonyms or near-synonyms can be grouped to form a preset lexicon.
The preset lexicon to which each training segmented word belongs can be determined, and the frequency with which the training segmented words of each paragraph fall into each preset lexicon can then be determined. A paragraph yields a plurality of training segmented words, each with a corresponding preset lexicon, so the frequency with which the paragraph's training segmented words belong to each preset lexicon can be counted. For example, if a paragraph includes 5 training segmented words that belong to lexicons A, A, B, C, and C respectively, the preset lexicon frequencies for the paragraph are A: 0.4, B: 0.2, and C: 0.4.
Specifically, the feature vector corresponding to each paragraph is determined according to these frequencies. The frequencies can be used directly as the feature vector, such as (0.4, 0.2, 0.4). In general, the words of all preset lexicons are unlikely to all appear in the same paragraph. Therefore, to express the meaning of the paragraph more accurately through the feature vector, the frequency of a lexicon that does not appear may be set to 0. For example, if the preset lexicons also include a lexicon D but none of the paragraph's training segmented words falls into lexicon D, the feature vector may be (0.4, 0.2, 0.4, 0).
Further, by replacing the paragraphs of the whole paper with feature vectors, the data throughput of the model can be reduced.
In practical application, the preset recognition model can be obtained by training the model on the training feature vectors of the paragraphs and their preset identifiers. A feature vector is input into the model, which performs a calculation on it using its current internal weight values to determine the identifier of the paragraph that the feature vector represents. For example, the determined paragraph identifier may be "paper content". The determined identifier is then compared with the preset identifier, and the weight values in the model are adjusted according to the comparison result.
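The predict, compare, and adjust cycle described above matches the shape of a multi-class perceptron, which is used in the sketch below purely as a stand-in, since the text does not name a model or an update rule. The function name `train_perceptron`, the learning rate, the epoch count, and the toy samples are all assumptions.

```python
def train_perceptron(samples, labels, n_features, epochs=20, lr=1.0):
    """Multi-class perceptron: adjust weights whenever a prediction is wrong."""
    weights = {label: [0.0] * n_features for label in labels}

    def predict(vector):
        scores = {
            label: sum(w * x for w, x in zip(row, vector))
            for label, row in weights.items()
        }
        return max(scores, key=scores.get)

    for _ in range(epochs):
        for vector, expected in samples:
            guessed = predict(vector)
            if guessed != expected:
                # Move the correct row toward the vector, the guessed row away.
                for i, x in enumerate(vector):
                    weights[expected][i] += lr * x
                    weights[guessed][i] -= lr * x
    return weights, predict


samples = [
    ([1.0, 0.0, 0.0], "title"),
    ([0.0, 1.0, 0.0], "author"),
    ([0.0, 0.0, 1.0], "paper content"),
]
weights, predict = train_perceptron(
    samples, ["title", "author", "paper content"], n_features=3
)
```

Weights change only on a wrong prediction, which mirrors the comparison step in the text; other update rules, such as gradient descent on a loss function, would fit the description equally well.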
The method provided by this embodiment can perform word segmentation on the training paper, determine the feature vector corresponding to each paragraph based on the obtained training word segmentation, and then input the feature vector into the model, thereby determining the identifier corresponding to the paragraph according to the model, and then adjust the weight value in the model based on the preset identifier, so as to reduce the calculation amount in the model training process.
Fig. 5 is a block diagram illustrating a paper identification apparatus according to an exemplary embodiment of the present invention.
As shown in fig. 5, the paper identification apparatus provided in this embodiment includes:
an obtaining module 51, configured to obtain a paper to be identified;
a determining module 52, configured to determine, according to a preset recognition model, a paragraph identifier corresponding to the thesis to be recognized;
wherein, the preset recognition model is obtained by training according to a thesis training set in advance.
The paper identification apparatus provided by this embodiment comprises an acquisition module and a determining module. The acquisition module is used for acquiring a paper to be identified; the determining module is used for determining the paragraph identifiers corresponding to the paper to be identified according to a preset recognition model; the preset recognition model is trained in advance on a paper training set. The apparatus provided by this embodiment can identify the identifier corresponding to each paragraph of a paper based on the preset recognition model, which solves the prior-art problem that the format of a paper must be edited manually.
The specific principle and implementation of the paper identification apparatus provided in this embodiment are similar to those of the embodiment shown in fig. 1, and are not described herein again.
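As a rough illustration of how the two modules of fig. 5 could be composed in software, the following sketch wires an obtaining module to a determining module. The class names, the one-paragraph-per-line input convention, and the use of a plain callable as the recognition model are all assumptions made for the example, not details given in the source.

```python
from typing import Callable, List, Tuple


class ObtainingModule:
    """Counterpart of obtaining module 51 (hypothetical implementation)."""

    def obtain(self, raw_text: str) -> List[str]:
        # Assumption: the paper arrives as plain text, one paragraph per line.
        return [line.strip() for line in raw_text.split("\n") if line.strip()]


class DeterminingModule:
    """Counterpart of determining module 52 (hypothetical implementation)."""

    def __init__(self, model: Callable[[str], str]) -> None:
        # The preset recognition model is stood in for by any callable
        # that maps a paragraph to an identifier.
        self.model = model

    def determine(self, paragraphs: List[str]) -> List[str]:
        return [self.model(paragraph) for paragraph in paragraphs]


class PaperIdentificationApparatus:
    def __init__(self, obtaining: ObtainingModule, determining: DeterminingModule) -> None:
        self.obtaining = obtaining
        self.determining = determining

    def identify(self, raw_text: str) -> List[Tuple[str, str]]:
        """Return (paragraph, identifier) pairs for the whole paper."""
        paragraphs = self.obtaining.obtain(raw_text)
        return list(zip(paragraphs, self.determining.determine(paragraphs)))
```

Keeping the model behind a callable means the same apparatus code works whether the model is a trained classifier or a stub used in testing.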
Fig. 6 is a block diagram illustrating a paper identification apparatus according to another exemplary embodiment of the present invention.
As shown in fig. 6, on the basis of the embodiment shown in fig. 5, in the paper identification apparatus provided in this embodiment, the determining module 52 includes:
a word segmentation unit 521, configured to perform word segmentation processing on paragraphs included in the thesis to be identified, so as to obtain words included in each paragraph;
a determining unit 522, configured to determine, according to the preset recognition model and the word segmentation, an identifier corresponding to the paragraph.
The determining unit 522 is specifically configured to:
determining the preset word bank to which each segmented word belongs, and determining the frequency with which the segmented words included in each paragraph belong to each preset word bank;
determining a feature vector corresponding to each paragraph according to the frequency;
and determining the identifier corresponding to the paragraph according to the feature vector, based on the preset recognition model.
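The first two steps above (word-bank lookup and frequency-based feature vector) can be sketched as follows. The word banks, their contents, and the regex tokenizer are hypothetical: the source does not list the banks, and a real implementation for Chinese text would need a proper word segmenter instead of the simple pattern assumed here. Raw counts are used as the "frequency"; the source does not say whether counts are normalized.

```python
import re
from typing import Dict, List, Set

# Hypothetical preset word banks; the patent does not enumerate them.
WORD_BANKS: Dict[str, Set[str]] = {
    "abstract_bank": {"abstract", "summary", "propose"},
    "reference_bank": {"vol", "pp", "journal"},
    "body_bank": {"experiment", "result", "method"},
}


def segment(paragraph: str) -> List[str]:
    """Stand-in word segmentation: lowercase alphabetic tokens only."""
    return re.findall(r"[a-z]+", paragraph.lower())


def feature_vector(paragraph: str,
                   banks: Dict[str, Set[str]] = WORD_BANKS) -> List[int]:
    """One component per word bank: how many segmented words fall in that bank."""
    words = segment(paragraph)
    return [sum(1 for word in words if word in bank) for bank in banks.values()]
```

For instance, a paragraph such as "Journal of X, vol 3, pp 1-9" contributes only to the reference bank under these assumed banks, which is the kind of signal the recognition model can then map to an identifier.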
The specific principle and implementation of the paper identification apparatus provided in this embodiment are similar to those of the embodiment shown in fig. 2, and are not described here again.
Fig. 7 is a block diagram illustrating a training apparatus for presetting a recognition model according to an exemplary embodiment of the present invention.
As shown in fig. 7, the training apparatus for presetting a recognition model according to this embodiment includes:
an obtaining module 71, configured to obtain a thesis training set;
a training module 72, configured to train a model according to the thesis training set to obtain a preset recognition model;
the paper training set comprises a plurality of training papers, and the paragraphs of the training papers are preset with marks.
The training apparatus for the preset recognition model provided by this embodiment comprises an obtaining module and a training module, wherein the obtaining module is used for obtaining a paper training set; the training module is used for training a model according to the paper training set to obtain the preset recognition model; and the paper training set comprises a plurality of training papers whose paragraphs carry preset identifiers. By training a model on a paper training set whose paragraphs carry preset identifiers, the apparatus provided in this embodiment obtains a model for identifying papers, so that the identifier corresponding to each paragraph in a paper can be identified based on the preset recognition model, thereby solving the prior-art problem that the format of a paper needs to be edited manually.
The specific principle and implementation of the training apparatus for presetting a recognition model provided in this embodiment are similar to those of the embodiment shown in fig. 3, and are not described herein again.
Fig. 8 is a block diagram illustrating a training apparatus for presetting a recognition model according to another exemplary embodiment of the present invention.
As shown in fig. 8, on the basis of the embodiment shown in fig. 7, in the training apparatus for presetting a recognition model provided in this embodiment, the training module 72 includes:
a word segmentation unit 721, configured to perform word segmentation processing on paragraphs included in the training paper to obtain training words included in each paragraph;
and the training unit 722 is configured to train a model according to the training part words to obtain the preset recognition model.
The training unit 722 is specifically configured to:
determining the preset word bank to which each training word belongs, and determining the frequency with which the training words included in each paragraph belong to each preset word bank;
determining a training feature vector corresponding to each paragraph according to the frequency;
and obtaining the preset recognition model according to the training feature vector corresponding to each paragraph and a preset identification training model.
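To make the data layout for training concrete, the sketch below turns training papers, whose paragraphs are already segmented and carry preset identifiers, into (training feature vector, preset identifier) pairs. The word-to-bank mapping, the bank order, and the function names are hypothetical; the source does not describe how the training set is stored.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple

# Hypothetical mapping from a training word to its preset word bank.
WORD_TO_BANK: Dict[str, str] = {
    "abstract": "abstract_bank",
    "summary": "abstract_bank",
    "experiment": "body_bank",
    "result": "body_bank",
}
BANK_ORDER: List[str] = ["abstract_bank", "body_bank"]

LabeledParagraph = Tuple[Sequence[str], str]  # (training words, preset identifier)


def training_vector(words: Sequence[str]) -> List[int]:
    """Count how many training words of a paragraph fall in each word bank."""
    counts = Counter(WORD_TO_BANK[w] for w in words if w in WORD_TO_BANK)
    return [counts[bank] for bank in BANK_ORDER]


def build_training_pairs(
    papers: Sequence[Sequence[LabeledParagraph]],
) -> List[Tuple[List[int], str]]:
    """Flatten all training papers into (feature vector, identifier) pairs."""
    return [
        (training_vector(words), identifier)
        for paper in papers
        for words, identifier in paper
    ]
```

The resulting pairs are what a trainable model would consume; which model that is remains unspecified in the source.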
Fig. 9 is a block diagram illustrating a paper identification apparatus according to an exemplary embodiment of the present invention.
As shown in fig. 9, the paper identification device provided in this embodiment includes:
a memory 91;
a processor 92; and
a computer program;
wherein said computer program is stored in said memory 91 and configured to be executed by said processor 92 to implement any of the paper identification methods as described in fig. 1-2.
Fig. 10 is a block diagram illustrating a training apparatus for presetting a recognition model according to an exemplary embodiment of the present invention.
As shown in fig. 10, the training apparatus for presetting a recognition model provided in this embodiment includes:
a memory 1001;
a processor 1002; and
a computer program;
wherein the computer program is stored in the memory 1001 and configured to be executed by the processor 1002 to implement a training method of any one of the preset recognition models as described in fig. 3-4.
The present embodiments also provide a computer-readable storage medium, having stored thereon a computer program,
the computer program is executed by a processor to implement any of the paper identification methods described in fig. 1-2.
The present embodiments also provide a computer-readable storage medium, having stored thereon a computer program,
the computer program is executed by a processor to implement a training method of any one of the preset recognition models as described in fig. 3-4.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (8)

1. A method of paper identification, comprising:
acquiring a paper to be identified;
performing word segmentation processing on paragraphs included in the thesis to be identified to obtain words included in each paragraph;
determining the preset word bank to which each segmented word belongs, and determining the frequency with which the segmented words included in each paragraph belong to each preset word bank;
determining a feature vector corresponding to each paragraph according to the frequency;
determining the mark corresponding to the paragraph according to the feature vector based on a preset recognition model;
wherein, the preset recognition model is obtained by training according to a thesis training set in advance.
2. A training method for a preset recognition model is characterized by comprising the following steps:
acquiring a thesis training set;
the paper training set comprises a plurality of training papers, and the paragraphs of the training papers are preset with marks;
performing word segmentation processing on paragraphs included in the training paper to obtain training words included in each paragraph;
determining the preset word bank to which each training word belongs, and determining the frequency with which the training words included in each paragraph belong to each preset word bank;
determining a training feature vector corresponding to each paragraph according to the frequency;
and obtaining the preset recognition model according to the training feature vector corresponding to the paragraph and a preset identification training model.
3. A paper identification apparatus, comprising:
the acquisition module is used for acquiring the thesis to be identified;
the determining module is used for determining paragraph identifiers corresponding to the papers to be recognized according to a preset recognition model;
wherein, the preset recognition model is obtained by training according to a thesis training set in advance;
the determining module includes:
the word segmentation unit is used for performing word segmentation processing on the paragraphs included in the paper to be identified to obtain the words included in each paragraph;
the determining unit is used for determining the preset word bank to which each segmented word belongs and determining the frequency with which the segmented words included in each paragraph belong to each preset word bank; determining a feature vector corresponding to each paragraph according to the frequency; and determining the identifier corresponding to the paragraph according to the feature vector, based on the preset recognition model.
4. A training device for presetting a recognition model is characterized by comprising:
the acquisition module is used for acquiring a thesis training set;
the training module is used for training a model according to the thesis training set to obtain a preset recognition model;
the paper training set comprises a plurality of training papers, and the paragraphs of the training papers are preset with marks;
the training module comprises:
the word segmentation unit is used for performing word segmentation processing on the paragraphs included in the training paper to obtain training words included in each paragraph;
the training unit is used for determining the preset word bank to which each training word belongs and determining the frequency with which the training words included in each paragraph belong to each preset word bank; determining a training feature vector corresponding to each paragraph according to the frequency; and obtaining the preset recognition model according to the training feature vector corresponding to each paragraph and a preset identification training model.
5. A paper identification device, comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of claim 1.
6. A training device for presetting a recognition model is characterized by comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor to implement the method of claim 2.
7. A computer-readable storage medium, having stored thereon a computer program,
the computer program is executed by a processor to implement the method as claimed in claim 1.
8. A computer-readable storage medium, having stored thereon a computer program,
the computer program is executed by a processor to implement the method as claimed in claim 2.
CN201811528227.8A 2018-12-13 2018-12-13 Thesis identification method, thesis identification model training method, thesis identification device, thesis identification model training device, equipment and storage medium Active CN111325001B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811528227.8A CN111325001B (en) 2018-12-13 2018-12-13 Thesis identification method, thesis identification model training method, thesis identification device, thesis identification model training device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811528227.8A CN111325001B (en) 2018-12-13 2018-12-13 Thesis identification method, thesis identification model training method, thesis identification device, thesis identification model training device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111325001A (en) 2020-06-23
CN111325001B (en) 2022-06-17

Family

ID=71172295

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811528227.8A Active CN111325001B (en) 2018-12-13 2018-12-13 Thesis identification method, thesis identification model training method, thesis identification device, thesis identification model training device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111325001B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8145617B1 (en) * 2005-11-18 2012-03-27 Google Inc. Generation of document snippets based on queries and search results

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2436740A1 (en) * 2001-01-23 2002-08-01 Educational Testing Service Methods for automated essay analysis
CN106933795A (en) * 2015-12-30 2017-07-07 贺惠新 A kind of extraction method of the discussion main body of discussion type article
CN106886509B (en) * 2017-03-06 2019-12-27 大连理工大学 Automatic detection method for academic paper format

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
学术文本的结构功能识别――基于段落的识别;黄永等;《情报学报》;20160524(第05期);全文 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230609

Address after: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031

Patentee after: New founder holdings development Co.,Ltd.

Patentee after: BEIJING FOUNDER ELECTRONICS Co.,Ltd.

Address before: 100871, Beijing, Haidian District, Cheng Fu Road, No. 298, Zhongguancun Fangzheng building, 9 floor

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: PKU FOUNDER INFORMATION INDUSTRY GROUP CO.,LTD.

Patentee before: BEIJING FOUNDER ELECTRONICS Co.,Ltd.