CN114398897A - Recognition method and device based on text enhancement, electronic equipment and storage medium - Google Patents

Info

Publication number
CN114398897A
Authority
CN
China
Prior art keywords
text
enhancement
chinese
sentence
english
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210070886.1A
Other languages
Chinese (zh)
Inventor
王水桃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An International Smart City Technology Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd filed Critical Ping An International Smart City Technology Co Ltd
Priority to CN202210070886.1A priority Critical patent/CN114398897A/en
Publication of CN114398897A publication Critical patent/CN114398897A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/55 Rule-based translation

Abstract

The invention relates to the field of artificial intelligence and provides a recognition method based on text enhancement. A pre-constructed program sentence is first converted into an English sentence, and first enhancement processing is performed on the English sentence according to a preset English standardized processing rule to form an English sentence set. The English sentence set is translated into a Chinese sentence set through a preset translation plug-in, and second enhancement processing is performed on the Chinese sentence set according to a preset Chinese processing rule to form text samples. The text samples are spliced to form sample data, and the sample data is input into a preset model to be converged to obtain a text recognition model. The text to be recognized is then input into the text recognition model for text recognition processing to obtain named entity recognition information or text recognition information about the text to be recognized. In this way an effective sample set can be generated rapidly and in a targeted manner, which improves the richness of the text samples and, in turn, the accuracy of entity labeling.

Description

Recognition method and device based on text enhancement, electronic equipment and storage medium
Technical Field
The invention relates to the field of artificial intelligence, relates to a text enhancement method, and particularly relates to a recognition method and device based on text enhancement, an electronic device and a computer-readable storage medium.
Background
Named entity recognition is one of the most basic tasks in natural language processing and an important foundation for tasks such as reading comprehension, dialog systems and machine translation. Training a model that can perform named entity recognition requires a large amount of sample data, so data enhancement is needed first to generate a large number of text samples for model training.
In computer vision, data enhancement is a very common and simple way to enrich the samples: operations such as rotating, enlarging, mirroring or converting an image to grayscale do not change its semantics. In natural language processing, however, data enhancement must be performed very carefully; otherwise the semantics of the text are easily changed, which degrades model performance.
In existing natural language processing, solutions such as unconditional enhancement, conditional enhancement, semi-supervised learning and UDA have been proposed for the few-sample problem. When these solutions are applied in practice, however, the existing data sets are not domain-specific enough, and the generated samples still do not meet the needs of named entity recognition.
Therefore, there is a need for a text enhancement-based recognition method, device, electronic device and storage medium that can rapidly and in a targeted manner generate an effective sample set, so as to improve the richness of the text samples and further improve entity labeling accuracy.
Disclosure of Invention
The invention provides a text enhancement-based recognition method that can rapidly and in a targeted manner generate an effective sample set, so as to improve the richness of the text samples and further improve entity labeling accuracy. It addresses two problems. First, in natural language processing, data enhancement must be performed very carefully; otherwise the semantics of the text are easily changed, which degrades model performance. Second, although existing natural language processing offers solutions such as unconditional enhancement, conditional enhancement, semi-supervised learning and UDA for the few-sample problem, when these solutions are applied in practice the existing data sets are not domain-specific enough, and the generated samples still cannot meet the needs of named entity recognition.
In order to achieve the above object, the present invention provides a recognition method based on text enhancement, which includes:
converting a pre-constructed program sentence into an English sentence, and performing first enhancement processing on the English sentence according to a preset English standardized processing rule to form an English sentence set;
translating the English sentence set into a Chinese sentence set through a preset translation plug-in, and performing second enhancement processing on the Chinese sentence set according to a preset Chinese processing rule to form a text sample;
splicing the text samples to form sample data, and inputting the sample data into a preset model to be converged to obtain a text recognition model;
and inputting the text to be recognized into the text recognition model for text recognition processing so as to acquire named entity recognition information or text recognition information of the text to be recognized.
Optionally, the step of pre-constructing program statements comprises:
creating a data information table according to the service requirement; wherein the data information table comprises dimension information about each dimension of the business demand;
traversing the data information table to randomly select the dimension information according to a preset dimension number to form a dimension row or a dimension column;
aggregating the dimension information in the dimension rows and the dimension columns to form simple sentences;
and carrying out secondary selection on the data information table to obtain an additional table, and adding dimension information of the additional table to the simple sentence to form a program sentence.
Optionally, the converting the pre-constructed program statement into an english statement, and performing first enhancement processing on the english statement according to a preset english normalization processing rule to form an english statement set, includes:
carrying out sentence conversion on the program sentences to form English short sentences; the statement conversion comprises deleting program type words in the program statements, normalizing table names, column names, operator phrases and operator conversion;
carrying out standardization processing on the English short sentence to form an English sentence;
performing first enhancement processing on the English sentence based on a preset English rule to form an English sentence set; the first enhancement processing is to retain the English sentence and add a replacement English sentence; the replacement English sentence is a new English sentence formed by replacing words of the English sentence with synonyms and antonyms.
Optionally, the translating, by a preset translation plug-in, the english sentence set into a chinese sentence set, and performing second enhancement processing on the chinese sentence set according to a preset chinese processing rule to form a text sample, includes:
traversing and translating the English sentences in the English sentence set through a preset translation plug-in to form a Chinese sentence set;
performing second enhancement processing on the Chinese sentence set according to a preset Chinese processing rule to form a text sample; wherein the second enhancement processing comprises: chinese word replacement, Chinese word insertion and Chinese word deletion.
Optionally, the step of chinese word replacement includes:
performing word segmentation processing on the Chinese sentences in the Chinese sentence set to form Chinese word segments, and randomly replacing the Chinese word segments belonging to the data information table with any dimension information in the data information table, which is in the same column with the Chinese word segments, to form new Chinese sentences;
and adding the newly added Chinese sentences to the Chinese sentence set.
Optionally, the splicing the text samples to form sample data, and inputting the sample data into a preset model to be converged to obtain a text recognition model includes:
performing secondary word segmentation on the text sample to obtain word segments and their corresponding positions;
further segmenting the word segments to obtain phrases and their corresponding positions;
uploading the word segments, the phrases and their corresponding positions to a knowledge base to form knowledge data;
numbering the knowledge data to form sample data; retrieving the keyword meanings corresponding to the sample data, and performing extended splicing on the sample data based on the keyword meanings to form sample semantics;
inputting the sample data into the model to be converged so that the model obtains a prediction label according to the knowledge data, and calculating a loss according to the prediction label and the sample semantics;
and returning the loss to the model to be converged to adjust its parameters according to the loss until the model converges, thereby obtaining the text recognition model.
Optionally, the inputting the text to be recognized into the text recognition model for text recognition processing to obtain named entity recognition information or text recognition information about the text to be recognized includes:
inputting a text to be recognized into the text recognition model to perform text recognition processing so that the text to be recognized generates basic data through a basic model of the text recognition model;
enhancing the basic data through an enhancement layer of the text recognition model to form data enhancement information;
and acquiring entity naming information aiming at the text to be recognized based on the data enhancement information.
In order to solve the above problem, the present invention further provides a recognition apparatus based on text enhancement, the apparatus comprising:
the first enhancement unit is used for converting the pre-constructed program sentence into an English sentence and performing first enhancement processing on the English sentence according to a preset English standardized processing rule to form an English sentence set;
the second enhancement unit is used for translating the English sentence set into a Chinese sentence set through a preset translation plug-in and carrying out second enhancement processing on the Chinese sentence set according to a preset Chinese processing rule to form a text sample;
the model acquisition unit is used for splicing the text samples to form sample data, and inputting the sample data into a preset model to be converged to acquire a text recognition model;
and the information recognition unit is used for inputting the text to be recognized into the text recognition model to perform text recognition processing so as to acquire named entity recognition information or text recognition information of the text to be recognized.
In order to solve the above problem, the present invention also provides an electronic device, including:
a memory storing at least one instruction; and
a processor that executes the instructions stored in the memory to implement the steps of the text enhancement-based recognition method.
In order to solve the above problem, the present invention further provides a computer-readable storage medium, which stores at least one instruction, which is executed by a processor in an electronic device to implement the text enhancement based recognition method.
The embodiment of the invention first converts the pre-constructed program sentence into an English sentence and performs first enhancement processing on the English sentence according to the preset English standardized processing rule to form an English sentence set. It then translates the English sentence set into a Chinese sentence set through the preset translation plug-in and performs second enhancement processing on the Chinese sentence set according to the preset Chinese processing rule to form text samples. The text samples are spliced to form sample data, and the sample data is input into a preset model to be converged to obtain a text recognition model. Finally, the text to be recognized is input into the text recognition model for text recognition processing to obtain named entity recognition information or text recognition information of the text to be recognized. An effective sample set can therefore be generated rapidly and in a targeted manner, which improves the richness of the text samples and the accuracy of entity labeling.
Drawings
Fig. 1 is a schematic flowchart of a recognition method based on text enhancement according to an embodiment of the present invention;
Fig. 2 is a block diagram of a recognition apparatus based on text enhancement according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of an internal structure of an electronic device implementing a recognition method based on text enhancement according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
In natural language processing, data enhancement must be performed very carefully; otherwise the semantics of the text are easily changed, which degrades model performance. Existing natural language processing offers solutions such as unconditional enhancement, conditional enhancement, semi-supervised learning and UDA for the few-sample problem, but when these solutions are applied in practice the existing data sets are not domain-specific enough, and the generated samples still do not meet the needs of named entity recognition.
In order to solve the above problems, the present invention provides a recognition method based on text enhancement. Fig. 1 is a schematic flow chart of a recognition method based on text enhancement according to an embodiment of the present invention. The method may be performed by an apparatus, which may be implemented by software and/or hardware.
The embodiment of the application can acquire and process related data based on artificial intelligence technology. Artificial intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use that knowledge to obtain the best result.
The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and the like.
In this embodiment, the recognition method based on text enhancement includes:
S1: converting a pre-constructed program sentence into an English sentence, and performing first enhancement processing on the English sentence according to a preset English standardized processing rule to form an English sentence set;
S2: translating the English sentence set into a Chinese sentence set through a preset translation plug-in, and performing second enhancement processing on the Chinese sentence set according to a preset Chinese processing rule to form a text sample;
S3: splicing the text samples to form sample data, and inputting the sample data into a preset model to be converged to obtain a text recognition model;
S4: inputting the text to be recognized into the text recognition model for text recognition processing so as to acquire named entity recognition information or text recognition information of the text to be recognized.
In the embodiment shown in Fig. 1, step S1 converts a pre-constructed program sentence into an English sentence and performs first enhancement processing on the English sentence according to a preset English standardized processing rule to form an English sentence set.
The step of pre-constructing the program sentences comprises:
S111: creating a data information table according to the service requirement; wherein the data information table comprises dimension information about each dimension of the service requirement;
S112: traversing the data information table to randomly select the dimension information according to a preset dimension number to form a dimension row or a dimension column;
S113: aggregating the dimension information in the dimension rows and the dimension columns to form simple sentences;
S114: performing secondary selection on the data information table to obtain an additional table, and adding dimension information of the additional table to the simple sentence to form a program sentence.
Specifically, in step S111, the service requirement may be any requirement related to an enterprise service. This embodiment takes vehicle information as an example. A data information table is first created in which, for example, the first column is the name and contains vehicle models such as "Passat, CR-V, A4L and Haval H6"; the second column is the brand corresponding to the first column, such as "Volkswagen, Honda, Audi and Great Wall"; the third column is the price corresponding to the second column; the fourth column is the category corresponding to the third column; and the fifth column is an attribute.
Then, in step S112, the data information table is traversed to randomly select the dimension information according to the preset dimension number to form a dimension row or a dimension column; that is, one or several columns are randomly selected, for example: select name from car_table, which selects the car names from the table. Then, in step S113, the dimension information in the dimension rows and the dimension columns is aggregated to form a simple sentence; that is, the information of a certain column is aggregated, for example, select count(distinct brand) from car_table, which calculates how many car brands there are in the table. Data meeting a condition may then be selected, for example, select car_body from car_table where price > 40, which selects the car bodies of cars whose price is more than 40, and the data meeting the condition may then be aggregated, for example: select avg(price) from car_table where name = 'BMW', which calculates the average price of the cars named BMW. To give the sentences more variety, the data information table is selected a second time in step S114 to obtain an additional table, and dimension information of the additional table is added to the simple sentence to form a program sentence. In this way, by randomly selecting the data table, the data columns and the construction mode, the required number of SQL statements can be constructed as input for the subsequent data enhancement processing.
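The construction of program sentences in steps S111 to S114 can be illustrated with a minimal sketch. The table contents, the `build_statement` helper and the decision to model the secondary selection of step S114 as an added `where` condition are all assumptions made for illustration; the source does not specify how the additional table is combined with the simple sentence.

```python
import random

# Hypothetical data information table for the vehicle example; the column
# names and values are illustrative and not taken from the patent.
CAR_TABLE = {
    "name": ["Passat", "CR-V", "A4L", "Haval H6"],
    "brand": ["Volkswagen", "Honda", "Audi", "Great Wall"],
    "price": [18, 20, 32, 12],
}
NUMERIC_COLUMNS = {"price"}


def quote(value):
    """Render a cell value as an SQL literal."""
    return str(value) if isinstance(value, (int, float)) else f"'{value}'"


def build_statement(table_name, table, rng):
    """Build one random SQL statement: select, optionally aggregate, optionally filter."""
    columns = list(table)
    # S112: randomly pick one or more columns.
    picked = rng.sample(columns, rng.randint(1, len(columns)))
    # S113: optionally replace the selection with an aggregate over one column.
    kind = rng.choice(["plain", "count", "avg"])
    if kind == "count":
        target = f"count(distinct {picked[0]})"
    elif kind == "avg":
        target = f"avg({rng.choice(sorted(NUMERIC_COLUMNS))})"
    else:
        target = ", ".join(picked)
    statement = f"select {target} from {table_name}"
    # S114 (assumed form): a second selection adds a condition on another column.
    if rng.random() < 0.5:
        cond_col = rng.choice(columns)
        value = rng.choice(table[cond_col])
        op = rng.choice([">", "<", "="]) if cond_col in NUMERIC_COLUMNS else "="
        statement += f" where {cond_col} {op} {quote(value)}"
    return statement


def build_statements(n, seed=0):
    """Generate n statements reproducibly from a seed."""
    rng = random.Random(seed)
    return [build_statement("car_table", CAR_TABLE, rng) for _ in range(n)]


if __name__ == "__main__":
    for sql in build_statements(5):
        print(sql)
```

Seeding the generator keeps the sample set reproducible, which makes it easier to compare later enhancement steps on the same input.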
In step S1, converting the pre-constructed program sentence into an English sentence and performing first enhancement processing on the English sentence according to a preset English standardized processing rule to form an English sentence set includes:
S121: carrying out sentence conversion on the program sentences to form English short sentences; the sentence conversion comprises deleting program-type words in the program sentences, normalizing table names, column names and operator phrases, and converting operators;
S122: carrying out standardization processing on the English short sentence to form an English sentence;
S123: performing first enhancement processing on the English sentence based on a preset English rule to form an English sentence set; the first enhancement processing is to retain the English sentence and add a replacement English sentence; the replacement English sentence is a new English sentence formed by replacing words of the English sentence with synonyms and antonyms.
Specifically, step S121 performs sentence conversion on a program sentence to form an English short sentence. Because the SQL statement (program sentence) generated in step S114 is in a programming language, rules need to be formulated to convert it into natural language. The preset English standardized processing rule provided in an embodiment of the present invention is given below; it constructs a normalized dictionary that includes normalized table names, column names and operator phrases, as follows:
1) Normalization of table names
The data table is given a natural language name, for example: car_table -> cars
2) Operator phrases
The conditions of an SQL statement include equal to, less than, more than, less than or equal to, more than or equal to and not equal to, together with other keywords such as avg; these are normalized into natural language:
1. = -> has/have/is/are…
2. < -> less than…
3. > -> more than/greater than…
4. != -> not equals…
5. avg -> average
3) Deleting the select keyword of the SQL statement.
Then, in step S122, the English short sentence is standardized to form an English sentence.
For example, the dimension information is connected using conjunctions or symbols, from is replaced by for, and so on.
As another example, in the aggregation case, avg(price) is changed to average of price.
Conditional case: operators are replaced by their normalized words/phrases.
Equal/not equal: name = BMW is changed to name is BMW
Greater than/less than: price > 40 is changed to price is more than 40
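The normalization rules above can be sketched as a small dictionary-driven converter. This is only an illustration under stated assumptions: the regular expression handles simple single-condition statements, the phrases chosen for `>=` and `<=` are not given in the source, and the names `TABLE_NAMES`, `OPERATORS` and `sql_to_english` are hypothetical.

```python
import re

# Normalized dictionary following the rules listed above. The table-name
# mapping is the document's own example; the rest of the layout is assumed.
TABLE_NAMES = {"car_table": "cars"}
OPERATORS = {
    ">=": "is more than or equal to",  # assumed wording
    "<=": "is less than or equal to",  # assumed wording
    "!=": "not equals",
    "=": "is",
    ">": "is more than",
    "<": "is less than",
}
AGGREGATES = {"avg": "average"}

SQL_PATTERN = re.compile(
    r"select\s+(?P<target>.+?)\s+from\s+(?P<table>\w+)"
    r"(?:\s+where\s+(?P<col>\w+)\s*(?P<op>>=|<=|!=|=|>|<)\s*(?P<val>.+))?$",
    re.IGNORECASE,
)


def normalize_target(target):
    """Turn 'avg(price)' into 'average of price'; join plain columns with 'and'."""
    match = re.fullmatch(r"(\w+)\s*\(\s*(\w+)\s*\)", target.strip())
    if match and match.group(1).lower() in AGGREGATES:
        return f"{AGGREGATES[match.group(1).lower()]} of {match.group(2)}"
    columns = [c.strip() for c in target.split(",")]
    return " and ".join(columns)


def sql_to_english(sql):
    """Convert one simple SELECT statement into an English short sentence."""
    match = SQL_PATTERN.match(sql.strip())
    if match is None:
        raise ValueError(f"unsupported statement: {sql!r}")
    table = TABLE_NAMES.get(match.group("table"), match.group("table"))
    # The 'select' keyword is dropped and 'from' becomes 'for'.
    sentence = f"{normalize_target(match.group('target'))} for {table}"
    if match.group("col"):
        value = match.group("val").strip().strip("'\"")
        phrase = OPERATORS[match.group("op")]
        sentence += f" where {match.group('col')} {phrase} {value}"
    return sentence


if __name__ == "__main__":
    print(sql_to_english("select avg(price) from car_table where name = 'BMW'"))
```

Multi-character operators are listed before single-character ones in the pattern so that `>=` is not read as `>` followed by a stray `=`.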
Then, in step S123, first enhancement processing is performed on the English sentence to form an English sentence set. The first enhancement processing retains the English sentence and adds replacement English sentences, for example by replacing the keyword where with whose.
For example, if the sentence "select avg(price) from car_table where name = 'BMW'" was constructed in the first step, the following sentence can be generated according to the above rules: average price for cars whose name is BMW.
Alternatively, the text of the English sentence may be replaced directly: the sentence is segmented into words, and synonym and antonym replacement is performed using NLTK's WordNet.
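A minimal sketch of the first enhancement processing follows. It keeps the original sentence, adds the where/whose variant, and adds one-word synonym replacements. The `SYNONYMS` table is a hypothetical stand-in for a lexical resource, and antonym replacement is left out because it would follow the same loop with a different lookup table.

```python
# Hypothetical synonym table standing in for a lexical resource; the
# entries are illustrative only.
SYNONYMS = {
    "average": ["mean"],
    "cars": ["automobiles", "vehicles"],
    "price": ["cost"],
}


def first_enhancement(sentence, synonyms=SYNONYMS):
    """Keep the original sentence and add replacement variants."""
    variants = [sentence]
    # Keyword replacement: 'where' -> 'whose'.
    tokens = sentence.split()
    if "where" in tokens:
        variants.append(" ".join("whose" if t == "where" else t for t in tokens))
    # Synonym replacement: swap one word at a time in every variant so far.
    for base in list(variants):
        base_tokens = base.split()
        for i, token in enumerate(base_tokens):
            for alt in synonyms.get(token.lower(), []):
                candidate = " ".join(base_tokens[:i] + [alt] + base_tokens[i + 1:])
                if candidate not in variants:
                    variants.append(candidate)
    return variants


if __name__ == "__main__":
    for variant in first_enhancement("average of price for cars where name is BMW"):
        print(variant)
```

Replacing one word at a time keeps each variant close to the original, which limits the risk of changing the meaning of the sentence.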
In the embodiment shown in Fig. 1, step S2 translates the English sentence set into a Chinese sentence set through a preset translation plug-in and performs second enhancement processing on the Chinese sentence set according to a preset Chinese processing rule to form text samples. Step S2 includes:
S21: traversing and translating the English sentences in the English sentence set through a preset translation plug-in to form a Chinese sentence set;
S22: performing second enhancement processing on the Chinese sentence set according to a preset Chinese processing rule to form a text sample; wherein the second enhancement processing comprises: Chinese word replacement, Chinese word insertion and Chinese word deletion.
The Chinese word replacement includes the following steps:
SA221: performing word segmentation on the Chinese sentences in the Chinese sentence set to form Chinese word segments, and randomly replacing each Chinese word segment that belongs to the data information table with any dimension information in the same column of the data information table, to form new Chinese sentences;
SA222: adding the new Chinese sentences to the Chinese sentence set.
For example, Chinese word replacement may be carried out as follows:
The Chinese sentence is first segmented by mechanical Chinese word segmentation. If the segmented words hit only a column name of the data information table, the column name is randomly replaced with another column name. If the segmented words hit only a value in the data information table, the value is randomly replaced with another value of the same type: a character string is replaced with another character string, and a numerical value with another numerical value. If the segmented words hit both a column name and a column value of the data information table, either the column value is randomly replaced with another value from the same column, or the column name and the column value are replaced together with the column name and a column value of another column.
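The three replacement cases can be sketched as follows. The sketch assumes the sentence has already been segmented into a token list, so no segmentation library is needed, and the `TABLE` contents and the `replace_words` helper are hypothetical. As a simplification, only the first column-name hit and the first value hit are considered.

```python
import random

# Hypothetical data information table; keys are column names and lists are
# the column values. The entries are illustrative only.
TABLE = {
    "名称": ["帕萨特", "CR-V", "A4L", "哈弗H6"],
    "品牌": ["大众", "本田", "奥迪", "长城"],
}


def column_of(value, table):
    """Return the column that contains the value, or None."""
    for column, values in table.items():
        if value in values:
            return column
    return None


def replace_words(tokens, table, rng):
    """Return a new token list with table hits swapped for other table entries."""
    name_hits = [i for i, t in enumerate(tokens) if t in table]
    value_hits = [i for i, t in enumerate(tokens) if column_of(t, table)]
    new_tokens = list(tokens)
    if name_hits and value_hits and rng.random() < 0.5:
        # Both hit: replace column name and value together with another column's.
        i, j = name_hits[0], value_hits[0]
        other = rng.choice([c for c in table if c != tokens[i]])
        new_tokens[i] = other
        new_tokens[j] = rng.choice(table[other])
    elif value_hits:
        # Value hit: replace with a different value from the same column.
        j = value_hits[0]
        column = column_of(tokens[j], table)
        new_tokens[j] = rng.choice([v for v in table[column] if v != tokens[j]])
    elif name_hits:
        # Only a column name hit: replace it with another column name.
        i = name_hits[0]
        new_tokens[i] = rng.choice([c for c in table if c != tokens[i]])
    return new_tokens


if __name__ == "__main__":
    rng = random.Random(0)
    print(replace_words(["品牌", "是", "奥迪", "的", "汽车"], TABLE, rng))
```

Drawing the replacement from the same column keeps the value type unchanged, so a string is never swapped for a number.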
Chinese word insertion randomly inserts a number of words together with corresponding spelling errors. Chinese word deletion includes stop word deletion and TF-IDF deletion, wherein:
Stop word deletion connects to a preset stop word database through a preset deletion program; when the deletion program finds that a word in the Chinese sentence set appears in the stop word database, that stop word is deleted from the Chinese sentence set. Stop words are characters or words that are automatically filtered out when processing natural language text in order to save space and improve efficiency, such as punctuation marks, serial number marks (e.g., (r)) and mathematical symbols. A number of stop words are deleted at random.
TF-IDF is a commonly used weighting technique in information retrieval and text mining that evaluates how important a word is to a document in a document collection or corpus. Because words with a low TF-IDF score carry little information, they can be deleted without affecting the semantics of the sentence.
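Both deletion strategies can be sketched on pre-segmented token lists. The stop word set, the function names and the unsmoothed formula idf = log(N / df) are assumptions for illustration; the source does not state which TF-IDF variant or threshold is used.

```python
import math
import random
from collections import Counter

# Hypothetical stop word list; a real system would query a stop word database.
STOP_WORDS = {"的", "了", "，", "。"}


def delete_stop_words(tokens, rng, k=1, stop_words=STOP_WORDS):
    """Randomly delete up to k stop word occurrences."""
    hits = [i for i, t in enumerate(tokens) if t in stop_words]
    drop = set(rng.sample(hits, min(k, len(hits))))
    return [t for i, t in enumerate(tokens) if i not in drop]


def tfidf_scores(doc, corpus):
    """Score each distinct token of doc with tf * idf over the corpus."""
    counts = Counter(doc)
    n_docs = len(corpus)
    scores = {}
    for token, count in counts.items():
        tf = count / len(doc)
        df = sum(1 for other in corpus if token in other)
        idf = math.log(n_docs / df)  # df >= 1 because doc belongs to corpus
        scores[token] = tf * idf
    return scores


def delete_low_tfidf(doc, corpus, n_delete=1):
    """Remove the n_delete distinct tokens with the lowest TF-IDF score."""
    scores = tfidf_scores(doc, corpus)
    # Sort by score, breaking ties by first position in the sentence.
    ranked = sorted(scores, key=lambda t: (scores[t], doc.index(t)))
    to_drop = set(ranked[:n_delete])
    return [t for t in doc if t not in to_drop]


if __name__ == "__main__":
    corpus = [
        ["奥迪", "的", "平均", "价格"],
        ["本田", "的", "平均", "价格"],
        ["奥迪", "的", "车身"],
    ]
    print(delete_low_tfidf(corpus[0], corpus))
    print(delete_stop_words(corpus[0], random.Random(0)))
```

A token that occurs in every document gets an idf of zero, so it is the first candidate for deletion.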
Step S3 splices the text samples (entity splicing) to form sample data and inputs the sample data into a preset model to be converged to obtain a text recognition model. Step S3 includes:
S31: performing secondary word segmentation on the text sample to obtain word segments and their corresponding positions;
S32: further segmenting the word segments to obtain phrases and their corresponding positions;
S33: uploading the word segments, the phrases and their corresponding positions to a knowledge base to form knowledge data;
S34: numbering the knowledge data to form sample data; retrieving the keyword meanings corresponding to the sample data, and performing extended splicing on the sample data based on the keyword meanings to form sample semantics;
S35: inputting the sample data into the model to be converged so that the model obtains a prediction label according to the knowledge data, and calculating a loss according to the prediction label and the sample semantics;
S36: returning the loss to the model to be converged to adjust its parameters according to the loss until the model converges, thereby obtaining the text recognition model.
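The position bookkeeping of step S31 and the numbering of step S34 can be sketched as follows. The sketch assumes pre-segmented tokens and a simple integer vocabulary; the phrase segmentation of step S32 and the knowledge base of step S33 are not modeled because the source does not specify them, and all function names are hypothetical.

```python
def token_spans(tokens):
    """Return (token, start, end) triples with character offsets into the joined text."""
    spans, offset = [], 0
    for token in tokens:
        spans.append((token, offset, offset + len(token)))
        offset += len(token)
    return spans


def build_vocabulary(samples):
    """Assign an integer id to every distinct token; 0 is reserved for unknown tokens."""
    vocab = {"<unk>": 0}
    for tokens in samples:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(tokens, vocab):
    """Number a token sequence using the vocabulary."""
    return [vocab.get(token, vocab["<unk>"]) for token in tokens]


if __name__ == "__main__":
    samples = [["奥迪", "的", "价格"], ["本田", "的", "价格"]]
    vocab = build_vocabulary(samples)
    print(token_spans(samples[0]))
    print(encode(["本田", "的", "车身"], vocab))
```

Keeping the character offsets alongside the ids makes it possible to map a predicted label back to the exact span of the original sentence.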
Specifically, step S31 performs word segmentation on the text sample with a word segmentation tool, and step S32 segments further to obtain the phrases and their corresponding positions. The subsequent steps number the knowledge data to form sample data, retrieve the keyword meanings corresponding to the sample data, and perform extended splicing on the sample data based on the keyword meanings to form sample semantics. The sample data is then input into the model to be converged so that the model obtains a prediction label according to the knowledge data, and a loss is calculated from the prediction label and the sample semantics; that is, the difference between the label predicted by the model and the original sample semantics is computed. The model to be converged is adjusted according to this loss until it converges, which yields the text recognition model.
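The predict, compute loss, adjust parameters cycle of steps S35 and S36 can be illustrated with a deliberately small stand-in. The source does not name the model to be converged, so the sketch swaps in a logistic regression trained by gradient descent on toy features; the feature layout, the learning rate and the convergence tolerance are all assumptions.

```python
import math


def sigmoid(z):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def train_until_converged(features, labels, lr=0.5, tol=1e-6, max_epochs=10000):
    """Fit logistic-regression weights; stop when the loss stops changing."""
    n_features = len(features[0])
    weights = [0.0] * n_features
    bias = 0.0
    previous_loss = float("inf")
    eps = 1e-12  # guards log(0)
    for _ in range(max_epochs):
        # Forward pass: predicted probability for each sample.
        preds = [
            sigmoid(sum(w * x for w, x in zip(weights, xs)) + bias)
            for xs in features
        ]
        # Cross-entropy loss between predictions and reference labels.
        loss = -sum(
            y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
            for p, y in zip(preds, labels)
        ) / len(labels)
        if abs(previous_loss - loss) < tol:
            break
        previous_loss = loss
        # Backward pass: gradient of the loss with respect to weights and bias.
        errors = [p - y for p, y in zip(preds, labels)]
        for j in range(n_features):
            grad = sum(e * xs[j] for e, xs in zip(errors, features)) / len(labels)
            weights[j] -= lr * grad
        bias -= lr * sum(errors) / len(labels)
    return weights, bias, loss


if __name__ == "__main__":
    # Toy features per token: [appears in data table, is a stop word].
    features = [[1, 0], [1, 0], [0, 1], [0, 0]]
    labels = [1, 1, 0, 0]  # 1 = entity, 0 = not an entity
    print(train_until_converged(features, labels))
```

The stopping rule compares consecutive loss values, which is one common way to decide that a model has converged; a validation-based criterion would be an equally reasonable choice.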
In the embodiment shown in Fig. 1, step S4 inputs the text to be recognized into the text recognition model for text recognition processing to obtain named entity recognition information or text recognition information about the text to be recognized. Step S4 includes:
S41: inputting the text to be recognized into the text recognition model for text recognition processing, so that basic data is generated from the text to be recognized by a basic model of the text recognition model;
S42: enhancing the basic data through an enhancement layer of the text recognition model to form data enhancement information;
S43: acquiring named entity information for the text to be recognized based on the data enhancement information.
Specifically, in step S41 the text to be recognized is input into the text recognition model, and the basic model of the text recognition model generates basic data from it. In step S42 the enhancement layer of the text recognition model enhances the basic data to form data enhancement information, and in step S43 named entity information for the text to be recognized is acquired based on the data enhancement information.
In this way, steps S1 and S2 apply simple, low-resource enhancement operations (data replacement, insertion and deletion) in combination with structured data and machine translation, so that effective text samples can be generated quickly and in a targeted manner and then used for tasks such as intelligent question answering and NL2SQL. Because the text samples are enriched quickly, the precision of the text recognition model improves, and high-precision text recognition is achieved.
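The replacement, insertion and deletion operations can be sketched as follows. This is an illustrative sketch only: the table contents, the filler word list, the rule that table values are never deleted, and all function names are assumptions, and the source does not specify how insertion and deletion positions are chosen. Replacement follows the same-column rule: a token found in the data information table is swapped for another value from the same column.

```python
import random
from typing import Dict, List, Set

# Hypothetical data information table: column name -> dimension values.
TABLE: Dict[str, List[str]] = {
    "city": ["北京", "上海", "深圳"],
    "metric": ["销售额", "利润"],
}


def replace_same_column(tokens: List[str], table: Dict[str, List[str]],
                        rng: random.Random) -> List[str]:
    """Swap one token that occurs in the table for another value of its column."""
    candidates = [(i, col) for i, tok in enumerate(tokens)
                  for col, values in table.items() if tok in values]
    out = list(tokens)
    if not candidates:
        return out
    index, column = rng.choice(candidates)
    alternatives = [v for v in table[column] if v != tokens[index]]
    if alternatives:
        out[index] = rng.choice(alternatives)
    return out


def insert_word(tokens: List[str], fillers: List[str], rng: random.Random) -> List[str]:
    """Insert one filler word at a random position (assumed insertion rule)."""
    out = list(tokens)
    out.insert(rng.randrange(len(out) + 1), rng.choice(fillers))
    return out


def delete_word(tokens: List[str], protected: Set[str], rng: random.Random) -> List[str]:
    """Delete one random token that is not a protected table value (assumed rule)."""
    deletable = [i for i, tok in enumerate(tokens) if tok not in protected]
    if not deletable:
        return list(tokens)
    drop = rng.choice(deletable)
    return [tok for i, tok in enumerate(tokens) if i != drop]


def augment(tokens: List[str], table: Dict[str, List[str]],
            fillers: List[str], seed: int = 0) -> List[List[str]]:
    """Return the original sentence plus one variant per enhancement operation."""
    rng = random.Random(seed)
    protected = {value for values in table.values() for value in values}
    return [
        list(tokens),
        replace_same_column(tokens, table, rng),
        insert_word(tokens, fillers, rng),
        delete_word(tokens, protected, rng),
    ]
```

Keeping the original sentence in the returned set mirrors the way the enhancement steps add new sentences to the existing sentence set rather than overwriting it.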
As described above, in the embodiment shown in fig. 1, the recognition method based on text enhancement provided by the present invention first converts a pre-constructed program sentence into an English sentence and performs a first enhancement process on the English sentence according to a preset English normalization processing rule to form an English sentence set. It then translates the English sentence set into a Chinese sentence set through a preset translation plug-in and performs a second enhancement process on the Chinese sentence set according to a preset Chinese processing rule to form text samples. The text samples are spliced to form sample data, and the sample data is input into a preset model to be converged to obtain a text recognition model. Finally, the text to be recognized is input into the text recognition model for text recognition processing to obtain named entity recognition information or text recognition information about the text to be recognized. In this way, an effective sample set can be generated rapidly and in a targeted manner, which improves the richness of the text samples and, in turn, the accuracy of entity labeling.
Therefore, the recognition method based on text enhancement in the embodiment of the invention has the following advantages: 1. program statements such as SQL statements are first constructed from tables and then normalized into English sentences, so that a large number of sentences can be constructed; 2. second enhancement processing (Chinese word replacement, Chinese word insertion and Chinese word deletion) is performed on the Chinese sentence set to form text samples, so that an effective sample set can be generated quickly and in a targeted manner by combining simple, low-resource replacement, insertion and deletion enhancement with structured data and machine translation; 3. the large number of text samples is applied to tasks with text recognition characteristics, such as intelligent question answering and NL2SQL, which greatly improves text recognition accuracy.
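The conversion of an SQL statement into a standardized English sentence can be illustrated with the short sketch below. It assumes a very restricted statement shape (`SELECT ... FROM ... [WHERE column op value]`), and the operator phrases, the "show the ... of ..." template and the function name `sql_to_english` are hypothetical choices made for illustration, not rules taken from the patent.

```python
import re

# Hypothetical mapping from SQL operators to English phrases.
OPERATOR_PHRASES = {
    ">=": "is at least",
    "<=": "is at most",
    "!=": "is not",
    ">": "is greater than",
    "<": "is less than",
    "=": "is",
}

STATEMENT = re.compile(
    r"\s*SELECT\s+(.+?)\s+FROM\s+(\w+)(?:\s+WHERE\s+(.+?))?\s*;?\s*",
    re.IGNORECASE,
)
CONDITION = re.compile(r"(\w+)\s*(>=|<=|!=|>|<|=)\s*(.+)")


def normalize_name(name: str) -> str:
    """Turn an identifier such as 'sales_amount' into 'sales amount'."""
    return name.strip().replace("_", " ").lower()


def sql_to_english(sql: str) -> str:
    """Drop SQL keywords, normalize names, convert the operator, then standardize."""
    match = STATEMENT.fullmatch(sql)
    if match is None:
        raise ValueError(f"unsupported statement: {sql!r}")
    columns, table, condition = match.groups()

    column_text = " and ".join(normalize_name(c) for c in columns.split(","))
    sentence = f"show the {column_text} of {normalize_name(table)}"

    if condition is not None:
        cond = CONDITION.fullmatch(condition)
        if cond is None:
            raise ValueError(f"unsupported condition: {condition!r}")
        column, operator, value = cond.groups()
        value = value.strip().strip("'\"")  # remove SQL string quotes
        sentence += f" where {normalize_name(column)} {OPERATOR_PHRASES[operator]} {value}"

    # Standardization: capitalize the first letter and end with a period.
    return sentence[0].upper() + sentence[1:] + "."
```

With these assumptions, `sql_to_english("SELECT sales_amount FROM order_info WHERE city = 'Beijing';")` returns `"Show the sales amount of order info where city is Beijing."`.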
As shown in fig. 2, the present invention provides a recognition apparatus 100 based on text enhancement, which can be installed in an electronic device. According to the implemented functions, the recognition apparatus 100 based on text enhancement can comprise a first enhancement unit 101, a second enhancement unit 102, a model acquisition unit 103 and an information recognition unit 104. A module of the present invention, which may also be referred to as a unit, refers to a series of computer program segments that are stored in a memory of the electronic device, can be executed by a processor of the electronic device, and perform a fixed function.
In the present embodiment, the functions regarding the respective modules/units are as follows:
the first enhancement unit 101 is configured to convert a pre-constructed program sentence into an English sentence, and perform first enhancement processing on the English sentence according to a preset English normalization processing rule to form an English sentence set;
the second enhancement unit 102 is configured to translate the English sentence set into a Chinese sentence set through a preset translation plug-in, and perform second enhancement processing on the Chinese sentence set according to a preset Chinese processing rule to form a text sample;
the model acquisition unit 103 is configured to splice the text samples to form sample data, and input the sample data into a preset model to be converged to obtain a text recognition model;
the information recognition unit 104 is configured to input a text to be recognized into the text recognition model for text recognition processing to acquire named entity recognition information or text recognition information about the text to be recognized.
The process by which the first enhancement unit 101 converts the pre-constructed program sentence into an English sentence and performs the first enhancement processing on the English sentence according to the preset English normalization processing rule to form an English sentence set includes:
S111: creating a data information table according to the business requirement; wherein the data information table comprises dimension information about each dimension of the business requirement;
S112: traversing the data information table to randomly select the dimension information according to a preset dimension number to form a dimension row or a dimension column;
S113: aggregating the dimension information in the dimension rows and the dimension columns to form a simple sentence;
S114: performing a secondary selection on the data information table to obtain an additional table, and adding dimension information of the additional table to the simple sentence to form a program sentence;
S121: performing sentence conversion on the program sentence to form an English short sentence; the sentence conversion comprises deleting program-type words in the program sentence, normalizing table names, column names and operator phrases, and converting operators;
S122: standardizing the English short sentence to form an English sentence;
S123: performing first enhancement processing on the English sentence based on a preset English rule to form an English sentence set; the first enhancement processing retains the English sentence and adds a replacement English sentence; the replacement English sentence is a new English sentence formed by replacing words in the English sentence with synonyms and antonyms;
The process by which the second enhancement unit 102 translates the English sentence set into a Chinese sentence set through a preset translation plug-in and performs second enhancement processing on the Chinese sentence set according to a preset Chinese processing rule to form a text sample includes:
S21: traversing the English sentence set and translating each English sentence through a preset translation plug-in to form a Chinese sentence set;
S22: performing second enhancement processing on the Chinese sentence set according to a preset Chinese processing rule to form a text sample; wherein the second enhancement processing comprises: Chinese word replacement, Chinese word insertion and Chinese word deletion;
The process of Chinese word replacement comprises the following steps:
SA221: performing word segmentation on the Chinese sentences in the Chinese sentence set to form Chinese word segments, and randomly replacing a Chinese word segment that belongs to the data information table with any dimension information in the same column of the data information table to form a new Chinese sentence;
SA222: adding the new Chinese sentence to the Chinese sentence set;
The process by which the model acquisition unit 103 splices the text samples to form sample data and inputs the sample data into a preset model to be converged to obtain a text recognition model includes:
S31: performing secondary word segmentation on the text sample to obtain word segments and the corresponding positions of the word segments;
S32: segmenting the word segments to obtain word groups and the corresponding positions of the word groups;
S33: uploading the word segments and their corresponding positions, and the word groups and their corresponding positions, to a knowledge base to form knowledge data;
S34: numbering the knowledge data to form sample data; retrieving a keyword meaning corresponding to the sample data, and performing extended splicing on the sample data based on the keyword meaning to form sample semantics;
S35: inputting the sample data into a model to be converged, so that the model to be converged obtains a prediction label according to the knowledge data, and calculating loss data according to the prediction label and the sample semantics;
S36: passing the loss data back to the model to be converged to adjust parameters of the model to be converged according to the loss data until the model to be converged converges, to obtain a text recognition model;
The process by which the information recognition unit 104 inputs a text to be recognized into the text recognition model for text recognition processing to acquire named entity recognition information or text recognition information about the text to be recognized includes:
S41: inputting a text to be recognized into the text recognition model for text recognition processing, so that the text to be recognized generates basic data through a basic model of the text recognition model;
S42: enhancing the basic data through an enhancement layer of the text recognition model to form data enhancement information;
S43: acquiring entity naming information for the text to be recognized based on the data enhancement information.
For specific implementation, reference may be made to the description of relevant steps in the embodiment corresponding to fig. 1, which is not described herein again.
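Steps S111 to S114 above, which build program sentences from a data information table, can be sketched as follows. This is one possible reading, offered as an assumption: the secondary selection is modelled as picking one further column and one of its values to form a `WHERE` clause. The table contents, the table name `order_info` and the function names are hypothetical.

```python
import random
from typing import Dict, List

# Hypothetical data information table: column (dimension) -> example values.
DATA_TABLE: Dict[str, List[str]] = {
    "city": ["Beijing", "Shanghai", "Shenzhen"],
    "product": ["phone", "laptop"],
    "sales_amount": ["100", "250"],
}


def build_program_statement(table_name: str, table: Dict[str, List[str]],
                            n_dims: int, rng: random.Random) -> str:
    """Build one SQL-like program sentence from randomly selected dimensions."""
    columns = list(table)
    # S112: randomly select a preset number of dimensions.
    selected = rng.sample(columns, k=min(n_dims, len(columns)))
    # S113: aggregate the selected dimensions into a simple sentence.
    simple = f"SELECT {', '.join(selected)} FROM {table_name}"
    # S114 (assumed reading): a second selection supplies an extra condition.
    remaining = [c for c in columns if c not in selected] or columns
    extra_column = rng.choice(remaining)
    extra_value = rng.choice(table[extra_column])
    return f"{simple} WHERE {extra_column} = '{extra_value}'"


def build_many(count: int, seed: int = 0) -> List[str]:
    """Generate several program sentences with a reproducible random source."""
    rng = random.Random(seed)
    return [build_program_statement("order_info", DATA_TABLE, 2, rng)
            for _ in range(count)]
```

Because the selection is random, calling `build_many` with a larger count yields many distinct statements from a small table, which is the point of constructing sentences from structured data.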
In the recognition apparatus based on text enhancement of the embodiment of the invention, the first enhancement unit 101 first converts a pre-constructed program sentence into an English sentence and performs first enhancement processing on the English sentence according to a preset English normalization processing rule to form an English sentence set. The second enhancement unit 102 then translates the English sentence set into a Chinese sentence set through a preset translation plug-in and performs second enhancement processing on the Chinese sentence set according to a preset Chinese processing rule to form text samples. The model acquisition unit 103 splices the text samples to form sample data and inputs the sample data into a preset model to be converged to acquire a text recognition model. Finally, the information recognition unit 104 inputs the text to be recognized into the text recognition model for text recognition processing to acquire named entity recognition information or text recognition information about the text to be recognized. In this way, an effective sample set can be generated rapidly and in a targeted manner, which improves the richness of the text samples and the accuracy of entity labeling.
The recognition apparatus based on text enhancement provided by the invention has the following advantages: 1. program statements such as SQL statements are first constructed from tables and then normalized into English sentences, so that a large number of sentences can be constructed; 2. second enhancement processing (Chinese word replacement, Chinese word insertion and Chinese word deletion) is performed on the Chinese sentence set to form text samples, so that an effective sample set can be generated quickly and in a targeted manner by combining simple, low-resource replacement, insertion and deletion enhancement with structured data and machine translation; 3. the large number of text samples is applied to tasks with text recognition characteristics, such as intelligent question answering and NL2SQL, which greatly improves text recognition accuracy.
As shown in fig. 3, the present invention provides an electronic device 1 for implementing the recognition method based on text enhancement.
The electronic device 1 may comprise a processor 10, a memory 11 and a bus, and may further comprise a computer program, such as a recognition program 12 based on text enhancement, stored in the memory 11 and executable on the processor 10.
The memory 11 includes at least one type of readable storage medium, which includes flash memory, removable hard disk, multimedia card, card-type memory (e.g., SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, etc. The memory 11 may in some embodiments be an internal storage unit of the electronic device 1, such as a removable hard disk of the electronic device 1. The memory 11 may also be an external storage device of the electronic device 1 in other embodiments, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the electronic device 1. Further, the memory 11 may also include both an internal storage unit and an external storage device of the electronic device 1. The memory 11 may be used not only for storing application software installed in the electronic device 1 and various types of data, such as the code of the recognition program based on text enhancement, but also for temporarily storing data that has been output or is to be output.
The processor 10 may be composed of an integrated circuit in some embodiments, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor 10 is the Control Unit of the electronic device; it connects the various components of the electronic device by using various interfaces and lines, and executes the various functions of the electronic device 1 and processes its data by running or executing programs or modules stored in the memory 11 (e.g., the recognition program based on text enhancement) and calling data stored in the memory 11.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connection communication between the memory 11 and at least one processor 10 or the like.
Fig. 3 shows only an electronic device with components, and it will be understood by those skilled in the art that the structure shown in fig. 3 does not constitute a limitation of the electronic device 1, and may comprise fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device 1 may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor 10 through a power management device, so as to implement functions of charge management, discharge management, power consumption management, and the like through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device 1 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device 1 may include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a Wi-Fi interface, a Bluetooth interface, etc.), which are generally used for establishing a communication connection between the electronic device 1 and other electronic devices.
Optionally, the electronic device 1 may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable for displaying information processed in the electronic device 1 and for displaying a visualized user interface, among other things.
It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.
The recognition program 12 based on text enhancement stored in the memory 11 of the electronic device 1 is a combination of instructions that, when executed by the processor 10, can implement:
converting a pre-constructed program sentence into an English sentence, and performing first enhancement processing on the English sentence according to a preset English standardized processing rule to form an English sentence set;
translating the English sentence set into a Chinese sentence set through a preset translation plug-in, and performing second enhancement processing on the Chinese sentence set according to a preset Chinese processing rule to form a text sample;
splicing the text samples to form sample data, and inputting the sample data into a preset model to be converged to obtain a text recognition model;
and inputting the text to be recognized into the text recognition model for text recognition processing so as to acquire named entity recognition information or text recognition information of the text to be recognized.
Specifically, for the way the processor 10 implements these instructions, reference may be made to the description of the relevant steps in the embodiment corresponding to fig. 1, which is not repeated here. It is emphasized that, in order to further ensure the privacy and security of the data used for recognition based on text enhancement, that data is stored in a node of the blockchain where the server cluster is located.
The server may be an independent server, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a Content Delivery Network (CDN), and a big data and artificial intelligence platform.
Further, the integrated modules/units of the electronic device 1, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer readable storage medium. The computer-readable medium may include: any entity or device capable of carrying said computer program code, recording medium, U-disk, removable hard disk, magnetic disk, optical disk, computer Memory, Read-Only Memory (ROM).
An embodiment of the present invention further provides a computer-readable storage medium, where the storage medium may be nonvolatile or volatile, and the storage medium stores a computer program, and when the computer program is executed by a processor, the computer program implements:
converting a pre-constructed program sentence into an English sentence, and performing first enhancement processing on the English sentence according to a preset English standardized processing rule to form an English sentence set;
translating the English sentence set into a Chinese sentence set through a preset translation plug-in, and performing second enhancement processing on the Chinese sentence set according to a preset Chinese processing rule to form a text sample;
splicing the text samples to form sample data, and inputting the sample data into a preset model to be converged to obtain a text recognition model;
and inputting the text to be recognized into the text recognition model for text recognition processing so as to acquire named entity recognition information or text recognition information of the text to be recognized.
Specifically, the specific implementation method of the computer program when being executed by the processor may refer to the description of the relevant steps in the text enhancement based recognition method in the embodiment, which is not described herein again.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof.
The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
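The idea of data blocks linked by a cryptographic method can be illustrated with a minimal hash-chain sketch in Python. It is a teaching example under stated assumptions, not the storage scheme of the invention: it has no consensus mechanism, networking or encryption, and the block fields and function names are hypothetical. Each block stores the SHA-256 hash of its predecessor, so altering an earlier block invalidates the chain.

```python
import hashlib
import json
from typing import Any, Dict, List

GENESIS_PREV_HASH = "0" * 64  # placeholder predecessor hash for the first block


def block_hash(block: Dict[str, Any]) -> str:
    """Hash the block's index, data and predecessor hash with SHA-256."""
    payload = json.dumps(
        {key: block[key] for key in ("index", "data", "prev_hash")},
        sort_keys=True,
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_block(chain: List[Dict[str, Any]], data: Any) -> Dict[str, Any]:
    """Create a block linked to the current tail of the chain and append it."""
    prev_hash = chain[-1]["hash"] if chain else GENESIS_PREV_HASH
    block = {"index": len(chain), "data": data, "prev_hash": prev_hash}
    block["hash"] = block_hash(block)
    chain.append(block)
    return block


def is_valid(chain: List[Dict[str, Any]]) -> bool:
    """Check every block's own hash and its link to the previous block."""
    for i, block in enumerate(chain):
        expected_prev = chain[i - 1]["hash"] if i else GENESIS_PREV_HASH
        if block["prev_hash"] != expected_prev or block["hash"] != block_hash(block):
            return False
    return True
```

Storing text samples as block data in this way makes later tampering detectable, because the recomputed hash of a modified block no longer matches the stored one.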
Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the system claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names and do not denote any particular order.
Finally, it should be noted that the above embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention is described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention.

Claims (10)

1. A recognition method based on text enhancement is characterized by comprising the following steps:
converting a pre-constructed program sentence into an English sentence, and performing first enhancement processing on the English sentence according to a preset English standardized processing rule to form an English sentence set;
translating the English sentence set into a Chinese sentence set through a preset translation plug-in, and performing second enhancement processing on the Chinese sentence set according to a preset Chinese processing rule to form a text sample;
splicing the text samples to form sample data, and inputting the sample data into a preset model to be converged to obtain a text recognition model;
and inputting the text to be recognized into the text recognition model for text recognition processing so as to acquire named entity recognition information or text recognition information of the text to be recognized.
2. The recognition method based on text enhancement as claimed in claim 1, wherein the step of pre-constructing a program sentence comprises:
creating a data information table according to the business requirement; wherein the data information table comprises dimension information about each dimension of the business requirement;
traversing the data information table to randomly select the dimension information according to a preset dimension number to form a dimension row or a dimension column;
aggregating the dimension information in the dimension rows and the dimension columns to form a simple sentence;
and performing a secondary selection on the data information table to obtain an additional table, and adding dimension information of the additional table to the simple sentence to form a program sentence.
3. The recognition method based on text enhancement as claimed in claim 2, wherein the converting the pre-constructed program sentence into an English sentence and performing the first enhancement processing on the English sentence according to the preset English normalization processing rule to form an English sentence set comprises:
performing sentence conversion on the program sentence to form an English short sentence; the sentence conversion comprises deleting program-type words in the program sentence, normalizing table names, column names and operator phrases, and converting operators;
standardizing the English short sentence to form an English sentence;
performing first enhancement processing on the English sentence based on a preset English rule to form an English sentence set; the first enhancement processing retains the English sentence and adds a replacement English sentence; the replacement English sentence is a new English sentence formed by replacing words in the English sentence with synonyms and antonyms.
4. The recognition method based on text enhancement as claimed in claim 1, wherein the translating the English sentence set into a Chinese sentence set through a preset translation plug-in, and performing second enhancement processing on the Chinese sentence set according to a preset Chinese processing rule to form a text sample comprises:
traversing the English sentence set and translating each English sentence through a preset translation plug-in to form a Chinese sentence set;
performing second enhancement processing on the Chinese sentence set according to a preset Chinese processing rule to form a text sample; wherein the second enhancement processing comprises: Chinese word replacement, Chinese word insertion and Chinese word deletion.
5. The recognition method based on text enhancement as claimed in claim 4, wherein the step of Chinese word replacement comprises:
performing word segmentation on the Chinese sentences in the Chinese sentence set to form Chinese word segments, and randomly replacing a Chinese word segment that belongs to the data information table with any dimension information in the same column of the data information table to form a new Chinese sentence;
and adding the new Chinese sentence to the Chinese sentence set.
6. The recognition method based on text enhancement as claimed in claim 1, wherein the splicing the text samples to form sample data and inputting the sample data into a preset model to be converged to obtain a text recognition model comprises:
performing secondary word segmentation on the text sample to obtain word segments and the corresponding positions of the word segments;
segmenting the word segments to obtain word groups and the corresponding positions of the word groups;
uploading the word segments and their corresponding positions, and the word groups and their corresponding positions, to a knowledge base to form knowledge data;
numbering the knowledge data to form sample data; retrieving a keyword meaning corresponding to the sample data, and performing extended splicing on the sample data based on the keyword meaning to form sample semantics;
inputting the sample data into a model to be converged, so that the model to be converged obtains a prediction label according to the knowledge data, and calculating loss data according to the prediction label and the sample semantics;
and passing the loss data back to the model to be converged to adjust parameters of the model to be converged according to the loss data until the model to be converged converges, to obtain a text recognition model.
7. The recognition method based on text enhancement as claimed in claim 1, wherein the inputting the text to be recognized into the text recognition model for text recognition processing to obtain named entity recognition information or text recognition information about the text to be recognized comprises:
inputting a text to be recognized into the text recognition model to perform text recognition processing so that the text to be recognized generates basic data through a basic model of the text recognition model;
enhancing the basic data through an enhancement layer of the text recognition model to form data enhancement information;
and acquiring entity naming information aiming at the text to be recognized based on the data enhancement information.
8. An apparatus for recognition based on text enhancement, the apparatus comprising:
the first enhancement unit is used for converting the pre-constructed program sentence into an English sentence and performing first enhancement processing on the English sentence according to a preset English standardized processing rule to form an English sentence set;
the second enhancement unit is used for translating the English sentence set into a Chinese sentence set through a preset translation plug-in and carrying out second enhancement processing on the Chinese sentence set according to a preset Chinese processing rule to form a text sample;
the model acquisition unit is used for splicing the text samples to form sample data, and inputting the sample data into a preset model to be converged to acquire a text recognition model;
and the information recognition unit is used for inputting the text to be recognized into the text recognition model for text recognition processing so as to acquire named entity recognition information or text recognition information of the text to be recognized.
9. An electronic device, characterized in that the electronic device comprises:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the steps of the text enhancement based recognition method as claimed in any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out a text enhancement based recognition method according to any one of claims 1 to 7.
CN202210070886.1A 2022-01-21 2022-01-21 Recognition method and device based on text enhancement, electronic equipment and storage medium Pending CN114398897A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210070886.1A CN114398897A (en) 2022-01-21 2022-01-21 Recognition method and device based on text enhancement, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114398897A (en) 2022-04-26

Family

ID=81232702

Family Applications (1)

Application Number Status Publication Priority Date Filing Date Title
CN202210070886.1A Pending CN114398897A (en) 2022-01-21 2022-01-21 Recognition method and device based on text enhancement, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114398897A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination