CN114416979A - Text query method, text query equipment and storage medium

Text query method, text query equipment and storage medium

Info

Publication number
CN114416979A
Authority
CN
China
Prior art keywords
text
word
model
preset
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111663305.7A
Other languages
Chinese (zh)
Inventor
焦彦嘉
王义山
谷松涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jujun Technology Co ltd
Original Assignee
Shanghai Jujun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jujun Technology Co., Ltd.
Priority to CN202111663305.7A
Publication of CN114416979A
Legal status: Pending (current)

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 — Information retrieval of unstructured textual data
    • G06F16/35 — Clustering; Classification
    • G06F16/355 — Class or cluster creation or modification
    • G06F18/00 — Pattern recognition
    • G06F18/20 — Analysing
    • G06F18/21 — Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/22 — Matching criteria, e.g. proximity measures
    • G06F18/24 — Classification techniques
    • G06F18/241 — Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 — Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F40/00 — Handling natural language data
    • G06F40/20 — Natural language analysis
    • G06F40/205 — Parsing
    • G06F40/216 — Parsing using statistical methods
    • G06F40/279 — Recognition of textual entities
    • G06F40/289 — Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a text query method, text query equipment and a storage medium. The method acquires a standard training data set and trains a target text classification model on it; the model is then called, the text to be classified is input into it for correlation matching, and the text classification result output by the model is obtained. When the text to be classified is matched to the server background, it is marked as text to be word-segmented and segmented, and the segmentation result, which includes word vectors corresponding to the segmented words, is obtained; the similarity between each word vector and a word-vector model corresponding to the words in a preset target word list is then calculated, and the corresponding category information is output. With this technical scheme, registration files are queried systematically and comprehensively, duplicate registrations can be located accurately, the misjudgment rate of duplicate detection is reduced, queries are no longer missed when a user rewords a text description, and the efficiency of querying duplicate registration files is improved.

Description

Text query method, text query equipment and storage medium
Technical Field
The present application relates to the field of information processing technologies, and in particular, to a text query method, text query equipment, and a storage medium.
Background
When financial institutions and similar parties register movable-asset information for a business, the server background must be queried to determine whether that information has already been registered: the files registered with the server background and with third-party servers are queried as text to check whether the information in the file to be registered already exists on either side, so the text registration file must be queried in advance to judge whether a duplicate registration would occur. In the prior art, when the detailed information of a text registration file is not fully recorded in the background service, a simple keyword search cannot retrieve the file; if further information is needed to verify whether the file has been registered, a third-party server holding the registration information must be queried manually, which is prone to omissions, and when approximate expressions of the same registration file exist, the input query information usually has to be checked by hand. This manual auditing has two defects: on the one hand, the auditing precision depends on the auditor's experience, so errors are frequent and efficiency is low; on the other hand, texts whose literal wording differs may refer to the same file — for example, "office appliances of company A" and "office equipment of company A" do not match literally but point to the same registration file — and in such cases neither a simple keyword search nor manual review retrieves accurate information.
Based on the prior art above, a text query method is needed that implements precise text-information matching queries for registration files.
Disclosure of Invention
The text query method of the present application performs text-similarity recognition and duplicate-text query through semantic understanding. It thereby queries registration files systematically and comprehensively, accurately identifies the platform on which a file was registered, reduces the misjudgment rate, avoids omissions caused by a user rewording the text description, and improves the efficiency of querying duplicate registration files.
A first aspect of the present invention provides a text query method, which specifically includes:
acquiring a standard training data set, wherein the training data set comprises positive samples and negative samples, the positive samples comprising a first text set stored in a server background and the negative samples comprising a second text set stored on at least one third-party server;
training a target text classification model based on the standard training data set;
calling the target text classification model, inputting the text to be classified into it for correlation matching, and acquiring the text classification result it outputs;
judging, according to the text classification result, whether the text to be classified belongs to the server background or to a third-party server, and, when the text to be classified is matched to the server background, marking it as text to be word-segmented and segmenting it, the obtained segmentation result comprising word vectors corresponding to the segmented words;
and calculating the similarity between the word vectors and a word-vector model corresponding to the words in a preset target word list, and outputting the corresponding category information.
In one possible implementation of the present application, a loss value of the first text set is obtained based on a first loss function, and a loss value of the second text set is obtained based on a second loss function;
a loss value of the standard training data set is determined based on the loss value of the first text set and the loss value of the second text set;
and the model parameter values of a preset Bert model are adjusted using the loss value of the standard training data set to train the target text classification model.
Further, on the basis of the preset Bert model training, sentence vectors and part-of-speech vectors of the text information of the standard training data set are obtained according to preset rules, and the parameter values of the fully connected layer and output layer structures are updated correspondingly to form an adjustment model;
and the adjustment model is trained iteratively according to the loss function, the loss-function value being calculated during the iterative training of the parameter values.
Further, once the values of the first loss function and the second loss function have converged, it is judged whether those values are smaller than a preset threshold;
if so, the accuracy on the test texts in the standard training data set is counted, and when that accuracy exceeds the preset accuracy, the training process ends and the model structure is saved as the target text classification model.
Further, a Softmax function is added to the output layer of the preset Bert model;
and the feature vectors corresponding to the preset labeled training data set are classified through the Softmax function to obtain a loss function, according to which the adjustment model is trained iteratively.
In one possible implementation of the application, a separator identifier is added at a designated position of the text to be classified, and the text to be queried, with the separator identifier added, is input into the target text classification model;
and the text to be classified is matched for correlation against the registered text files according to the target text classification model.
In one possible implementation of the present application, text mapping is performed on the text to be word-segmented according to a preset text query dictionary.
Further, when the text mapping is empty, the text to be word-segmented is segmented through a word segmentation model based on probability statistics to obtain the segmentation result corresponding to that text;
and the word-vector sequence corresponding to each segmented word in the segmentation result is obtained through a Word2Vec model, which converts words into vectors.
In one possible implementation of the application, the word-vector sequence is input into an offline model, and the text similarity in text space between the word vectors and the preset target-category word list trained by the offline model is calculated;
and a weight calculation over the word vectors and the preset target-category word list, based on the text similarity of identical terms, the word frequency and the text-space similarity, yields the final similarity between the word vectors and the preset target-category word list.
Further, pre-training sentences are obtained and segmented to give the corresponding word-segmentation sets, and the segmentation sets are one-hot encoded to give word-vector sets;
the central word of the word sequence corresponding to each pre-training sentence is obtained, and the central-word vector corresponding to the central word is determined;
and the word-vector sets are input into the pre-trained offline model, the central-word vector sequence corresponding to the central word being kept as the preset target-category word list.
In a second aspect of the present application, an electronic device is provided, comprising a memory for storing a processing program, and a processor which implements any of the text query methods above when executing the processing program.
In a third aspect of the present application, a readable storage medium is provided, on which a processing program is stored; when executed by a processor, the processing program implements any of the text query methods above.
The invention has the following beneficial technical effects:
1. With the technical scheme of the application, the text classification model is trained on a large preset standard training data set, so the contents of registration files pre-registered in, or already registered with, the server background are understood semantically and accurately, which enables accurate queries for registration files appearing on a third-party server.
2. By segmenting the text to be word-segmented and combining it with the offline model trained on the server background, near-synonym expressions of registered files are screened accurately and it is judged accurately whether similar expressions denote the same text; the misjudgment rate for duplicate registrations is thus reduced, duplicate text files are screened out in time, duplicate registration by users is avoided, and the accuracy and efficiency of registering files are improved.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading the detailed description of non-limiting embodiments with reference to the following drawings:
FIG. 1 is a flow diagram illustrating a text query method according to an embodiment of the present application;
FIG. 2 illustrates a schematic diagram of a Bert framework structure, according to an embodiment of the present application;
FIG. 3 illustrates a schematic flow chart diagram of a text query according to an embodiment of the present application.
Detailed Description
The present invention will be described in detail with reference to specific examples. The following examples will assist those skilled in the art in further understanding the invention, but do not limit it in any way. It should be noted that various changes and modifications apparent to those skilled in the art can be made without departing from the spirit of the invention; all such changes fall within the scope of the present invention.
The term "include" and variations thereof as used herein is meant to be inclusive in an open-ended manner, i.e., "including but not limited to". Unless specifically stated otherwise, the term "or" means "and/or". The term "based on" means "based at least in part on". The terms "one example embodiment" and "one embodiment" mean "at least one example embodiment". The term "another embodiment" means "at least one additional embodiment". The terms "first," "second," and the like may refer to different or the same object. Other explicit and implicit definitions are also possible below.
The present application aims to solve two problems of the prior art: duplicate registrations are checked by manual review, and the review of near-synonym expressions of registration files appearing in the server background is inaccurate and slow. The application therefore provides a text query method, text query equipment and a readable medium. Registered text is queried through pre-trained models, so the registration location is identified accurately and near-synonym expressions within the same server background are matched accurately; duplicate registration of the same file is avoided, the efficiency and accuracy of querying for duplicate registrations are improved, and queries are no longer missed when the user rewords a text description.
Specifically, based on the above understanding of how similar and duplicate texts are searched, a text query method applied to a registration platform for text registration files is described below.
In some embodiments of the present application, as computing power keeps increasing and large-scale corpora keep being published, more and more pre-trained models of universal language representations have emerged, and the relevant data (e.g., text registration files) can be retrieved and processed with artificial-intelligence techniques. Artificial Intelligence (AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by one, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results.
Specifically, fig. 1 shows a flow diagram of a text query according to an embodiment of the present application:
step 100: acquiring a standard training data set, wherein the training data set comprises a positive sample and a negative sample, and the positive sample comprises a first text set which is pre-stored in a server background; the negative examples include a second set of text pre-stored at the at least one third-party server. It can be understood that before training the registered file as a standard training data set, the registered file needs to be recognized, an OCR (Optical Character Recognition) model can be used to recognize and upload the text registered file stored in the background of the server and the text registered file stored in the third-party server, wherein the information of the text registered file can be text information such as characters, letters, numbers, and the like, and a method based on semantic understanding realizes that sufficient data set information is needed for pre-training the standard training data set, and the data set at least includes texts of all relevant text information registered in the background of the server as positive samples and texts of relevant text information appearing in the third-party server as negative samples, and the semantic Recognition is realized by sufficient preset training data set.
In some embodiments of the present application, the file to be registered is scanned with OCR, image information of the original registration file is collected, and the file is identified from the collected information. Specifically, the OCR step may use an open-source model such as EasyOCR, ChineseOCR or PaddleOCR, without limitation.
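By way of a non-limiting illustration, a minimal sketch of this OCR step with the open-source PaddleOCR package might look as follows; the file name and language setting are assumptions, and the exact result layout can differ between PaddleOCR versions:

```python
from paddleocr import PaddleOCR

# Chinese-language OCR model; "registration_doc.png" is a hypothetical scanned file
ocr = PaddleOCR(use_angle_cls=True, lang="ch")
result = ocr.ocr("registration_doc.png", cls=True)

# Collect the recognized text lines (layout follows PaddleOCR 2.x:
# pages -> lines -> (bounding box, (text, confidence)))
lines = [line[1][0] for page in result for line in page]
registered_text = " ".join(lines)
```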
In some embodiments of the present application, the positive-sample labels include but are not limited to words, letters and numbers; in one example the positive-sample label may be set to 1. Likewise, the negative-sample labels include but are not limited to words, letters and numbers; in one example the negative-sample label may be set to 0. Further, in some embodiments most text registration files are registered in the server background and only a small part on third-party platforms; if the preset standard training data set mirrored this imbalance, data information would be neglected and the trained model would, at query time, preferentially recognize any text registration file as one to be registered in the server background. To keep the training samples fair and the query accurate, the pre-trained model should therefore raise the proportion of third-party-server samples as much as possible during training, e.g. as in the balancing sketch below.
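A non-limiting sketch of assembling such a labeled, rebalanced data set — the label convention (1 for the server background, 0 for a third-party server) follows the example above, and simple oversampling is one assumed balancing strategy:

```python
import random

def build_dataset(background_texts, third_party_texts):
    # Positive samples: texts registered in the server background (label 1)
    positives = [(text, 1) for text in background_texts]
    # Negative samples: texts registered on third-party servers (label 0)
    negatives = [(text, 0) for text in third_party_texts]
    # Oversample the minority third-party class so the trained model is not
    # biased toward answering "server background" for every query
    if negatives and len(negatives) < len(positives):
        negatives += random.choices(negatives, k=len(positives) - len(negatives))
    dataset = positives + negatives
    random.shuffle(dataset)
    return dataset
```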
In some embodiments of the present application, the server background may be a central registration platform, and the third-party server may be the server background of a third-party platform such as a payables-chain platform, without limitation.
Step 200: train the target text classification model based on the standard training data set. It can be understood that the aim of pre-training a target text classification model is to pre-train a language model on a large-scale unlabeled corpus to obtain universal, context-dependent feature representations, initialize the model with those representations, and finally fine-tune its parameter values on the specific downstream task to achieve a better model effect. A pre-trained model learns universal language representations from large-scale corpora, which improves the model's generalization ability and speeds up its convergence; the semantic-understanding method can thereby query registered files systematically and comprehensively to prevent duplicate registrations.
In step 200 above, the loss value of the first text set is obtained based on a first loss function, and the loss value of the second text set based on a second loss function; the loss value of the standard training data set is determined from the loss value of the first text set and the loss value of the second text set; and the model parameter values of a preset Bert model are adjusted with the loss value of the standard training data set to train the target text classification model.
Specifically, fig. 2 is a schematic diagram of the architecture of the Bert pre-training model. It can be understood that the Bert model has already been trained on a large corpus; for example, the lightweight model Bert-Base has a 12-layer network structure, 768 hidden units and 12 attention heads, for a total of 110M parameters. To support a variety of downstream tasks, the trained preset Bert model uses a universal input representation, in which "[CLS]" is a learnable identifier that captures the global information of the text input and "[SEP]" merely separates input 1 from input 2. The preset Bert model feeds this universal input representation into a bidirectional Transformer encoder, and a bidirectional representation of the context is obtained by training on the text in both the left-to-right and right-to-left directions.
In some embodiments of the present application, in order to finally fine-tune the parameters on the specific downstream task for a better model effect, Bert is used as the pre-training model to extract global text features and is then fine-tuned, the fine-tuning being a supervised learning process. Specifically, the labeled standard training data set provides the samples for fine-tuning the Bert model. The loss value produced by the first loss function represents the difference between the true and predicted text categories on the first text set, and the loss value produced by the second loss function the corresponding difference on the second text set; the larger the first and/or second loss value, the larger the gap between true and predicted categories on the respective text set, and the worse the classification effect of the target text classification model during training. On this basis, the loss-function value of the fine-tuned model is computed from the first and/or second loss values, and the target classification model can be adjusted according to that value so that the adjusted model is used in the next training round.
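The patent gives no code for this step; as a hedged sketch only, a fine-tuning loop in PyTorch with the Hugging Face transformers library could combine the two per-set loss terms as follows. The checkpoint name bert-base-chinese, the learning rate and the sequence length are assumptions:

```python
import torch
import torch.nn.functional as F
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertForSequenceClassification.from_pretrained("bert-base-chinese", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def training_step(texts, labels):
    enc = tokenizer(texts, padding=True, truncation=True, max_length=128,
                    return_tensors="pt")
    logits = model(**enc).logits
    labels = torch.tensor(labels)
    pos, neg = labels == 1, labels == 0
    # First loss: first text set (server background); second loss: second text set
    loss_first = F.cross_entropy(logits[pos], labels[pos]) if pos.any() else 0.0
    loss_second = F.cross_entropy(logits[neg], labels[neg]) if neg.any() else 0.0
    loss = loss_first + loss_second  # loss value of the standard training data set
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)
```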
In some embodiments of the application, on the basis of the preset Bert model training, sentence vectors and part-of-speech vectors of the text information in the standard training data set are obtained according to preset rules, and the parameter values of the fully connected layer and output layer structures are updated correspondingly to form an adjustment model; the adjustment model is trained iteratively according to the loss function, and the loss-function value is calculated during the iterative training of the parameter values. It can be understood that the fixed Bert model parameters are loaded first, i.e. the initial values of the model parameters are not random but come from the existing parameter values. The same network structure and the same parameters are used, and the model is trained on the specific classification task; in this process the names and number of the parameters do not change — only their values do. As the parameter values change, the sentence vectors and part-of-speech vectors of the standard training data are obtained according to the preset rules, and the parameter values of the fully connected layer and output layer structures of the Bert model are then adjusted.
In some embodiments of the present application, further, when the value of the loss function converges, it is judged whether that value is smaller than a preset loss-function threshold; if so, the accuracy on the test texts of the standard training data set is counted, and when that accuracy exceeds the preset accuracy, the training process ends and the model structure is saved as the target text classification model. It can be understood that, when training the target text classification model, the pre-acquired training data set may be split in a certain ratio into the first or second text set used as training data and the test texts used as the test set. For example, if 1000 samples are acquired, they may be split according to the accuracy and precision actually required of the target text classification model, e.g. in a training-to-test ratio of 7:3 or 8:2, without limitation.
Further, a Softmax function is added to the output layer of the preset Bert model; the feature vectors corresponding to the preset labeled training data set are classified through the Softmax function to obtain the loss function, and the adjustment model is trained iteratively according to that loss function.
It can be understood that the model parameters are adjusted according to the loss function and iterative training continues until the model is trained to a certain level. The values of the first and second loss functions are calculated from the prediction results, the preset Bert model is updated iteratively according to them, and it is determined whether the iteratively updated Bert model has converged. Further, the loss function can be the common multi-class Softmax loss: while the parameters are trained iteratively, the model is trained and the loss value is computed at the same time, and the current model's accuracy on the test set is counted on the labeled test corpus; once the loss value has converged and reached the preset threshold, the semantic-understanding accuracy is counted, and when it exceeds the preset accuracy the training of the target text classification model can be considered complete.
In some embodiments of the present application, during training of the target text classification model, the sign that the model has finished training can be taken to be that the loss functions of the first text set and of the second text set (used as the training set) are in a converged state at the same time. Specifically, once the values of the first and second loss functions have converged, it is judged whether they are smaller than a preset threshold, where the preset thresholds of the first and second loss functions may be the same value; besides the loss values being below the preset threshold, the accuracy on the test texts must exceed the preset accuracy. For example, if 10 test texts comprise 5 samples registered on a third-party server and 5 registered in the server background, and 9 of them are understood correctly, the accuracy of the target text classification model on the classification query reaches 90%; the training of the target text classification model is then considered finished, and input texts to be classified can be classified accurately by this text classification model.
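A minimal sketch of such a stopping criterion; the convergence window and loss threshold are assumptions, and the 0.9 accuracy bound mirrors the 90% example above:

```python
def training_finished(loss_history, test_accuracy,
                      loss_threshold=0.05, accuracy_threshold=0.9, window=5):
    # Converged: the last few loss values barely change
    recent = loss_history[-window:]
    converged = len(recent) == window and max(recent) - min(recent) < 1e-3
    # Finish only when converged, below the preset loss threshold,
    # and accurate enough on the held-out test texts
    return converged and recent[-1] < loss_threshold and test_accuracy >= accuracy_threshold
```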
Step 300: call the target text classification model, input the text to be classified into it for correlation matching, and acquire the text classification result it outputs. It can be understood that when a user needs to pre-register a text registration file, the places where the file may already be registered must be classified: the text to be classified is input into the target text classification model, which has been trained on a large amount of standard training data and can predict, through semantic understanding, where the input text may appear; when a case registered on a newly appearing third-party platform arises, the user can also be prompted accurately.
In step 300 above, a separator identifier is added at a designated position of the text to be classified, and the text to be queried, with the separator identifier added, is input into the target text classification model; the text to be classified is then matched for correlation against the registered text files according to the model. It can be understood that the information to be input into the server background may be segmented according to the type and position of the input text, or the text to be classified may be cut into several short texts of a predetermined length. For example, when the input query information contains the feature information of several registration files, the input text is segmented at the corresponding feature positions according to the information types or semantics of those files, and the segmented texts are then input into the target text classification model for correlation matching. Specifically, the target text classification model can effectively capture the context information of the text to be classified and recognize ambiguous words in the query text; that is, the fine-tuned model accurately obtains the several word-vector sequences corresponding to the segmented texts and generates the feature vectors corresponding to those sequences; the classification result for the text to be classified is then obtained from the feature vectors, and the registration location of the registration file is judged against a preset similarity threshold.
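A hedged inference sketch reusing the model and tokenizer from the fine-tuning sketch above (the tokenizer inserts the "[CLS]" and "[SEP]" identifiers automatically; the label convention is the assumed one from the data-set sketch):

```python
import torch

def classify(texts):
    model.eval()
    enc = tokenizer(texts, padding=True, truncation=True, max_length=128,
                    return_tensors="pt")  # adds the [CLS]/[SEP] separator identifiers
    with torch.no_grad():
        probs = model(**enc).logits.softmax(dim=-1)
    # 1 -> registered in the server background, 0 -> third-party server (assumed)
    return probs.argmax(dim=-1).tolist()
```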
Step 400: judge, according to the text classification result, whether the text to be classified belongs to the server background or to a third-party server; when it is matched to the server background, mark it as text to be word-segmented and segment it, the obtained segmentation result comprising word vectors corresponding to the segmented words. It can be understood that real text data contain redundant information, default values, noise and the like, and all the data in this application are unstructured, so data preprocessing is an indispensable link of the whole classification model. In the preprocessing step the text undergoes word segmentation, stop-word removal and similar operations. Word segmentation, which converts continuous text into a set of words, is indispensable in text preprocessing; in this application it may be performed with a word segmentation tool such as jieba. Meaningless stop words — words such as "the", "get", "this" and "that" that carry little information and merely reflect the grammatical structure of the sentence — are also removed. The text is then segmented, the list is traversed, and the cosine similarity of each pair of words is calculated with the model; if it exceeds a preset threshold, the words are considered a similar-word match.
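A minimal segmentation-and-filtering sketch with the open-source jieba tool; the stop-word list is purely illustrative:

```python
import jieba

# Illustrative stop words ("的/得/这/那" roughly correspond to "the/get/this/that")
STOPWORDS = {"的", "得", "这", "那"}

def segment(text):
    # Cut the continuous text into words, dropping stop words and whitespace tokens
    return [w for w in jieba.lcut(text) if w.strip() and w not in STOPWORDS]

print(segment("A公司办公设备"))  # hypothetical registration-file text
```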
Before step 400 above, text mapping is performed on the text to be word-segmented according to a preset text query dictionary. It can be understood that, once the registration location has been judged, against the preset similarity threshold, to be the server background, a dictionary of similar and associated words can be preset from the text information likely stored there; text mapping between this preset query dictionary and the input text information then checks whether the text information is already contained in the dictionary, as a preliminary screen. When the information matches exactly, the registration is judged to be a duplicate of information in the server background; otherwise a pre-trained offline model takes over for further judgment.

Step 500: calculate the similarity between the word vectors and the word-vector model corresponding to the words in the preset target word list, and output the corresponding category information. It can be understood that the query range is now limited to texts to be word-segmented within the server background; when no text mapping is found, there is no trivially duplicated file registration, so the text to be segmented must be further converted into word vectors, whose similarity to the preset target word list stored in the pre-trained offline model is then calculated; when the final text similarity exceeds the preset similarity, the file is considered to be one already registered in the server background.
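A hedged sketch of the preliminary dictionary screen described above; the dictionary contents are hypothetical, and a None result (an "empty" mapping) is what sends the query on to the offline model of step 500:

```python
# Hypothetical preset text query dictionary of similar and associated expressions
QUERY_DICT = {
    "A公司办公用品": "A公司办公设备",  # "office appliances" -> "office equipment" of company A
}

def map_text(text):
    # Exact hit: the registration duplicates server-background information;
    # None: fall through to the pre-trained offline model for further judgment
    return QUERY_DICT.get(text)
```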
It can be understood that, when the text mapping is empty, the text to be word-segmented is segmented by a word segmentation model based on probability statistics to obtain the segmentation result for that text, and the word-vector sequence corresponding to each segmented word in the result is obtained through a Word2Vec model, which converts words into vectors. Word2Vec is an efficient tool for representing words as real-valued vectors; after all segmented words in the result are converted into word-vector sequences, special words, sensitive words, single characters, special symbols and the like can be filtered out effectively, which improves the accuracy of the subsequent extraction of the text's central idea.
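A minimal Word2Vec sketch using the gensim library and the segment() helper from the jieba sketch above; the toy corpus, vector size and window are assumptions:

```python
from gensim.models import Word2Vec

# corpus: a list of token lists, e.g. produced by segment() above
corpus = [segment(t) for t in ["A公司办公设备", "A公司办公用品"]]

w2v = Word2Vec(sentences=corpus, vector_size=100, window=1,
               min_count=1, sg=1)  # sg=1 selects the skip-gram structure

def to_vectors(text):
    # Word-vector sequence for each segmented word found in the vocabulary
    return [w2v.wv[w] for w in segment(text) if w in w2v.wv]
```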
In some embodiments of the present application, the word-vector sequence is input into the offline model, and the text similarity, in text space, between the word vectors and the preset target-category word list trained by the offline model is calculated; a weight calculation over the word vectors and the preset target-category word list — based on the text similarity of identical terms, the word frequency and the text-space similarity — then yields the final similarity between the word vectors and the preset target-category word list. It can be understood that when cosine similarity is used as the measure, its value range is [0, 1] and a larger value means higher similarity; the threshold may be set by the user, e.g. words whose value exceeds 0.8 may appear as registered in the search result. Further, the final text similarity is obtained by weighting the text to be segmented against the offline model's preset target-category word list according to the text similarity of identical terms, the word frequency and the text-space similarity.
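A hedged sketch of the weighted final-similarity computation; the 0.7/0.3 weighting is an assumption, and the 0.8 cutoff follows the example above:

```python
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-9))

def final_similarity(query_vecs, target_vecs, query_tokens, target_tokens, alpha=0.7):
    if not query_vecs or not target_vecs:
        return 0.0
    # Text-space similarity: best cosine match per query word vector, averaged
    space_sim = np.mean([max(cosine(q, t) for t in target_vecs) for q in query_vecs])
    # Identical-term overlap between the token lists (a crude term/frequency weight)
    shared = set(query_tokens) & set(target_tokens)
    term_sim = len(shared) / max(len(set(query_tokens) | set(target_tokens)), 1)
    return alpha * space_sim + (1 - alpha) * term_sim  # alpha is an assumed weight

# A final similarity above the user-set threshold (e.g. 0.8) marks a duplicate file
```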
In some embodiments of the application, pre-training sentences are obtained and segmented to give the corresponding word-segmentation sets, and the segmentation sets are one-hot encoded to give word-vector sets; the central word of the word sequence corresponding to each pre-training sentence is obtained, and the central-word vector corresponding to it is determined; the word-vector sets are input into the pre-trained offline model, and the central-word vector sequence corresponding to the central word is kept as the preset target-category word list. It can be understood that the pre-training sentences may be obtained from the first text set, or from all text information input into the server background, without limitation. In some embodiments of the present application, the offline model taking the word vectors may adopt the continuous bag-of-words (CBOW) or skip-gram structure of the word2vec algorithm as the word-embedding model; specifically, the central word may be predicted from its surrounding words, without limitation.
In some embodiments of the application, when the skip-gram model is used to compute similar words for keywords, one keyword in the keyword set is chosen as the central word, a sliding window of preset size captures the central word's context words, and the context words centered on the central word within the window's range are generated. Suppose the sentence has three words, ["She", "likes", "animals"], the sliding-window size is skip_window = 1, and the central word is "likes"; taking context words at distance no greater than 1 from it gives the context words "She" and "animals".
Specifically, the central word and each context word are one-hot encoded to form a word matrix, giving the word-vector set. For example, the one-hot encoding of ["She", "likes", "animals"] initializes the word vectors with an N-bit state register encoding N states, written as {0, 0, …, 1, …, 0, 0}: each state is represented by its own independent register bit, and only one bit is active at any time, as given in Table 1:
TABLE 1 One-hot encoding
she      [1, 0, 0]
likes    [0, 1, 0]
animals  [0, 0, 1]
The word vectors obtained after encoding form the word matrix. The word vectors of the central words are taken out according to the central word and each context word, and the loss function adopts hierarchical softmax, a classification loss function suited to many classes. The word vectors of the intermediate process are retained as the numerical representation of the trained word vectors. The text similarity in text space between these word vectors and the preset target-category word list trained by the offline model is then calculated with the cosine-similarity algorithm; the weight matrix between the mapping layer and the hidden layer of the word-embedding model is initialized, the text-space similarity is weighted to obtain the result for the central word and each context word, and the final similarity between the word vectors and the preset target-category word list follows; from that similarity it is finally and accurately judged whether the input text to be word-segmented is a file already registered in the server background.
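A minimal sketch of the sliding-window context extraction and one-hot encoding just described; it reproduces the "She / likes / animals" example and Table 1:

```python
import numpy as np

vocab = ["She", "likes", "animals"]
one_hot = {w: np.eye(len(vocab))[i] for i, w in enumerate(vocab)}  # matches Table 1

def context_pairs(tokens, skip_window=1):
    # Yield (central word, context word) pairs within the sliding window
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - skip_window), min(len(tokens), i + skip_window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, tokens[j]

# With "likes" as the central word, the context words are "She" and "animals"
print([pair for pair in context_pairs(vocab) if pair[0] == "likes"])
```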
In some embodiments of the present application, an electronic device is also provided. The electronic device comprises a memory for storing a processing program and a processor, wherein the processor implements any of the text query methods above when executing the processing program.
In some embodiments of the present application, a readable storage medium is further provided, on which a processing program is stored; when executed by a processor, the processing program implements any of the text query methods above.
In some embodiments of the present application, as shown specifically in fig. 3, the input query text may be the feature information of a text to be registered. Specifically, the target text classification model makes the preliminary judgment of whether the registered text belongs to a third-party server, and when the text to be queried belongs to a server-background registration, the offline model makes the further, precise judgment on possible approximate or similar text inputs.
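Tying the sketches above together, a hedged end-to-end query routine following the flow of fig. 3; the function names and the 0.8 threshold are the assumed ones from the earlier sketches:

```python
def query(text, target_vecs, target_tokens, threshold=0.8):
    # Preliminary judgment of the registration location (step 300)
    if classify([text])[0] == 0:
        return "registered on a third-party server"
    # Preliminary screen against the preset text query dictionary
    if map_text(text) is not None:
        return "duplicate: exact match in the server background"
    # Segment, vectorize and compare with the offline model (steps 400-500)
    tokens = segment(text)
    vecs = to_vectors(text)
    sim = final_similarity(vecs, target_vecs, tokens, target_tokens)
    return "duplicate registration" if sim > threshold else "not yet registered"
```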
It can be understood that the implementation of the text query method can be stored in a computer-readable storage medium. With this understanding, the technical solution of the present invention may be embodied in the form of a software product, stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server or a network device) to execute all or part of the steps of the method of the embodiments of the present invention. The aforementioned storage medium includes: a USB flash disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk and other media capable of storing program code.
A computer readable storage medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable storage medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code implementing aspects disclosed herein may be written in any combination of one or more programming languages, including object-oriented languages such as Java and C++ and conventional procedural languages such as C. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the latter case, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or to an external computing device (e.g., through the Internet using an Internet service provider).
The foregoing description of specific embodiments of the present invention has been presented. It is to be understood that the present invention is not limited to the specific embodiments described above, and that various changes or modifications may be made by one skilled in the art within the scope of the appended claims without departing from the spirit of the invention. The embodiments and features of the embodiments of the present application may be combined with each other arbitrarily without conflict.
The embodiments of the present disclosure are described in detail above with reference to the drawings, but the present disclosure is not limited to the above embodiments. Even if various changes are made to the present disclosure, the changes are still within the scope of the present disclosure if they fall within the scope of the claims of the present disclosure and their equivalents.
In summary, with the technical scheme provided by the application, the target text classification model is trained on a large preset standard training data set, so the contents of registration files pre-registered in, or already registered with, the server background are understood semantically and accurately, and registration files appearing on a third-party server are queried accurately. Furthermore, by segmenting the text to be word-segmented and combining it with the offline model trained on the server background, near-synonym expressions of registered files are screened accurately, it is judged accurately whether similar expressions denote the same text, the misjudgment rate for duplicate registrations is reduced, duplicate text files are screened out in time, duplicate registration by users is avoided, and the accuracy and efficiency of registering files are improved.
In the systems and methods of the present disclosure, it is apparent that individual components or steps may be decomposed and/or recombined; these decompositions and/or recombinations are to be considered equivalents of the present disclosure. Also, the steps of the series of processes described above may naturally be executed chronologically in the order described, but need not be: some steps may be performed in parallel or independently of each other.

Claims (12)

1. A text query method, characterized by comprising the following steps:
acquiring a standard training data set, wherein the training data set comprises positive samples and negative samples, the positive samples comprising a first text set pre-stored in a server background and the negative samples comprising a second text set pre-stored on at least one third-party server;
training a target text classification model based on the standard training data set;
calling the target text classification model, inputting the text to be classified into the target text classification model for correlation matching, and acquiring the text classification result output by the target text classification model;
judging, according to the text classification result, whether the text to be classified belongs to the server background or to a third-party server, and, when the text to be classified is matched to the server background, marking it as text to be word-segmented and segmenting it, the obtained segmentation result comprising word vectors corresponding to the segmented words;
and calculating the similarity between the word vectors and a word-vector model corresponding to words in a preset target word list, and outputting corresponding category information.
2. The method of claim 1, wherein training a target text classification model based on the standard training data set comprises:
obtaining a loss value of the first text set based on a first loss function, and a loss value of the second text set based on a second loss function;
determining a loss value of the standard training data set based on the loss value of the first text set and the loss value of the second text set;
and adjusting the model parameter values of a preset Bert model using the loss value of the standard training data set, and training the target text classification model.
3. The method of claim 2, wherein adjusting the model parameter values of the preset Bert model using the loss value of the standard training data set comprises:
on the basis of the preset Bert model training, obtaining sentence vectors and part-of-speech vectors of the text information of the standard training data set according to preset rules, and correspondingly updating the parameter values of the fully connected layer and output layer structures to form an adjustment model;
and iteratively training the adjustment model according to the loss function, and calculating the loss-function value during the iterative training of the parameter values.
4. The method of claim 3, wherein training the target text classification model comprises:
judging, once the values of the first loss function and the second loss function have converged, whether those values are smaller than a preset threshold;
if so, counting the accuracy on the test texts in the standard training data set;
and, when the accuracy on the test texts exceeds the preset accuracy, ending the training process and saving the model structure as the target text classification model.
5. The method of claim 4, wherein iteratively training the adjustment model according to the loss function comprises:
adding a Softmax function to the output layer of the preset Bert model;
and classifying the feature vectors corresponding to the preset labeled training data set through the Softmax function to obtain the loss function, and iteratively training the adjustment model according to the loss function.
6. The method of claim 1, wherein inputting the text to be classified into the target text classification model for correlation matching comprises:
adding a separator identifier at a designated position of the text to be classified, and inputting the text to be queried, with the separator identifier added, into the target text classification model;
and matching the text to be classified for correlation against the registered text files according to the target text classification model.
7. The method of claim 1, wherein, before the text to be word-segmented is segmented, the method comprises:
performing text mapping on the text to be word-segmented according to a preset text query dictionary.
8. The text query method according to claim 7, wherein segmenting the text to be word-segmented to obtain segmentation results comprising the word-vector sequences corresponding to the segmented words comprises:
when the text mapping is empty, segmenting the text to be word-segmented through a word segmentation model based on probability statistics to obtain the segmentation result corresponding to the text;
and obtaining, through a Word2Vec model which converts words into vectors, the word-vector sequence corresponding to each segmented word in the segmentation result.
9. The method of claim 1, wherein calculating the similarity between the word vectors and the word-vector model corresponding to the words in the preset target word list comprises:
inputting the word-vector sequence into an offline model, and calculating the text similarity in text space between the word vectors and the preset target-category word list trained by the offline model;
and performing a weight calculation over the word vectors and the preset target-category word list based on the text similarity of identical terms, the word frequency and the text-space similarity, to obtain the final similarity between the word vectors and the preset target-category word list.
10. The method of claim 9, wherein obtaining the preset target-category word list trained by the offline model comprises:
obtaining pre-training sentences and segmenting them to obtain corresponding word-segmentation sets, and one-hot encoding the segmentation sets to obtain word-vector sets;
obtaining the central word of the word sequence corresponding to each pre-training sentence, and determining the central-word vector corresponding to the central word;
and inputting the word-vector sets into the pre-trained offline model, and keeping the central-word vector sequence corresponding to the central word as the preset target-category word list.
11. An electronic device, comprising:
a memory for storing a processing program;
a processor which implements the text query method of any one of claims 1 to 10 when executing the processing program.
12. A readable storage medium, having a processing program stored thereon, the processing program, when executed by a processor, implementing the text query method according to any one of claims 1 to 10.
CN202111663305.7A 2021-12-30 2021-12-30 Text query method, text query equipment and storage medium Pending CN114416979A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111663305.7A CN114416979A (en) 2021-12-30 2021-12-30 Text query method, text query equipment and storage medium


Publications (1)

Publication Number Publication Date
CN114416979A true CN114416979A (en) 2022-04-29

Family

ID=81270603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111663305.7A Pending CN114416979A (en) 2021-12-30 2021-12-30 Text query method, text query equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114416979A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115759734A (en) * 2022-10-19 2023-03-07 国网物资有限公司 Index-based power service supply chain monitoring method, device, equipment and medium
CN115759734B (en) * 2022-10-19 2024-01-12 国网物资有限公司 Index-based power service supply chain monitoring method, device, equipment and medium
CN116010602A (en) * 2023-01-10 2023-04-25 孔祥山 Data optimization method and system based on big data
CN116010602B (en) * 2023-01-10 2023-09-29 湖北华中电力科技开发有限责任公司 Data optimization method and system based on big data
CN116029291A (en) * 2023-03-29 2023-04-28 摩尔线程智能科技(北京)有限责任公司 Keyword recognition method, keyword recognition device, electronic equipment and storage medium
CN117292338A (en) * 2023-11-27 2023-12-26 山东远东保险公估有限公司 Vehicle accident identification and analysis method based on video stream analysis
CN117292338B (en) * 2023-11-27 2024-02-13 山东远东保险公估有限公司 Vehicle accident identification and analysis method based on video stream analysis


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination