CN117332038A - Text information detection method, device, equipment and storage medium - Google Patents

Text information detection method, device, equipment and storage medium

Publication number
CN117332038A
Authority
CN
China
Prior art keywords
word
detection
sequence
text
weight
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311214190.2A
Other languages
Chinese (zh)
Inventor
方滨兴
张民
贾焰
顾钊铨
张欢
李晶
陈科海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peng Cheng Laboratory
Original Assignee
Peng Cheng Laboratory
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peng Cheng Laboratory
Priority to CN202311214190.2A
Publication of CN117332038A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3346 Query execution using probabilistic model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The embodiment of the application provides a text information detection method, a device, equipment and a storage medium, wherein at least one word sequence of a text to be detected is obtained; then generating a weight sequence of the word sequence based on word weight of the text word in the word sequence, selecting a mask word from the text word according to the weight sequence, generating a mask sequence of the word sequence according to the mask word, inputting the mask sequence into at least one first detection model for first detection processing to obtain a first detection score of the mask sequence, and obtaining a detection probability vector of the mask sequence based on the first detection score; and sequentially inputting the detection probability vector corresponding to each word sequence into a second detection model to carry out second detection processing, obtaining a second detection score, and obtaining a detection result of the text to be detected based on the second detection score, thereby improving the accuracy of text information detection.

Description

Text information detection method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a storage medium for detecting text information.
Background
Currently, the industry and academia at home and abroad are widely researching large-scale text models, and various different architectures and large-scale text models aiming at different vertical fields are emerging. Large text models can generate high quality text, but cannot guarantee that the generated content meets social criteria, legal regulations, and ethical standards. Thus, with the widespread use of large text models, efficient compliance detection of the text content they generate is necessary.
In the related art, compliance detection for text content typically employs pattern matching and the formulation of relevant rules to identify illegal words and sensitive information in the text content. However, because text generated by a large-scale language model is of high quality and reaches the level of ordinary human writing, compliance detection based on pattern matching and rule formulation has low accuracy on such text.
Disclosure of Invention
The embodiment of the application provides a text information detection method, device, equipment and storage medium, which can improve the accuracy of text information detection.
In order to achieve the above object, a first aspect of an embodiment of the present application provides a text information detection method, including:
Acquiring at least one word sequence of a text to be tested, wherein the word sequence comprises at least one text word;
generating a weight sequence of the word sequence based on the word weight of the text word in the word sequence, selecting a mask word from the text word according to the weight sequence, and generating a mask sequence of the word sequence according to the mask word;
inputting the mask sequence into at least one first detection model for first detection processing to obtain a first detection score of the mask sequence, and obtaining a detection probability vector of the mask sequence based on at least one first detection score;
and sequentially inputting the detection probability vectors corresponding to each word sequence into a second detection model to carry out second detection processing, obtaining second detection scores consistent with the number of the word sequences, and obtaining a detection result of the text to be detected based on the second detection scores.
In some embodiments, the generating a weight sequence of the word sequence based on the word weights of the text words in the word sequence comprises:
obtaining the text weight of each text word based on the word frequency of each text word in the word sequence;
selecting the distinct text words, in the order in which the text words appear, to form a text word order;
and generating a weight element of each text word in the weight sequence based on the text word order.
In some embodiments, the generating a weight element for each of the text words in the weight sequence based on the text word order includes:
generating the weight element of the first text word according to the text weight of the first text word;
and adding the weight element of the previous text word and the text weight of the current text word, based on the text word order, to obtain the weight element of the current text word.
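The construction of the weight sequence described above can be sketched as follows. This is only an illustration under stated assumptions: the text says the text weight is obtained "based on the word frequency" without giving a formula, so relative frequency is assumed here, and the function name `build_weight_sequence` is hypothetical.

```python
from collections import Counter
from itertools import accumulate


def build_weight_sequence(word_sequence):
    """Return (distinct_words, weight_sequence) for one word sequence."""
    counts = Counter(word_sequence)
    total = len(word_sequence)
    # Distinct text words in order of first appearance (dicts keep insertion order).
    distinct_words = list(dict.fromkeys(word_sequence))
    # Assumption: text weight = relative word frequency within the word sequence.
    text_weights = [counts[w] / total for w in distinct_words]
    # First weight element = first text weight; every later weight element =
    # previous weight element + current text weight (a running sum).
    weight_sequence = list(accumulate(text_weights))
    return distinct_words, weight_sequence


words, weights = build_weight_sequence(["a", "b", "a", "c"])
```

Because each weight element is a running sum, the weight sequence is non-decreasing, and under the relative-frequency assumption its last element is 1.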
In some embodiments, the selecting a mask word from the text words according to the weight sequence includes:
generating at least one random number according to the number of weight elements of the weight sequence;
traversing the weight elements in the weight sequence, and selecting the smallest weight element in the weight elements which are larger than or equal to the random number as a target weight;
and selecting the text word corresponding to the target weight as the mask word.
In some embodiments, when the target weights corresponding to at least two random numbers are the same, the selecting the text word corresponding to the target weight as the mask word includes:
selecting the target weight corresponding to the previous random number as a candidate weight based on the selection order of the random numbers;
taking the weight element that immediately follows the candidate weight in the weight sequence as the target weight corresponding to the next random number;
and respectively selecting the text words corresponding to the candidate weight and the target weight as the mask words.
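The mask-word selection described above resembles roulette-wheel sampling over the cumulative weight sequence, and a minimal sketch is given below. Several details are assumptions rather than statements of the source: the random numbers are drawn from the interval between 0 and the largest weight element, a collision that runs past the end of the sequence wraps around to the start, every occurrence of a mask word is replaced, and the names `select_mask_indices`, `apply_mask` and `MASK_TOKEN` are hypothetical.

```python
import bisect
import random

MASK_TOKEN = "[MASK]"  # hypothetical placeholder; the source does not name a mask token


def select_mask_indices(weight_sequence, num_masks, rng=None):
    """Pick positions in the weight sequence whose text words will be masked."""
    if rng is None:
        rng = random.Random()
    n = len(weight_sequence)
    num_masks = min(num_masks, n)  # cannot mask more distinct words than exist
    chosen = []
    for _ in range(num_masks):
        # Assumption: random numbers are drawn from [0, largest weight element].
        r = rng.uniform(0.0, weight_sequence[-1])
        # Smallest weight element >= r (the weight sequence is non-decreasing).
        idx = min(bisect.bisect_left(weight_sequence, r), n - 1)
        # Collision with an earlier random number: take the following weight element.
        # Assumption: wrap around at the end and skip positions already taken.
        while idx in chosen:
            idx = (idx + 1) % n
        chosen.append(idx)
    return chosen


def apply_mask(word_sequence, mask_words):
    """Replace every occurrence of a mask word with the mask token (an assumption)."""
    mask_set = set(mask_words)
    return [MASK_TOKEN if w in mask_set else w for w in word_sequence]
```

Words with a larger text weight occupy a wider interval of the cumulative sequence, so they are more likely to be chosen than under uniform random masking.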
In some embodiments, there are a plurality of first detection models, and the inputting the mask sequence into at least one first detection model for first detection processing to obtain a first detection score of the mask sequence, and obtaining a detection probability vector of the mask sequence based on at least one first detection score, includes:
vectorizing the mask sequence to obtain a mask vector;
the mask vector is respectively input into a plurality of first detection models to carry out first detection processing, and a first probability score and a second probability score output by each first detection model are obtained;
obtaining a first detection score of the mask sequence according to the first probability score and the second probability score;
and splicing the plurality of first detection scores to obtain the detection probability vector of the mask sequence.
In some embodiments, each word sequence has a plurality of mask sequences, and the sequentially inputting the detection probability vector corresponding to each word sequence into a second detection model for second detection processing, so as to obtain second detection scores consistent with the number of the word sequences, includes:
sequentially inputting the plurality of detection probability vectors corresponding to each word sequence into the second detection model for second detection processing, so as to obtain a third detection score of each detection probability vector;
accumulating the plurality of third detection scores to obtain a detection accumulated score;
and calculating the average value of the detection accumulated score to obtain the second detection score of the word sequence.
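The accumulate-then-average step above can be illustrated with a short sketch. The second detection model is represented here by an arbitrary callable `second_model` that maps one detection probability vector to a score; that interface, and the function name `second_detection_score`, are assumptions made for illustration and are not specified in the source.

```python
def second_detection_score(probability_vectors, second_model):
    """Average the third detection scores of all mask sequences of one word sequence."""
    if not probability_vectors:
        raise ValueError("at least one detection probability vector is required")
    # One third detection score per detection probability vector.
    third_scores = [second_model(vector) for vector in probability_vectors]
    # Detection accumulated score, then its mean over the number of vectors.
    accumulated = sum(third_scores)
    return accumulated / len(third_scores)


# Stand-in for a trained second detection model: the mean of the vector entries.
def mean_model(vector):
    return sum(vector) / len(vector)


score = second_detection_score([[0.25, 0.75], [0.5, 1.0]], mean_model)
```

Averaging over several mask sequences of the same word sequence reduces the influence of any single masking choice on the second detection score.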
In some embodiments, the obtaining at least one word sequence of the text under test includes:
word segmentation is carried out on the text to be detected to obtain a word sequence to be detected;
removing the stop words contained in the word sequence to be detected based on the stop word list to obtain a simplified word sequence;
and dividing the simplified word sequence to obtain at least one word sequence.
In some embodiments, the obtaining the detection result of the text to be detected based on the second detection score includes:
Obtaining a first adjustment parameter according to a preset first weight and a plurality of first detection scores;
obtaining a second adjustment parameter according to a preset second weight and a plurality of third detection scores;
adjusting a preset threshold according to the first adjustment parameter and the second adjustment parameter;
and obtaining a detection result according to the comparison result of the second detection score and the preset threshold value.
To achieve the above object, a second aspect of the embodiments of the present application provides a text information detection device, including:
the acquisition module is used for acquiring at least one word sequence of the text to be detected;
the mask processing module is used for generating a weight sequence of the word sequence based on the word weight of the text word in the word sequence, selecting mask words from the text word according to the weight sequence, and generating a mask sequence of the word sequence according to the mask words;
the first detection module is used for inputting the mask sequence into at least one first detection model to perform first detection processing to obtain a first detection score of the mask sequence, and obtaining a detection probability vector of the mask sequence based on at least one first detection score;
And the second detection module is used for sequentially inputting the detection probability vectors corresponding to each word sequence into a second detection model for second detection processing to obtain second detection scores consistent with the number of the word sequences, and obtaining a detection result of the text to be detected based on the second detection scores.
To achieve the above object, a third aspect of the embodiments of the present application provides an electronic device, where the electronic device includes a memory and a processor, where the memory stores a computer program, and the processor executes the computer program to implement the text information detection method described in the first aspect.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes a storage medium, which is a computer-readable storage medium storing a computer program that when executed by a processor implements the text information detection method described in the first aspect.
The text information detection method, device, equipment and storage medium provided by the embodiment of the application are characterized by obtaining at least one word sequence of a text to be detected, wherein the word sequence comprises at least one text word; then generating a weight sequence of the word sequence based on word weights of the text words in the word sequence, selecting mask words from the text words according to the weight sequence, and generating a mask sequence of the word sequence according to the mask words; inputting the mask sequence into at least one first detection model for first detection processing to obtain a first detection score of the mask sequence, and obtaining a detection probability vector of the mask sequence based on the at least one first detection score; and sequentially inputting the detection probability vectors corresponding to each word sequence into a second detection model to carry out second detection processing, obtaining second detection scores consistent with the number of the word sequences, and obtaining a detection result of the text to be detected based on the second detection scores. According to the embodiment of the application, aiming at text information detection, a weight sequence is generated by using word weights, then a mask sequence is generated according to the weight sequence, which is different from a mode of randomly generating masks, so that the randomness of generating masks can be improved, and then the mask sequence is sequentially input into at least one first detection model and at least one second detection model to obtain the detection result of the word sequence, so that the accuracy of text information detection is improved, and further, the language model is prevented from outputting text contents which do not meet social criteria, laws and regulations and moral standards.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
Fig. 1 is a flowchart of a text information detection method according to an embodiment of the present application.
Fig. 2 is a flowchart of step S101 in fig. 1.
Fig. 3 is a flowchart of step S102 in fig. 1.
Fig. 4 is a flowchart of step S303 in fig. 3.
Fig. 5 is a further flowchart of step S102 in fig. 1.
Fig. 6 is a flowchart of step S503 in fig. 5.
Fig. 7 is a flowchart of step S103 in fig. 1.
Fig. 8 is a flowchart of step S104 in fig. 1.
Fig. 9 is a further flowchart of step S104 in fig. 1.
Fig. 10 is a schematic diagram of an iterative process of a text information detection method according to another embodiment of the present application.
Fig. 11 is a schematic flow chart diagram of a text information detection method according to another embodiment of the present application.
Fig. 12 is a schematic hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
First, several terms used in this application are explained:
Artificial intelligence (AI) is a new technical science that studies and develops theories, methods, techniques and application systems for simulating, extending and expanding human intelligence. It is a branch of computer science that attempts to understand the nature of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence; research in this field includes robotics, language recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information processes of consciousness and thinking. It is also the theory, method, technique and application system that uses a digital computer or a digital-computer-controlled machine to simulate, extend and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results.
Machine learning is a branch of the field of artificial intelligence, which enables predictions and decisions on unknown data by letting a computer learn from the data and automatically extract rules and patterns. The goal of machine learning is to enable prediction or classification of new input data by training a model. The basic principle of machine learning is to analyze and process large amounts of data, find relationships and patterns between the data, and use these relationships and patterns for prediction of unknown data. The machine learning algorithm can perform tasks such as classification, regression, clustering, dimension reduction and the like according to the characteristics and the labels of the data.
A neural network is a type of machine learning model loosely patterned on the human brain, and it makes deep learning possible. The basic component of an artificial neural network is the perceptron, which performs simple signal processing; perceptrons are then connected into a large mesh network.
Currently, the industry and academia at home and abroad are widely researching large-scale text models, and various different architectures and large-scale text models aiming at different vertical fields are emerging. Large text models can generate high quality text, but cannot guarantee that the generated content meets social criteria, legal regulations, and ethical standards. Thus, with the widespread use of large text models, efficient compliance detection of the text content they generate is necessary.
In the related art, compliance detection for text content typically employs pattern matching and the formulation of relevant rules to identify illegal words and sensitive information in the text content. However, because text generated by a large-scale language model is of high quality and reaches the level of ordinary human writing, compliance detection based on pattern matching and rule formulation has low accuracy on such text, so text content that does not conform to social criteria, laws and regulations, and ethical standards may still be output by the large-scale language model.
Based on the above, the embodiment of the application provides a text information detection method, a device, equipment and a storage medium, which can improve the accuracy of text information detection. The text information detection method mainly comprises the steps of obtaining at least one word sequence of a text to be detected, wherein the word sequence comprises at least one text word; then generating a weight sequence of the word sequence based on word weights of the text words in the word sequence, selecting mask words from the text words according to the weight sequence, and generating a mask sequence of the word sequence according to the mask words; inputting the mask sequence into at least one first detection model for first detection processing to obtain a first detection score of the mask sequence, and obtaining a detection probability vector of the mask sequence based on the at least one first detection score; and sequentially inputting the detection probability vectors corresponding to each word sequence into a second detection model to carry out second detection processing, obtaining second detection scores consistent with the number of the word sequences, and obtaining a detection result of the text to be detected based on the second detection scores. 
According to the embodiment of the application, aiming at text information detection, a weight sequence is generated by using word weights, then a mask sequence is generated according to the weight sequence, which is different from a mode of randomly generating masks, so that the randomness of generating masks can be improved, and then the mask sequence is sequentially input into at least one first detection model and at least one second detection model to obtain the detection result of the word sequence, so that the accuracy of text information detection is improved, and further, the language model is prevented from outputting text contents which do not meet social criteria, laws and regulations and moral standards.
The embodiment of the application provides a text information detection method, a device, equipment and a storage medium, and specifically, the text information detection method in the embodiment of the application is described first through the following embodiment.
The embodiments of the application can acquire and process the related data based on artificial intelligence technology. Artificial intelligence is the theory, method, technology and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, sense the environment, obtain knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines so that the machines have the functions of sensing, reasoning and decision-making.
The text information detection method provided by the embodiments of the application relates to the technical field of artificial intelligence, in particular to the field of data detection. The method can be applied to a terminal, to a server, or to a computer program running in the terminal or the server. For example, the computer program may be a native program or a software module in an operating system; it may be a local application (APP), i.e. a program that needs to be installed in an operating system to run, such as a client that supports text translation; it may be an applet, i.e. a program that only needs to be downloaded into a browser environment to run; or it may be an applet that can be embedded in any APP. In general, the computer program may be any form of application, module or plug-in. The terminal communicates with the server through a network. The text information detection method may be performed by the terminal or the server alone, or by the terminal and the server in cooperation.
In some embodiments, the terminal may be a smart phone, a tablet, a notebook computer, a desktop computer, a smart watch, or the like. The terminal may also be an intelligent vehicle-mounted device, which provides relevant services by applying the text information detection method of this embodiment and thereby improves the driving experience. The server may be an independent server, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDNs), and big data and artificial intelligence platforms. The server may also be a service node in a blockchain system, where the service nodes form a peer-to-peer (P2P) network; the P2P protocol is an application layer protocol that runs on top of the Transmission Control Protocol (TCP). The server side of the text translation system may be installed on the server and may interact with the terminal; for example, corresponding software may be installed on the server, and the software may be an application that implements the text information detection method, but is not limited to this form. The terminal and the server may be connected through Bluetooth, Universal Serial Bus (USB), a network, or another communication connection, which is not limited herein.
The subject application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network personal computers (Personal Computer, PCs), minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
First, a text information detection method in an embodiment of the present application is described. In the present embodiment, the text information detecting method is applicable to a text information detecting apparatus. Referring to fig. 1, an optional flowchart of a text information detection method according to an embodiment of the present application is provided, where the method in fig. 1 may include, but is not limited to, steps S101 to S104. It should be understood that the order of steps S101 to S104 in fig. 1 is not particularly limited, and the order of steps may be adjusted, or some steps may be reduced or added according to actual requirements.
Step S101: at least one word sequence of the text to be tested is obtained.
In some embodiments, the text to be tested includes text words. The text to be tested refers to text data to be processed or analyzed, and can be one or more sections of characters, a document, an article, a mail and the like. In this embodiment, the source of the text to be tested is not limited, and the text can be manually input, or generated by some text generation models based on machine learning models, or extracted from a text database through computer equipment, or crawled from a network through computer equipment, or the like.
In some embodiments, after the text detection device obtains the text to be tested, a corresponding preprocessing operation is performed on the text to be tested so that an overly long text is not processed all at once. The following describes the process of preprocessing the text to be tested according to the embodiment of the present application.
Thus, referring to fig. 2, the process of preprocessing the text to be tested to obtain at least one word sequence includes steps S201 to S203.
Step S201: performing word segmentation on the text to be tested to obtain a word sequence to be tested.
In some embodiments, after the text to be tested is obtained, word segmentation is performed on it to obtain a word sequence to be tested, which comprises all text words in the text to be tested. Word segmentation refers to the process of segmenting a continuous text into text words with independent meanings. This embodiment does not limit the tool used for word segmentation: for example, Chinese word segmentation tools such as "jieba" and "pkuseg" may be used, and English tokenization tools such as "NLTK" and "spaCy" may also be used.
Step S202: and removing the stop words contained in the word sequence to be detected based on the stop word list to obtain a simplified word sequence.
In some embodiments, after obtaining the word sequence to be tested, traversing text words in the word sequence to be tested, and then removing stop words contained in the word sequence to be tested to form a simplified word sequence with the stop words removed, wherein the simplified word sequence is expressed as X= [ w ] 1 ,w 2 ,...,w n ]Where n represents the number of text words. Thereby removing redundant information of the word sequence to be detected, and reserving effective text word information so as to improve the working efficiency and the accuracy of text detection. It will be appreciated that the stop word refers to an unintended word or other actual word in the source language sentence Words with smaller effects, such as prepositions, conjunctions, pronouns and the like, the stop word list in the embodiment of the application can be obtained according to a preset stop word library or a public network.
Step S203: the simplified word sequence is partitioned to obtain at least one word sequence.
In some embodiments, after the simplified word sequence is obtained, it is divided into at least one word sequence, each comprising at least one text word. This avoids the problems that arise when the text detection device processes an overly long word sequence at once, such as excessively long model training time, vanishing gradients during model training, and slow updating of the relevant model parameters, and thereby improves the efficiency and accuracy of text detection.
In some embodiments, the text detection device splits the simplified word sequence from front to back into a plurality of word sequences such that the length of each does not exceed $L$, where $L$ is a hyperparameter with a value interval of [200, 500]. For a simplified word sequence $X = [w_1, w_2, \ldots, w_n]$, when its length is less than or equal to $L$, i.e. $n \le L$, the word sequence is the simplified word sequence itself; when its length is greater than $L$, i.e. $n > L$, $X$ is split into multiple word sequences $X_1 = [w_1, w_2, \ldots, w_L], \ldots, X_k = [w_{(k-1)L+1}, w_{(k-1)L+2}, \ldots, w_n]$, where $k = \lceil n/L \rceil$ is the total number of word sequences and $\lceil \cdot \rceil$ denotes rounding up.
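The stop-word removal and front-to-back division described in steps S202 and S203 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `preprocess`, the parameter names, and the toy stop word set are assumptions, and the input is assumed to have already been segmented into words.

```python
from math import ceil

def preprocess(tokens, stop_words, max_len):
    """Drop stop words, then cut the remainder into chunks of at most max_len words.

    `tokens` is an already-segmented word list; `max_len` plays the role of L.
    """
    reduced = [w for w in tokens if w not in stop_words]
    # k = ceil(n / L) chunks, taken front to back; only the last may be shorter.
    k = ceil(len(reduced) / max_len)
    return [reduced[i * max_len:(i + 1) * max_len] for i in range(k)]

# Toy usage with a hypothetical stop word set and a tiny L for readability.
chunks = preprocess(
    ["the", "cat", "sat", "on", "the", "mat", "today"],
    stop_words={"the", "on"},
    max_len=3,
)
print(chunks)  # [['cat', 'sat', 'mat'], ['today']]
```

Note that an empty input yields an empty list here, whereas the text speaks of at least one word sequence; how an empty text should be handled is left open.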
Step S102: generating a weight sequence of the word sequence based on word weights of the text words in the word sequence, selecting mask words from the text words according to the weight sequence, and generating a mask sequence of the word sequence according to the mask words.
In some embodiments, after at least one word sequence is acquired, the text information detection device performs $c$ masking operations on each word sequence to generate $c$ mask sequences corresponding to that word sequence. Masking lets the device interpret the text to be tested from the text words preceding and following each masked word, so that the meaning of any single text word does not unduly influence the judgment of the text as a whole, which improves the accuracy and reliability of text information detection. It is understood that $c$ is a hyperparameter with $c \ge 1$. A suitably chosen $c$ effectively reduces errors in the detection result caused by the randomness of sampling, but $c$ should not be too large, because an overly large $c$ lowers the processing efficiency of text information detection. In each masking operation, the text detection device replaces $a\%$ of the text words in the selected word sequence with meaningless symbols, where $a$ is a hyperparameter with a value interval of [10, 20].
If a text word occurs too frequently in a text sequence, it is easily selected repeatedly as the target mask word, which reduces the accuracy of text information detection. A suitable weight sequence therefore needs to be generated so that suitable target mask words can be selected.
The following describes a process of generating a weight sequence corresponding to a word sequence in the text information detection method provided in the embodiment of the present application.
Thus, referring to fig. 3, generating the weight sequence of the word sequence based on the word weights of the text words in the word sequence includes steps S301 to S303.
Step S301: based on the word frequency of each text word in the word sequence, the text weight of each text word is obtained.
In some embodiments, in each masking operation, the text weight of each text word is obtained from its word frequency in the word sequence as $e_i = \mathrm{count}(w_i)/L$, where $\mathrm{count}(w_i)$ is the total number of occurrences of text word $w_i$ in word sequence $X_j$.
Step S302: select the distinct text words and form a text sequence according to the order of the text words.
In some embodiments, the text information detecting means generates a weight sequence corresponding to each word sequence for each word sequence obtained after the division. In the process of generating the weight sequence, the text information detection device firstly obtains the text weight of each text word based on the word frequency of each text word in the word sequence, then selects text words with different word sequences, and forms the text sequence according to the sequence of the text words in the word sequence so as to facilitate the subsequent generation of the weight sequence corresponding to the word sequence.
In some embodiments, after obtaining the text weight of each text word, the text information detection device then selects the text word with different word sequences to form the text sequence according to the word frequency of each text word in the word sequence from high to low, so as to facilitate the subsequent generation of the weight sequence corresponding to the word sequence.
Step S303: a weight element is generated for each text word in the weight sequence based on the text word order.
In some embodiments, after generating the text sequence, the text information detection device generates the weight element of each text word in the weight sequence in turn, following the order of the text words in the text sequence, and thereby obtains the weight sequence used by the subsequent mask sequence generation. If a text word occurs too frequently in a text sequence, it is easily selected repeatedly as the target mask word, which reduces the accuracy of text information detection; a weight sequence that makes the selection of the target mask word more random is therefore needed for more accurate text information detection.
The following describes a process of selecting mask words in a word sequence in the text information detection method provided in the embodiment of the present application.
Thus, referring to fig. 4, a weight element for each text word in the weight sequence is generated based on the text word order, including steps S401 to S402.
Step S401: generate the weight element of the first text word according to the text weight of the first text word.
Step S402: based on the text word order, add the weight element of the previous text word to the text weight of the current text word to obtain the weight element of the current text word.
In some embodiments, after generating the text sequence, the text information detection device generates the weight element of the first text word according to the text weight of the first text word in the text sequence. Then, following the text word order and starting from the second text word, it adds the weight element of the previous text word to the text weight of the current text word to obtain the weight element of the current text word. The result of this weight accumulation is a weight sequence that makes the selection of the target mask word more random, which facilitates more accurate text information detection. It also avoids the loss of detection accuracy that results when a text word occurring too frequently in the text sequence is repeatedly selected as the target mask word.
In some embodiments, the cumulative weight is computed as follows: starting from the first non-repeated text word of word sequence $X_j$, the sum of the frequency of each text word and the frequencies of all preceding text words is taken as the cumulative weight of that text word, which yields the weight sequence $Sum_e = [e_{(j-1)L+1},\ e_{(j-1)L+1} + e_{(j-1)L+2},\ \ldots,\ 1]$, where the number of weight elements of the weight sequence is $l$. It will be appreciated that, because the last weight element accumulates the frequencies of all text words in the text sequence, its value is 1.
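Steps S301 to S303 and S401 to S402 amount to building a cumulative frequency table over the distinct words. The sketch below is one way to do this, under stated assumptions: the function name `weight_sequence` is hypothetical, distinct words are ordered by first appearance, and frequencies are divided by the actual length of the word sequence so that the last element is exactly 1 even for a final, shorter word sequence.

```python
from collections import Counter
from itertools import accumulate

def weight_sequence(word_sequence):
    """Return (distinct words in first-appearance order, cumulative weights)."""
    counts = Counter(word_sequence)      # insertion-ordered, like a dict
    words = list(counts)
    total = len(word_sequence)
    # Accumulate integer counts first and divide afterwards, so the final
    # weight element is exactly total / total == 1.0 with no float drift.
    cumulative = [c / total for c in accumulate(counts.values())]
    return words, cumulative

words, sum_e = weight_sequence(list("abac"))
print(words)  # ['a', 'b', 'c']
print(sum_e)  # [0.5, 0.75, 1.0]
```

Accumulating counts rather than already-divided frequencies is a design choice of this sketch; it gives the same sequence as summing the text weights one by one, up to rounding.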
In some embodiments, after the weight sequence is obtained, a mask sequence corresponding to the word sequence needs to be generated. To make the selection of the target mask word more random, the mask words need to be selected in a suitable manner.
Therefore, referring to fig. 5, a text word corresponding to the target weight is selected as a mask word, including steps S501 to S503.
Step S501: at least one random number is generated based on the number of weight elements of the weight sequence.
In some embodiments, after obtaining the weight sequence, the text information detection device generates $\lfloor l \times a\% \rfloor$ random numbers according to the number $l$ of weight elements of the weight sequence, each random number lying in the range [0, 1], where $\lfloor \cdot \rfloor$ denotes rounding down. Each random number is used to determine a target weight from the weight sequence, so that the text word corresponding to that weight element can subsequently be selected as a mask word.
Step S502: traversing weight elements in the weight sequence, and selecting the smallest weight element in the weight elements which are larger than or equal to the random number as the target weight.
Step S503: select the text word corresponding to the target weight as a mask word.
In some embodiments, after obtaining at least one random number, the text information detection device traverses all weight elements in the weight sequence for each random number, then selects a smallest weight element in the weight elements greater than or equal to the random number as a target weight, and selects a text word corresponding to the target weight as a mask word, so that randomness of the selected target mask word is higher, a mask operation is convenient to follow-up, a mask sequence corresponding to the word sequence is generated, randomness of the generated mask sequence is higher, and accuracy and reliability of text information detection are further improved.
In some embodiments, when the number of random numbers is greater than one, in the process of generating the mask sequence at a time, selecting the target weights corresponding to the random numbers one by one, and finally obtaining the target weights corresponding to each random number, where it can be understood that the number of target weights corresponds to the number of random numbers; then, a plurality of mask words are selected according to the plurality of target weights and masking operation is carried out, so that a mask sequence is generated.
In some embodiments, for each random number, the text information detection device traverses the weight sequence $Sum_e$ from left to right, determines the first position whose weight element is not smaller than the random number, and determines the text word of word sequence $X_j$ corresponding to that position as a mask word.
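Because the weight sequence is non-decreasing, the left-to-right traversal of steps S502 and S503 can also be written as a binary search. The following sketch assumes the weight sequence and its distinct words are given as two parallel lists; the function name `select_mask_word` is hypothetical, and the random number is passed in explicitly so that the behaviour is reproducible.

```python
import random
from bisect import bisect_left

def select_mask_word(words, cumulative, r):
    """Pick the word whose weight element is the smallest one >= r."""
    # bisect_left returns the first index i with cumulative[i] >= r.
    idx = bisect_left(cumulative, r)
    return words[idx]

words = ["m", "d", "h", "f", "k", "g", "n"]
cumulative = [0.25, 0.35, 0.55, 0.65, 0.8, 0.9, 1.0]

print(select_mask_word(words, cumulative, 0.64))  # f
print(select_mask_word(words, cumulative, 0.25))  # m

# In actual use the random number would be drawn, for example:
word = select_mask_word(words, cumulative, random.random())
```

Since the last weight element is 1 and `random.random()` returns a value below 1, the returned index is always valid. A word with a larger text weight covers a wider interval of the weight sequence and is therefore proportionally more likely to be selected.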
In some embodiments, when the target weights corresponding to at least two random numbers are the same, in order to make the randomness of selecting the target mask words higher, thereby improving the accuracy and reliability of text information detection, the operation of selecting the target mask words needs to be further refined.
Therefore, referring to fig. 6, selecting the text word corresponding to the target weight as a mask word further includes steps S601 to S603.
Step S601: based on the selection order of the random numbers, select the target weight corresponding to the previous random number as a candidate weight.
Step S602: take the weight element following the candidate weight as the target weight corresponding to the next random number.
Step S603: select the text words corresponding to the candidate weight and the target weight, respectively, as mask words.
In some embodiments, when the target weights corresponding to at least two random numbers are the same, the text information detection device proceeds as follows, based on the selection order of the random numbers, when selecting the target weight for the later random number: it takes the target weight corresponding to the previous random number as a candidate weight, and then takes the weight element following the candidate weight as the target weight corresponding to the later random number, so that the final total number of mask words equals the number of random numbers generated. The selection of target mask words is thereby more random, which improves the accuracy and reliability of text information detection.
In some embodiments, each time a mask sequence is generated for the word sequence, after determining the mask words the text information detection device replaces all mask words in the word sequence with "[MASK]", thereby generating the mask sequence. It will be appreciated that $c$ mask sequences corresponding to the word sequence may be obtained by performing the mask sequence generation operation $c$ times on the word sequence.
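The collision handling of steps S601 to S603 and the replacement with "[MASK]" can be sketched as below. The function names are hypothetical. The text describes a single step to the following weight element; repeating that step when the following element is also taken, and wrapping around at the end of the weight sequence, are assumptions added here so that the sketch always terminates with distinct mask words.

```python
from bisect import bisect_left

MASK = "[MASK]"

def choose_mask_words(words, cumulative, randoms):
    """Map each random number to a distinct mask word."""
    if len(randoms) > len(words):
        raise ValueError("more random numbers than distinct words")
    chosen = []
    for r in randoms:
        idx = bisect_left(cumulative, r)  # smallest weight element >= r
        # Collision: move to the following weight element. Repeating the move
        # and wrapping at the end are assumptions, not stated in the text.
        while words[idx] in chosen:
            idx = (idx + 1) % len(words)
        chosen.append(words[idx])
    return chosen

def apply_mask(word_sequence, mask_words):
    """Replace every occurrence of each mask word with the mask symbol."""
    mask_set = set(mask_words)
    return [MASK if w in mask_set else w for w in word_sequence]

words, cumulative = ["a", "b", "c"], [0.5, 0.75, 1.0]
picked = choose_mask_words(words, cumulative, [0.1, 0.2])
print(picked)                                # ['a', 'b']
print(apply_mask(list("abac"), picked[:1]))  # ['[MASK]', 'b', '[MASK]', 'c']
```

Calling `choose_mask_words` and `apply_mask` $c$ times with fresh random numbers would give the $c$ mask sequences of one word sequence.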
In some embodiments, assume that a certain word sequence $X_j$ is:
$X_j = [m, d, h, f, d, k, h, k, g, g, h, m, m, n, n, k, f, h, m, m]$, where the word sequence has a length of 20. The text sequence generated according to the order of the text words in $X_j$, together with the corresponding text weights, is $[e_m = 5/20 = 0.25,\ e_d = 2/20 = 0.1,\ e_h = 4/20 = 0.2,\ e_f = 2/20 = 0.1,\ e_k = 3/20 = 0.15,\ e_g = 2/20 = 0.1,\ e_n = 2/20 = 0.1]$, where the text sequence has a length of $l = 7$.
Weight accumulation over the text sequence $[e_m = 0.25,\ e_d = 0.1,\ e_h = 0.2,\ e_f = 0.1,\ e_k = 0.15,\ e_g = 0.1,\ e_n = 0.1]$ gives the following weight sequence:
$[e_m = 0.25,\ e_d + e_m = 0.35,\ e_h + e_d + e_m = 0.55,\ e_f + e_h + e_d + e_m = 0.65,\ e_k + e_f + e_h + e_d + e_m = 0.8,\ e_g + e_k + e_f + e_h + e_d + e_m = 0.9,\ e_n + e_g + e_k + e_f + e_h + e_d + e_m = 1]$. Next, the hyperparameter $a = 15$ is selected, so the number of random numbers to be selected is $\lfloor 7 \times 15\% \rfloor = 1$, i.e. one random number is required. Assuming the selected random number is 0.64, the weight sequence is traversed and the smallest weight element greater than or equal to 0.64 is determined to be the fourth weight element, 0.65, whose corresponding text word is "f". Masking "f" in the word sequence yields the mask sequence [m, d, h, MASK, d, k, h, k, g, g, h, m, m, n, n, k, MASK, h, m, m].
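The worked example can be checked mechanically with a few lines of Python. The script below is only a verification sketch: the variable names are hypothetical, and the count of random numbers is assumed to be the number of distinct words times $a\%$, rounded down, which is consistent with the single random number used in the example.

```python
from bisect import bisect_left
from collections import Counter
from itertools import accumulate

x_j = list("mdhfdkhkgghmmnnkfhmm")   # the 20-word sequence of the example
counts = Counter(x_j)                 # distinct words in first-appearance order
text_sequence = list(counts)
cumulative = [c / len(x_j) for c in accumulate(counts.values())]

a = 15
# Assumed rule: floor(l * a%) random numbers, computed in integer arithmetic.
n_random = len(text_sequence) * a // 100

r = 0.64                              # the random number assumed in the example
mask_word = text_sequence[bisect_left(cumulative, r)]
mask_sequence = ["MASK" if w == mask_word else w for w in x_j]

print(text_sequence)  # ['m', 'd', 'h', 'f', 'k', 'g', 'n']
print(n_random)       # 1
print(mask_word)      # f
```

The resulting `mask_sequence` has "MASK" at the two positions where "f" occurs, matching the mask sequence given in the example.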
Step S103: input the mask sequence into at least one first detection model for first detection processing to obtain a first detection score of the mask sequence, and obtain a detection probability vector of the mask sequence based on the at least one first detection score.
In some embodiments, after obtaining at least one mask sequence corresponding to the word sequence, the text information detection device inputs each mask sequence, one by one, into at least one first detection model for first detection processing to obtain a first detection score of each mask sequence, and obtains a detection probability vector of the mask sequence based on the at least one first detection score. The detection probability vector can then be input into a second detection model for detection, which improves the accuracy and reliability of text information detection. This embodiment does not unduly limit the first detection model: any model that accepts text input and outputs a score may be used. Thus, the first detection model may be a base model. It is understood that a base model is a model used as a foundation in machine learning, generally one that performs well on a particular task and can serve as a reference or starting point for comparison with other models. The base model may be a simple model, such as linear regression or logistic regression, or a more complex model, such as a support vector machine, decision tree or random forest. The choice of base model depends on the requirements of the particular task and the characteristics of the data. The architecture of the first detection model may be, but is not limited to, a recurrent neural network model, a convolutional neural network model, BERT, GPT, and the like.
In some embodiments, in order to make the result of text information detection more accurate, the number of first detection models needs to be increased.
Thus, referring to fig. 7, the first detection process is performed by inputting the mask sequence into at least one first detection model to obtain first detection scores of the mask sequence, and detection probability vectors of the mask sequence are obtained based on the at least one first detection scores, including steps S701 to S704.
Step S701: vectorize the mask sequence to obtain a mask vector.
In some embodiments, when the number of first detection models is $b > 1$, the text information detection device first needs to vectorize the mask sequence to obtain a mask vector that all the first detection models can subsequently process. Because the base models are not fixed, the vectorization flow cannot be described uniformly. For example, for a convolutional neural network model, the word sequence needs to be converted into a two-dimensional vector by the word vector model Word2Vec; for the BERT model, only the text needs to be input, because the model has its own embedding layer.
Step S702: input the mask vector into each of a plurality of first detection models for first detection processing to obtain the first probability score and the second probability score output by each first detection model.
Step S703: obtain the first detection score of the mask sequence according to the first probability score and the second probability score.
In some embodiments, after obtaining the mask vector corresponding to the mask sequence, the text information detection device inputs the mask vector into each of the plurality of first detection models for first detection processing to obtain the first probability score p1 and the second probability score p2 output by each first detection model, and obtains the first detection score [p1, p2] of the mask sequence based on p1 and p2. It will be appreciated that the first probability score represents the probability that the mask sequence is compliant, and the second probability score represents the probability that the mask sequence is not compliant.
Step S704: concatenate the plurality of first detection scores to obtain the detection probability vector of the mask sequence.
In some embodiments, the text information detection device inputs each mask sequence of each word sequence into the plurality of first detection models in turn, thereby obtaining the first detection score [p1, p2] output by each first detection model; it then concatenates the plurality of first detection scores to obtain the detection probability vector of each mask sequence of each word sequence, which is subsequently input into the second detection model so that the text information can be detected accurately. It will be appreciated that the number of elements of the detection probability vector is twice the number of first detection models, i.e. $2b$.
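The concatenation in step S704 can be illustrated with stand-in models. In the sketch below each first detection model is represented by a plain Python callable returning the pair (p1, p2); the two lambda "models" and their fixed scores are purely hypothetical placeholders for trained base models, and the function name is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

# A first detection model is assumed to map a mask sequence to (p1, p2),
# i.e. (probability of compliance, probability of non-compliance).
FirstModel = Callable[[Sequence[str]], Tuple[float, float]]

def detection_probability_vector(
    mask_sequence: Sequence[str], first_models: Sequence[FirstModel]
) -> List[float]:
    """Concatenate the [p1, p2] pairs of all b first detection models."""
    vector: List[float] = []
    for model in first_models:
        p1, p2 = model(mask_sequence)
        vector.extend([p1, p2])
    return vector  # 2 * b elements

# Hypothetical stand-ins for two trained base models (b = 2).
stub_models = [lambda seq: (0.9, 0.1), lambda seq: (0.3, 0.7)]

vec = detection_probability_vector(["m", "[MASK]", "h"], stub_models)
print(vec)       # [0.9, 0.1, 0.3, 0.7]
print(len(vec))  # 4
```

Keeping both p1 and p2 from every base model, rather than a single label, passes each model's confidence on to the second detection model.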
In some embodiments, the text information detection device inputs the $c$ mask sequences of each word sequence into the plurality of first detection models, thereby obtaining $c$ detection probability vectors for the word sequence, one for each mask sequence.
Step S104: sequentially input the detection probability vectors corresponding to each word sequence into a second detection model for second detection processing to obtain second detection scores equal in number to the word sequences, and obtain a detection result of the text to be tested based on the second detection scores.
In some embodiments, after obtaining the detection probability vectors corresponding to each word sequence, the text information detection device sequentially inputs them into the second detection model for second detection processing, obtains second detection scores equal in number to the word sequences, and obtains the detection result of the text to be tested based on the second detection scores. This yields a more accurate detection result for the text to be tested and improves the user experience. This embodiment does not limit the second detection model: any model that accepts a numerical vector as input and outputs a score may be used. Thus, the second detection model may be a meta model, which is understood to be a model used in machine learning for modeling, analyzing and optimizing other models. A meta model can be seen as a model of models: it provides more accurate or more interpretable predictions by aggregating, combining, or further processing the results of other models. The architecture of the second detection model may be a conventional machine learning model, such as a decision tree, boosted trees, a random forest, or a support vector machine.
In some embodiments, to increase the accuracy of text information detection, the number of masking sequences per word sequence needs to be increased.
Therefore, referring to fig. 8, the detection probability vectors corresponding to each word sequence are sequentially input into the second detection model to perform the second detection process, and the second detection scores corresponding to the number of word sequences are obtained, which includes steps S801 to S803.
Step S801: sequentially input the plurality of detection probability vectors corresponding to each word sequence into the second detection model for second detection processing to obtain a third detection score for each detection probability vector.
In some embodiments, when the number of mask sequences corresponding to each word sequence is greater than 1, i.e. $c > 1$, a plurality of detection probability vectors corresponding to each word sequence can be obtained through steps S701 to S704 described above. The text information detection device then sequentially inputs the plurality of detection probability vectors corresponding to each word sequence into the second detection model to obtain a plurality of third detection scores corresponding to each word sequence, so that the detection result of the text information can subsequently be determined from the third detection scores, which improves the accuracy and reliability of text information detection.
Step S802: accumulate the plurality of third detection scores to obtain a detection accumulated score.
Step S803: average the detection accumulated score to obtain the second detection score of the word sequence.
In some embodiments, after the plurality of third detection scores of each word sequence are obtained, the text information detection device accumulates them to obtain the detection accumulated score corresponding to each word sequence, and then averages the detection accumulated score over the number of third detection scores to obtain the second detection score corresponding to each word sequence. The text to be tested is then detected according to the second detection score of each word sequence, which improves the accuracy and reliability of text information detection.
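Steps S801 to S803 reduce the $c$ detection probability vectors of one word sequence to a single second detection score. The sketch below shows that reduction with a stand-in second detection model; the function name and the stub meta model (which simply averages the compliance probabilities at the even positions) are hypothetical and only serve to make the example runnable.

```python
from statistics import fmean

def second_detection_score(probability_vectors, second_model):
    """Score each detection probability vector, then average the c scores."""
    third_scores = [second_model(v) for v in probability_vectors]
    # Accumulating and dividing by c is the arithmetic mean of the scores.
    return fmean(third_scores), third_scores

def stub_meta_model(vector):
    """Hypothetical stand-in: mean of the p1 entries (even positions)."""
    p1_values = vector[0::2]
    return sum(p1_values) / len(p1_values)

vectors = [
    [0.75, 0.25, 0.25, 0.75],  # mask sequence 1, b = 2 first models
    [1.0, 0.0, 0.5, 0.5],      # mask sequence 2
]
score, thirds = second_detection_score(vectors, stub_meta_model)
print(thirds)  # [0.5, 0.75]
print(score)   # 0.625
```

Averaging over several independently masked copies of the same word sequence is what dampens the effect of any single random choice of mask words.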
In some embodiments, after the second detection score for each word sequence is obtained, a further threshold detection for each word sequence is required in order to improve the accuracy and reliability of text information detection.
Therefore, referring to fig. 9, a detection result of the text to be detected is obtained based on the second detection score, including steps S901 to S904.
Step S901: obtain a first adjustment parameter according to a preset first weight and the plurality of first detection scores.
Step S902: obtain a second adjustment parameter according to a preset second weight and the plurality of third detection scores.
Step S903: adjust a preset threshold according to the first adjustment parameter and the second adjustment parameter.
Step S904: obtain the detection result according to the result of comparing the second detection score with the preset threshold.
In some embodiments, after obtaining the second detection score of each word sequence, the text information detection device obtains a first adjustment parameter $\alpha_1$ according to a preset first weight $[\beta_1, \beta_2]$ and the plurality of first detection scores [p1, p2] of each mask sequence of each word sequence, and obtains a second adjustment parameter $\alpha_2$ according to a preset second weight $[\beta_3, \beta_4]$ and the plurality of third detection scores of each word sequence. Next, the preset threshold $\epsilon$ is adjusted from its initial preset value $\epsilon_0$ according to the first adjustment parameter $\alpha_1$ and the second adjustment parameter $\alpha_2$. The second detection score of each word sequence is then compared with the preset threshold $\epsilon$: if the second detection score is greater than $\epsilon$, the detection result of the word sequence is judged to be "compliance"; otherwise it is judged to be "non-compliance". This gives a more precise text information detection process and improves the accuracy and reliability of text information detection. In the present embodiment, the preset parameters $[\epsilon_0, \beta_1, \beta_2, \beta_3, \beta_4]$ are not limited and may be adjusted according to actual needs.
In some embodiments, the first weight $[\beta_1, \beta_2]$ and the second weight $[\beta_3, \beta_4]$ can be set to 0, in which case the preset threshold $\epsilon$ equals the initial preset threshold $\epsilon_0$. The condition for judging whether each word sequence is "compliant" is then whether its second detection score is greater than the initial preset threshold $\epsilon_0$: if so, the detection result of the word sequence is judged to be "compliance"; otherwise it is judged to be "non-compliance".
In some embodiments, for all word sequences $\{X_1, X_2, \ldots, X_k\}$ of the simplified word sequence $X$ of the text to be tested, the second detection scores are compared with the preset threshold to obtain the detection results of all word sequences. If the detection results of all word sequences are "compliance", the detection result of the text to be tested is judged to be "compliance"; if the detection result of at least one word sequence is "non-compliance", the detection result of the text to be tested is judged to be "non-compliance". An accurate and reliable detection result of the text to be tested is thereby obtained.
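The final judgment can be sketched as a threshold comparison per word sequence followed by an all-or-nothing aggregation. The sketch covers only the case in which the threshold is used as given, for example the initial preset threshold when both weights are set to 0; it does not attempt the threshold adjustment of steps S901 to S903. The function name and the example scores are hypothetical.

```python
def judge_text(second_scores, threshold):
    """Return (overall result, per-word-sequence results)."""
    results = [
        "compliance" if score > threshold else "non-compliance"
        for score in second_scores
    ]
    # The text is compliant only if every word sequence is compliant.
    overall = (
        "compliance"
        if all(r == "compliance" for r in results)
        else "non-compliance"
    )
    return overall, results

overall, per_sequence = judge_text([0.9, 0.4, 0.8], threshold=0.5)
print(per_sequence)  # ['compliance', 'non-compliance', 'compliance']
print(overall)       # non-compliance
```

With this rule a single non-compliant word sequence is enough to mark the whole text as non-compliant. Note that an empty list of scores would be reported as compliant by `all`, a corner case the text does not address.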
Referring to fig. 10, a schematic flow chart diagram of a text information detection method provided in an embodiment of the present application includes the following steps:
1) After the text information detection device acquires the text to be detected, inputting the text to be detected into a preprocessing module for word segmentation, simplification and division operation to obtain a plurality of word sequences, and sequentially performing the following steps 2) to 8) on each word sequence to obtain a second detection score of each word sequence;
2) Performing the word sequence a plurality of times step 3) below to obtain a plurality of mask sequences for the word sequence;
3) Inputting the word sequence into a mask processing module to generate a weight sequence, and selecting mask words according to the weight sequence and the generated random number, so that the word sequence is masked according to the mask words to obtain a mask sequence;
4) Sequentially performing the following steps 5) to 7) on the plurality of mask sequences of the word sequence to obtain a third detection score of each mask sequence of the word sequence;
5) Sequentially inputting the mask sequence into a plurality of first detection models to perform first detection so as to obtain a plurality of first detection scores of the mask sequence;
6) Splicing a plurality of first detection scores of the mask sequence to obtain a detection probability vector of the mask sequence;
7) Inputting the detection probability vector of the mask sequence into a second detection model to obtain a third detection score of the mask sequence;
8) Averaging the third detection scores of the plurality of mask sequences of the word sequence to obtain the second detection score of the word sequence;
9) And inputting the second detection scores of the word sequences into a threshold judgment module to judge the second detection scores with a preset threshold, obtaining a detection result of the text to be detected and outputting the detection result.
According to the technical scheme provided by the embodiment of the application, the word segmentation, simplification and division operation are carried out on the text to be detected, so that a plurality of word sequences are obtained; then inputting each word sequence for multiple times into a mask processing module to generate a weight sequence, and selecting mask words according to the weight sequence and the generated random numbers, so that the word sequences are masked according to the mask words to obtain a plurality of mask sequences of the word sequences; sequentially inputting a plurality of mask sequences of each word sequence into a plurality of first detection models to perform first detection so as to obtain a plurality of first detection scores of each mask sequence, splicing the plurality of first detection scores of each mask sequence to obtain a detection probability vector of each mask sequence, sequentially inputting the probability detection vector of each mask sequence of each word sequence into a second detection model to perform second detection so as to obtain a third detection score of each mask sequence, and equally dividing the third detection scores of the plurality of mask sequences of each word sequence so as to obtain a second detection score of each word sequence; and finally, comparing the second detection score based on each word sequence with a preset threshold value, and outputting a detection result of the text to be detected according to the comparison result.
In the embodiments of the present application, the weight sequence of the word sequence is generated first, and mask words are then selected by combining the weight sequence with random numbers, so that mask sequences are generated in a targeted manner. Unlike selecting mask words purely at random, this avoids the situation in which a text word that occurs too frequently in the text sequence is repeatedly selected as the target mask word, which would reduce the accuracy of text information detection. It therefore improves the randomness of mask word selection during text information detection and, in turn, the accuracy and reliability of the detection. First detection is then performed on the mask sequences using a plurality of first detection models to obtain detection probability vectors, and the detection probability vectors are input into the second detection model for second detection to obtain the second detection score of the word sequence. Passing the detection information through multiple detection models can effectively improve the accuracy and reliability of text information detection.
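The two-stage detection described above resembles a stacked arrangement of models and can be sketched as follows. This is an illustrative sketch only: the detection models are represented by plain callables, each first detection model is simplified to return a single score, and the toy models in the usage lines are placeholders, not the models of the embodiment.

```python
from typing import Callable, List, Sequence

# Hypothetical type aliases: a first detection model maps a mask sequence to a
# score; the second detection model maps a detection probability vector to a score.
FirstModel = Callable[[Sequence[str]], float]
SecondModel = Callable[[Sequence[float]], float]


def detection_probability_vector(
    mask_seq: Sequence[str], first_models: Sequence[FirstModel]
) -> List[float]:
    """Splice the first detection scores of all first models into one vector."""
    return [model(mask_seq) for model in first_models]


def third_detection_score(
    mask_seq: Sequence[str],
    first_models: Sequence[FirstModel],
    second_model: SecondModel,
) -> float:
    """Run first detection, then feed the spliced vector to the second model."""
    vector = detection_probability_vector(mask_seq, first_models)
    return second_model(vector)


# Toy stand-ins used only to exercise the data flow.
toy_first_models = [
    lambda seq: seq.count("[MASK]") / len(seq),
    lambda seq: 1.0 if "bad" in seq else 0.0,
]
toy_second_model = lambda vec: sum(vec) / len(vec)
score = third_detection_score(
    ["[MASK]", "bad", "word", "here"], toy_first_models, toy_second_model
)
```

The length of the detection probability vector equals the number of first detection models, so the second detection model must accept an input of that size.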
The embodiment of the present application further provides a text information detection device, which can implement the above text information detection method, and referring to fig. 11, the device 1100 includes:
The obtaining module 1110 is configured to obtain at least one word sequence of the text to be detected.
The mask processing module 1120 is configured to generate a weight sequence of the word sequence based on word weights of the text words in the word sequence, select mask words from the text words according to the weight sequence, and generate a mask sequence of the word sequence according to the mask words.
The first detection module 1130 is configured to input the mask sequence into at least one first detection model for performing a first detection process, obtain a first detection score of the mask sequence, and obtain a detection probability vector of the mask sequence based on the at least one first detection score.
The second detection module 1140 is configured to sequentially input the detection probability vectors corresponding to each word sequence into the second detection model for performing a second detection process, obtain second detection scores consistent with the number of the word sequences, and obtain a detection result of the text to be detected based on the second detection scores.
The specific implementation of the text information detection device of this embodiment is substantially the same as that of the text information detection method described above and is not repeated here.
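The cooperation of the four modules of the device 1100 can be pictured with the following sketch, in which each module is a pluggable callable. The class name `TextInfoDetectionDevice`, the field names, and the callable signatures are assumptions for illustration; the embodiment does not prescribe a software structure.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class TextInfoDetectionDevice:
    # Obtaining module 1110: text -> word sequences.
    obtain: Callable[[str], List[List[str]]]
    # Mask processing module 1120: word sequence -> mask sequences.
    mask: Callable[[List[str]], List[List[str]]]
    # First detection module 1130: mask sequence -> detection probability vector.
    first_detect: Callable[[List[str]], List[float]]
    # Second detection module 1140: vectors of one word sequence -> second score.
    second_detect: Callable[[List[List[float]]], float]

    def second_scores(self, text: str) -> List[float]:
        """Return one second detection score per word sequence of the text."""
        scores: List[float] = []
        for word_seq in self.obtain(text):
            vectors = [self.first_detect(m) for m in self.mask(word_seq)]
            scores.append(self.second_detect(vectors))
        return scores
```

Because the number of returned scores follows the number of word sequences produced by the obtaining module, the sketch mirrors the statement that the second detection scores are consistent in number with the word sequences.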
The embodiment of the application also provides electronic equipment, which comprises:
at least one memory;
at least one processor;
at least one program;
the program is stored in the memory, and the processor executes the at least one program to implement the text information detection method described above. The electronic equipment can be any intelligent terminal including a mobile phone, a tablet personal computer, a personal digital assistant (Personal Digital Assistant, PDA for short), a vehicle-mounted computer and the like.
Referring to fig. 12, fig. 12 illustrates a hardware structure of an electronic device according to another embodiment, the electronic device includes:
the processor 1201, which may be implemented by a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute related programs to implement the technical solutions provided by the embodiments of the present application;
the memory 1202, which may be implemented in the form of a read-only memory (ROM), a static storage device, a dynamic storage device, or a random access memory (RAM). The memory 1202 may store an operating system and other application programs; when the technical solutions provided in the embodiments of the present application are implemented by software or firmware, the relevant program code is stored in the memory 1202 and invoked by the processor 1201 to execute the text information detection method of the embodiments of the present application;
an input/output interface 1203 for implementing information input and output;
the communication interface 1204, which is configured to implement communication between the device and other devices, either in a wired manner (e.g., USB, network cable) or in a wireless manner (e.g., mobile network, Wi-Fi, Bluetooth);
a bus 1205 for transferring information between various components of the device such as the processor 1201, memory 1202, input/output interface 1203, and communication interface 1204;
wherein the processor 1201, the memory 1202, the input/output interface 1203, and the communication interface 1204 are communicatively connected to one another inside the device via the bus 1205.
The embodiment of the application also provides a storage medium, which is a computer readable storage medium, and the storage medium stores a computer program, and the computer program realizes the text information detection method when being executed by a processor.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments described herein are intended to explain the technical solutions of the embodiments of the present application more clearly and do not limit those technical solutions. As those skilled in the art will appreciate, as technology evolves and new application scenarios emerge, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the technical solutions shown in the figures do not limit the embodiments of the present application, which may include more or fewer steps than shown, combine certain steps, or use different steps.
The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the present application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the objects before and after it. "At least one of the following" or a similar expression means any combination of the listed items, including any combination of a single item or plural items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may each be single or plural.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division of units is only a logical functional division, and other divisions are possible in actual implementation; for instance, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection via some interfaces, devices, or units, and may be electrical, mechanical, or in another form.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in part, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including a plurality of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the various embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or various other media capable of storing a program.
The preferred embodiments of the present application are described above with reference to the accompanying drawings; this description does not limit the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions, and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims (12)

1. A text information detection method, characterized by comprising:
acquiring at least one word sequence of a text to be detected, wherein the word sequence comprises at least one text word;
generating a weight sequence of the word sequence based on the word weight of the text word in the word sequence, selecting a mask word from the text word according to the weight sequence, and generating a mask sequence of the word sequence according to the mask word;
inputting the mask sequence into at least one first detection model for first detection processing to obtain a first detection score of the mask sequence, and obtaining a detection probability vector of the mask sequence based on at least one first detection score;
and sequentially inputting the detection probability vectors corresponding to each word sequence into a second detection model to carry out second detection processing, obtaining second detection scores consistent with the number of the word sequences, and obtaining a detection result of the text to be detected based on the second detection scores.
2. The method of claim 1, wherein generating a weight sequence of the word sequence based on word weights of the text words in the word sequence comprises:
obtaining the text weight of each text word based on the word frequency of each text word in the word sequence;
selecting the different text words, in the order in which the text words appear, to form a text word order;
and generating a weight element of each text word in the weight sequence based on the text word order.
3. The method of claim 2, wherein generating a weight element for each text word in the weight sequence based on the text word order comprises:
generating the weight element of the first text word according to the text weight of the first text word;
and adding the weight element of the previous text word and the text weight of the current text word based on the text word sequence to obtain the weight element of the current text word.
4. A method for detecting text information according to claim 3, wherein selecting mask words from the text words according to the weight sequence comprises:
generating at least one random number according to the number of weight elements of the weight sequence;
traversing the weight elements in the weight sequence, and selecting the smallest weight element in the weight elements which are larger than or equal to the random number as a target weight;
and selecting the text word corresponding to the target weight as the mask word.
5. The method for detecting text information according to claim 4, wherein when the target weights corresponding to at least two of the random numbers are the same, the selecting the text word corresponding to the target weight as the mask word includes:
selecting the target weight corresponding to the previous random number as a candidate weight based on the selection sequence of the random number;
taking the weight element that follows the candidate weight in the weight sequence as the target weight corresponding to the next random number;
and respectively selecting the text words corresponding to the candidate weight and the target weight as the mask words.
6. The method for detecting text information according to claim 1, wherein a plurality of first detection models are provided, and the inputting the mask sequence into at least one first detection model for first detection processing to obtain a first detection score of the mask sequence, and obtaining a detection probability vector of the mask sequence based on at least one first detection score, includes:
vectorizing the mask sequence to obtain a mask vector;
inputting the mask vector into each of the plurality of first detection models for first detection processing to obtain a first probability score and a second probability score output by each first detection model;
obtaining a first detection score of the mask sequence according to the first probability score and the second probability score;
and splicing the plurality of first detection scores to obtain the detection probability vector of the mask sequence.
7. The method for detecting text information according to claim 1, wherein each word sequence has a plurality of mask sequences, and the sequentially inputting the detection probability vectors corresponding to each word sequence into a second detection model for second detection processing to obtain second detection scores consistent with the number of the word sequences includes:
sequentially inputting a plurality of detection probability vectors corresponding to each word sequence into a second detection model to carry out second detection processing, so as to obtain a third detection score of each detection probability vector;
accumulating the plurality of third detection scores to obtain detection accumulated scores;
and calculating the average value of the detection accumulation scores to obtain a second detection score of the word sequence.
8. The method for detecting text information according to claim 1, wherein the step of obtaining at least one word sequence of the text to be detected includes:
word segmentation is carried out on the text to be detected to obtain a word sequence to be detected;
removing the stop words contained in the word sequence to be detected based on the stop word list to obtain a simplified word sequence;
and dividing the simplified word sequence to obtain at least one word sequence.
9. The method for detecting text information according to claim 7, wherein the obtaining the detection result of the text to be detected based on the second detection score includes:
obtaining a first adjustment parameter according to a preset first weight and a plurality of first detection scores;
obtaining a second adjustment parameter according to a preset second weight and a plurality of third detection scores;
adjusting a preset threshold according to the first adjustment parameter and the second adjustment parameter;
and obtaining a detection result according to the comparison result of the second detection score and the preset threshold value.
10. A text information detecting apparatus, characterized in that the apparatus comprises:
the acquisition module is used for acquiring at least one word sequence of the text to be detected;
the mask processing module is used for generating a weight sequence of the word sequence based on the word weight of the text word in the word sequence, selecting mask words from the text word according to the weight sequence, and generating a mask sequence of the word sequence according to the mask words;
the first detection module is used for inputting the mask sequence into at least one first detection model to perform first detection processing to obtain a first detection score of the mask sequence, and obtaining a detection probability vector of the mask sequence based on at least one first detection score;
and the second detection module is used for sequentially inputting the detection probability vectors corresponding to each word sequence into a second detection model for second detection processing to obtain second detection scores consistent with the number of the word sequences, and obtaining a detection result of the text to be detected based on the second detection scores.
11. An electronic device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the text information detection method of any of claims 1 to 9 when executing the computer program.
12. A computer-readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the text information detection method according to any one of claims 1 to 9.
CN202311214190.2A 2023-09-19 2023-09-19 Text information detection method, device, equipment and storage medium Pending CN117332038A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311214190.2A CN117332038A (en) 2023-09-19 2023-09-19 Text information detection method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311214190.2A CN117332038A (en) 2023-09-19 2023-09-19 Text information detection method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117332038A (en) 2024-01-02

Family

ID=89282126

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311214190.2A Pending CN117332038A (en) 2023-09-19 2023-09-19 Text information detection method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117332038A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111783443A (en) * 2020-06-29 2020-10-16 百度在线网络技术(北京)有限公司 Text disturbance detection method, disturbance reduction method, disturbance processing method and device
CN112069795A (en) * 2020-08-28 2020-12-11 平安科技(深圳)有限公司 Corpus detection method, apparatus, device and medium based on mask language model
CN113628043A (en) * 2021-09-17 2021-11-09 平安银行股份有限公司 Complaint validity judgment method, device, equipment and medium based on data classification
WO2022160447A1 (en) * 2021-01-28 2022-08-04 平安科技(深圳)有限公司 Text error correction method, apparatus and device, and storage medium
CN116258137A (en) * 2023-03-03 2023-06-13 华润数字科技有限公司 Text error correction method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination