CN109271489B - Text detection method and device

Info

Publication number: CN109271489B
Application number: CN201811247368.2A
Authority: CN
Inventors: 杨俊�; 陈立
Original assignee: 4Paradigm Beijing Technology Co Ltd
Current assignee: 4Paradigm Beijing Technology Co Ltd
Priority date: 2018-10-25
Filing date: 2018-10-25
Publication date: 2020-12-15
Anticipated expiration: 2038-10-25
Also published as: CN109271489A

Abstract

The invention discloses a text detection method and a text detection device, relates to the technical field of text detection, and mainly aims to improve the accuracy of text detection. The main technical scheme of the invention is as follows: acquiring a text to be detected; determining co-occurrence words of the target phrase; determining paragraphs to be detected in the text to be detected by utilizing the co-occurrence words; detecting whether the paragraph to be detected contains content matched with the target phrase by using a machine learning model; and determining whether the text to be detected meets the requirements or not according to the matching result and the detection rule corresponding to the target phrase. The method is used for normative detection of the text.

Description

Text detection method and device

Technical Field

The present invention relates to the field of text detection technologies, and in particular, to a text detection method and apparatus.

Background

With the emergence and development of the internet, the number of text documents on the internet has grown rapidly, and text detection technology has been widely applied in fields such as document retrieval and public opinion management. Text detection determines whether text content contains the data required by a user by detecting whether the text contains information set by the user, such as phrases or sentences.

At present, most common text detection either performs matching based on keywords set by a user, or uses natural language processing technology to compute several near-synonyms of the keywords and performs text detection with those words. However, in many cases, especially when a word has multiple meanings, the existing detection methods have difficulty determining the intended meaning of the keyword set by the user, and simple near-synonyms are not sufficient for search and matching. As a result, the detection results obtained with the existing methods still suffer from low accuracy and a narrow range of application.

Disclosure of Invention

In view of the above problems, the present invention provides a text detection method and apparatus, and mainly aims to improve the accuracy of text detection.

In order to achieve the purpose, the invention mainly provides the following technical scheme:

In one aspect, the present invention provides a text detection method, which specifically includes:

acquiring a text to be detected;

determining co-occurrence words of the target phrase;

determining paragraphs to be detected in the text to be detected by utilizing the co-occurrence words;

detecting whether the paragraph to be detected contains content matched with the target phrase by using a machine learning model;

and determining whether the text to be detected meets the requirements or not according to the matching result and the detection rule corresponding to the target phrase.

Preferably, the determining the paragraphs to be detected in the text to be detected by using the co-occurrence words comprises:

for each co-occurrence word of the target phrase, searching the co-occurrence word in the text to be detected;

determining a paragraph to be detected according to the position of the co-occurrence word in the text to be detected and the preset paragraph length when the co-occurrence word is found;

and if the target phrase has no co-occurrence word or the target phrase has the co-occurrence word but the co-occurrence word does not exist in the text to be detected, determining the whole text to be detected as a paragraph to be detected.

Preferably, the determining a paragraph to be detected according to the position of the co-occurrence word in the text to be detected and the preset paragraph length includes:

in the text to be detected, characters with preset lengths are taken before and after the found co-occurrence word to obtain a paragraph to be detected;

or,

and in the text to be detected, selecting a paragraph which has the length equal to the preset length and contains the found co-occurrence word as the paragraph to be detected.

Preferably, the detecting, by using a machine learning model, whether the paragraph to be detected contains content that matches the target phrase includes:

detecting whether the similarity between the word vector of the phrase in the paragraph to be detected and the word vector of the target phrase reaches a preset similarity value or not by using a word vector model; the word vector model is a model for identifying synonyms, which is obtained based on neural network training;

and if so, determining that the text to be detected contains the content matched with the target phrase.

Preferably, the determining co-occurrence words of the target phrase comprises:

and acquiring corresponding co-occurrence words from a preset co-occurrence word list according to the target phrase, wherein the preset co-occurrence word list is obtained based on text set statistics in the same technical field as the text to be detected.

Preferably, the method further comprises:

when the named entity corresponding to the target phrase needs to be identified, calling a named entity identification model to judge whether the named entity name corresponding to the target phrase exists in the text to be detected;

if the named entity name exists, the text to be detected meets the requirement;

and if the named entity name does not exist, the text to be detected does not meet the requirement.

Preferably, the acquiring the text to be detected includes:

and converting the voice to be quality-checked into text to obtain the text to be detected.

In another aspect, the present invention provides a text detection apparatus, which specifically includes:

the text acquisition unit is used for acquiring a text to be detected;

a co-occurrence word determination unit for determining co-occurrence words of the target phrase;

a paragraph determination unit, configured to determine a paragraph to be detected in the text to be detected acquired by the text acquisition unit, by using the co-occurrence words obtained by the co-occurrence word determination unit;

a content matching unit, configured to detect, by using a machine learning model, whether the to-be-detected paragraph determined by the paragraph determination unit contains content that matches the target phrase;

and the text detection unit is used for determining whether the text to be detected meets the requirements or not according to the matching result obtained by the content matching unit and the detection rule corresponding to the target phrase.

Preferably, the paragraph determination unit includes:

the searching module is used for searching each co-occurrence word of the target phrase in the text to be detected;

the first determining module is used for determining a paragraph to be detected according to the position of the co-occurrence word in the text to be detected and the preset paragraph length when the co-occurrence word is found by the searching module;

and the second determining module is used for determining the whole text to be detected as the paragraph to be detected if the target phrase has no co-occurrence word or the target phrase has the co-occurrence word but the co-occurrence word does not exist in the text to be detected.

Preferably, the first determining module comprises:

the first determining submodule is used for taking characters with preset lengths in front of and behind the found co-occurrence word in the text to be detected to obtain a paragraph to be detected;

and the second determining submodule is used for selecting, from the text to be detected, a paragraph whose length is equal to the preset length and which contains the found co-occurrence word as the paragraph to be detected.

Preferably, the content matching unit includes:

the word vector detection module is used for detecting whether the similarity between the word vector of the phrase in the paragraph to be detected and the word vector of the target phrase reaches a preset similarity value or not by using a word vector model; the word vector model is a model for identifying synonyms, which is obtained based on neural network training;

and the determining module is used for determining that the text to be detected contains the content matched with the target phrase when the word vector detecting module determines that the similarity value reaches a preset similarity value.

Preferably, the co-occurrence word determining unit is specifically configured to obtain the corresponding co-occurrence word from a preset co-occurrence word list according to the target phrase, where the preset co-occurrence word list is obtained based on statistics of a text set in the same technical field as the text to be detected.

Preferably, the apparatus further comprises:

the named entity detection unit is used for calling a named entity recognition model to judge whether the named entity name corresponding to the target phrase exists in the text to be detected when the named entity corresponding to the target phrase needs to be identified; if the named entity name exists, the text to be detected meets the requirement; and if the named entity name does not exist, the text to be detected does not meet the requirement.

Preferably, the text acquisition unit is specifically configured to convert the voice to be quality-checked into a text, so as to obtain the text to be detected.

In another aspect, the present invention provides a computer-readable storage medium, wherein the computer-readable storage medium has a computer program stored thereon, and wherein the computer program, when executed by one or more computing devices, implements the text detection method described above.

In another aspect, the present invention provides a system comprising one or more computing devices and one or more storage devices having a computer program recorded thereon, which when executed by the one or more computing devices, causes the one or more computing devices to implement the text detection method described above.

By means of the above technical scheme, when detecting whether a text to be detected contains a target phrase, the co-occurrence words of the target phrase are used to determine the paragraphs to be detected in the text, that is, the positions where words matching the target phrase are likely to appear, which achieves a preliminary filtering of the sentences in the text to be detected. The machine learning model is then used to query and match the paragraphs to be detected one by one, so as to judge whether they contain content matched with the target phrase. Because detection with the machine learning model can judge the semantics of the target phrase more accurately, the accuracy of detecting the text to be detected is improved.

The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.

Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:

Fig. 1 shows a flowchart of a text detection method according to an embodiment of the present invention;

Fig. 2 shows a flowchart of another text detection method according to an embodiment of the present invention;

Fig. 3 shows a block diagram of a text detection apparatus according to an embodiment of the present invention;

Fig. 4 shows a block diagram of another text detection apparatus according to an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the invention are shown in the drawings, it should be understood that the invention can be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the invention to those skilled in the art.

The embodiment of the invention provides a text detection method which is used for detecting whether a text contains contents matched with a target phrase or not and determining whether the text meets requirements or not according to a judgment result. The method comprises the following specific steps as shown in figure 1:

101. Acquire the text to be detected.

102. Determine co-occurrence words of the target phrase.

Co-occurrence words are words or phrases that are often used together with the target phrase, or that appear together with the target phrase in a text. The embodiment of the invention does not limit the number of co-occurrence words determined. The co-occurrence words may be entered manually, that is, specified by the user for the target phrase, or they may be obtained through data analysis and statistics. In the latter case, the co-occurrence words are generally obtained by word segmentation statistics over a large number of texts: for example, words appearing adjacent to or near the target phrase are counted, and words whose frequency of occurrence exceeds a preset value are determined to be co-occurrence words of the target phrase.
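The statistical route described in step 102 can be illustrated with a minimal Python sketch. It assumes the texts have already been segmented into tokens and that the target phrase is a single token; the function name `cooccurrence_words`, the `window` size, and the `threshold` parameter are hypothetical choices for illustration and are not specified by the embodiment.

```python
from collections import Counter

def cooccurrence_words(tokenized_texts, target, window=2, threshold=1):
    """Return the words seen within `window` tokens of `target` more than
    `threshold` times across all texts (hypothetical helper)."""
    counts = Counter()
    for tokens in tokenized_texts:
        for i, token in enumerate(tokens):
            if token != target:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            # Count every neighbor inside the window, excluding the target itself.
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return {word for word, n in counts.items() if n > threshold}

# Toy corpus, invented for this example only.
corpus = [
    ["please", "confirm", "your", "address", "now"],
    ["confirm", "the", "delivery", "address"],
    ["the", "weather", "is", "fine"],
]
print(cooccurrence_words(corpus, "address", window=3, threshold=1))  # {'confirm'}
```

Only "confirm" appears near "address" more than once in this toy corpus, so it is the only word kept.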

103. Determine the paragraphs to be detected in the text to be detected by using the co-occurrence words.

The embodiment of the present invention does not specifically limit the paragraph to be detected: it may be a passage of several sentences, a single sentence, or a phrase of several words in the text to be detected.

The purpose of this step is to mark or screen out the paragraphs in the text to be detected that contain co-occurrence words. It should be noted that, because the target phrase may have a plurality of co-occurrence words, a given paragraph to be detected may also contain a plurality of co-occurrence words.

104. Detect whether the paragraph to be detected contains content matched with the target phrase by using a machine learning model.

In the embodiment of the invention, the machine learning model is a model, obtained by training a neural network, that can identify phrases or words matching the semantics of the target phrase. In practical application, the input of the machine learning model is a paragraph to be detected and a target phrase, and the output is a matching result, namely whether the paragraph to be detected matches the target phrase. A match indicates that the paragraph to be detected contains words with the same or similar semantics as the target phrase; no match indicates that none of the paragraphs to be detected in the text contains words with the same or similar semantics as the target phrase.

105. Determine whether the text to be detected meets the requirements according to the matching result and the detection rule corresponding to the target phrase.

Specifically, this step is used for classifying a large number of texts to be detected, that is, extracting the texts that meet the requirements. In practical application, for a single target phrase, the detection rule may specify either that the text to be detected meets the requirement when the matching result is a match, or that it meets the requirement when the matching result is no match. In general, a plurality of target phrases are set for detecting a text, in which case whether the text meets the requirements must also be determined according to a preset determination rule. For example, the text is determined to meet the requirement when it is successfully matched with any one of the target phrases; or, the text is determined to meet the requirement only when it is successfully matched with all of the target phrases.
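A minimal Python sketch of such detection rules follows. It assumes each target phrase carries a flag saying whether the phrase is required to appear, and that multiple phrases are combined with an "any" or "all" rule; the names `phrase_satisfied`, `text_meets_requirement`, `must_appear`, and `mode` are hypothetical and chosen only for illustration.

```python
def phrase_satisfied(matched: bool, must_appear: bool) -> bool:
    """A single-phrase rule: the rule is satisfied when the matching result
    agrees with what the rule demands (phrase present, or phrase absent)."""
    return matched == must_appear

def text_meets_requirement(results, mode="any"):
    """Combine per-phrase outcomes.

    results: list of (matched, must_appear) pairs, one per target phrase.
    mode:    "any" -> one satisfied phrase is enough;
             "all" -> every phrase must be satisfied.
    """
    outcomes = [phrase_satisfied(matched, must_appear)
                for matched, must_appear in results]
    if mode == "any":
        return any(outcomes)
    if mode == "all":
        return all(outcomes)
    raise ValueError(f"unknown mode: {mode!r}")

# Two phrases that must appear; only the first one was matched.
results = [(True, True), (False, True)]
print(text_meets_requirement(results, mode="any"))  # True
print(text_meets_requirement(results, mode="all"))  # False
```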

As can be seen from the embodiment shown in fig. 1, when detecting a text, the embodiment of the present invention first locates the detection positions in the text by using the co-occurrence words corresponding to the target phrase, then performs semantic matching on the located paragraphs to be detected by using a machine learning model to determine whether they contain words matched with the target phrase, and finally determines whether the text meets the requirements according to the detection rule of the target phrase. Because the co-occurrence words and the machine learning model rely on large-scale data statistics and neural networks, whether the semantics of the text are the same as those of the target phrase can be judged more accurately, which further improves the accuracy of text detection.

Further, in another embodiment of the present invention, the text detection method is applied to the field of voice detection, where the text to be detected is obtained by converting the voice to be detected into text. This field mainly covers industries whose main business is voice service, such as human customer service by telephone and in internet e-commerce, which generate a large amount of voice data every day. In order to objectively evaluate the service level of human customer service staff, the voice data needs to be detected to determine whether it contains indicative wording that can be used for evaluation, namely the target phrases; the service process of the customer service staff is then evaluated based on the target phrases. Based on this application scenario, the embodiment of the present invention describes in detail a specific text detection process, as shown in fig. 2:

201. Convert the voice to be quality-checked into text to obtain the text to be detected.

The specific manner of converting voice into text is not limited; any existing speech recognition program may be used to convert the voice data into text and generate the text to be detected.

202. Acquire the corresponding co-occurrence words from a preset co-occurrence word list according to the target phrase.

Specifically, the target phrase is set by the user and is generally a specific condition for determining whether the text meets the requirement; for example, the target phrase can be used to determine whether the wording used by customer service staff or consultants is professional and standard. One or more target phrases may be set, and when there are a plurality of target phrases, the corresponding co-occurrence words are acquired for each target phrase one by one to form a co-occurrence word set.

The preset co-occurrence word list in this embodiment is obtained by statistics over a text set in the same technical field as the text to be detected. Such a text set may be obtained by classifying the texts in a public data set according to preset technical fields; the user selects the technical field to which the text to be detected belongs, which determines the text set of the same technical field. Word segmentation statistics are then performed on the texts in the text set to create the preset co-occurrence word list for that technical field. The list records the segmented words in the text set that have co-occurrence words, where the number of times or the frequency with which a co-occurrence word appears together with the segmented word should be greater than a preset value.

It should be noted that the preset co-occurrence word list changes as the content of the text set changes: as new texts are added, the content is updated, and the co-occurrence words of each segmented word in the list change accordingly. The preset co-occurrence word list in this step is therefore a dynamically updated data list. When a target phrase is obtained for text detection, the target phrase is looked up in the preset co-occurrence word list in real time; if it is present, the corresponding co-occurrence words are obtained, and if it is not, a prompt indicates that the target phrase has no co-occurrence words.

203. Determine the paragraphs to be detected in the text to be detected by using the co-occurrence words.

Specifically, since the target phrase may or may not have co-occurrence words in the preset co-occurrence word list, this step determines the paragraphs to be detected separately for the two cases.

In the first case, co-occurrence words corresponding to the target phrase exist. The co-occurrence words are searched for one by one in the text to be detected, and each time a co-occurrence word is found, a paragraph to be detected is determined according to the position of the co-occurrence word in the text to be detected and the preset paragraph length. The preset paragraph length is generally measured in characters, on the assumption that the words or phrases within that length may include the target phrase or phrases with semantics similar to the target phrase.

In the embodiment of the present invention, the feasible manner of determining a paragraph to be detected according to the preset paragraph length includes:

firstly, in the text to be detected, characters with preset lengths are taken before and after the found co-occurrence word, and a paragraph to be detected is obtained.

Secondly, in the text to be detected, the paragraph with the length equal to the preset length and containing the found co-occurrence word is taken as the paragraph to be detected.

The two modes differ in the position of the co-occurrence word within the paragraph to be detected. In the first mode the position is relatively fixed; in the second mode it is not, that is, every possible paragraph whose length equals the preset length and which contains the co-occurrence word is extracted. For example, suppose the co-occurrence word is A and the text to be detected contains the passage BCADF. If the preset length taken before and after the co-occurrence word in the first mode is 1 letter, the paragraph to be detected obtained in the first mode is CAD. If the preset paragraph length in the second mode is 3 letters, the paragraphs to be detected obtained in the second mode are the three paragraphs BCA, CAD, and ADF.
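The BCADF example can be reproduced with a short Python sketch of the two modes. The function names `paragraph_centered` and `paragraphs_sliding` and their parameters are hypothetical; the sketch works on character positions and assumes the position of the found co-occurrence word is already known.

```python
from typing import List

def paragraph_centered(text: str, pos: int, word_len: int, margin: int) -> str:
    """Mode one: take `margin` characters before and after the found word,
    clipped to the bounds of the text."""
    start = max(0, pos - margin)
    end = min(len(text), pos + word_len + margin)
    return text[start:end]

def paragraphs_sliding(text: str, pos: int, word_len: int, length: int) -> List[str]:
    """Mode two: every window of exactly `length` characters that fully
    contains the found word."""
    first = max(0, pos + word_len - length)   # leftmost start that still covers the word
    last = min(pos, len(text) - length)       # rightmost start that still fits in the text
    return [text[s:s + length] for s in range(first, last + 1)]

text = "BCADF"
pos = text.find("A")  # position of the co-occurrence word
print(paragraph_centered(text, pos, word_len=1, margin=1))   # CAD
print(paragraphs_sliding(text, pos, word_len=1, length=3))   # ['BCA', 'CAD', 'ADF']
```

Mode one yields one paragraph per hit, whereas mode two yields up to `length - word_len + 1` overlapping paragraphs, which increases the amount of text passed on for matching.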

In the second case, no co-occurrence word corresponding to the target phrase exists. The whole text to be detected is then determined as the paragraph to be detected.

In addition, when co-occurrence words corresponding to the target phrase exist but none of them is found in the text to be detected, the whole text to be detected is likewise determined as the paragraph to be detected.
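Putting the cases of step 203 together, a minimal Python sketch of paragraph selection with the whole-text fallback might look as follows. It uses the first extraction mode (a fixed number of characters before and after each hit); the function name `paragraphs_to_detect` and the default `margin` are hypothetical.

```python
def paragraphs_to_detect(text, cooccurrence_words, margin=10):
    """Return the paragraphs to be detected for one target phrase.

    Each occurrence of each co-occurrence word yields one paragraph. If the
    phrase has no co-occurrence words, or none of them occurs in the text,
    the whole text is returned as the single paragraph to be detected.
    """
    paragraphs = []
    for word in cooccurrence_words:
        if not word:
            continue  # ignore empty entries
        start = text.find(word)
        while start != -1:
            lo = max(0, start - margin)
            hi = min(len(text), start + len(word) + margin)
            paragraphs.append(text[lo:hi])
            start = text.find(word, start + 1)
    return paragraphs or [text]

print(paragraphs_to_detect("please confirm your address", ["confirm"], margin=3))
# ['se confirm yo']
print(paragraphs_to_detect("please confirm your address", []))
# ['please confirm your address']
```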

204. Detect whether the paragraph to be detected contains content matched with the target phrase by using a machine learning model.

Specifically, in this embodiment the machine learning model is a word vector model, which is a model for identifying synonyms obtained by neural network training. To train the word vector model, a large amount of corpus data is segmented into words, a neural network model is constructed, and word vectors are obtained through training. In actual use, the segmented words or phrases in a paragraph to be detected and the target phrase are input into the word vector model, which outputs the word vectors of the segmented words and compares them with the word vector of the target phrase, that is, calculates whether the similarity between the word vectors reaches a preset similarity value. If it does, the text to be detected is determined to contain content matched with the target phrase, and the word vector model outputs a matching result of successful matching. If all the paragraphs to be detected in the text fail to match the target phrase, the output matching result is that the text to be detected does not contain the target phrase.
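The similarity test in step 204 can be sketched in Python as follows. The embodiment does not name a similarity measure, so cosine similarity is used here as an assumption; the three-dimensional vectors in `toy_vectors` are invented for illustration and stand in for vectors that a trained word vector model would supply. The names `cosine`, `paragraph_matches`, and `threshold` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def paragraph_matches(paragraph_tokens, target, vectors, threshold=0.8):
    """True if any token in the paragraph has a vector whose similarity to
    the target phrase's vector reaches the preset threshold."""
    target_vec = vectors.get(target)
    if target_vec is None:
        return False
    for token in paragraph_tokens:
        vec = vectors.get(token)
        if vec is not None and cosine(vec, target_vec) >= threshold:
            return True
    return False

# Invented toy vectors; a real system would look these up in a trained model.
toy_vectors = {
    "thanks":    [0.9, 0.1, 0.0],
    "gratitude": [0.8, 0.2, 0.1],
    "address":   [0.0, 0.1, 0.9],
}
print(paragraph_matches(["my", "gratitude"], "thanks", toy_vectors))  # True
print(paragraph_matches(["my", "address"], "thanks", toy_vectors))    # False
```

Tokens without a vector (such as "my" here) are simply skipped, which is one possible way to handle out-of-vocabulary words.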

In one embodiment of the invention, content that matches a phrase in a quality inspection item refers to synonymous phrases (also referred to as near-synonyms). For example, if the phrase in the quality inspection item is "thank you" and the quality inspection rule requires that the phrase must appear, then "thank you" and its synonymous phrases, such as "thanks", are searched for in the paragraph to be detected.

205. Determine whether the text to be detected meets the requirements according to the matching result and the detection rule corresponding to the target phrase.

This step is the same as step 105 in fig. 1, and is not described here again.

206. Perform named entity detection on the text to be detected by using the named entity recognition model.

Named entity detection is a further detection performed on the specific content corresponding to the target phrase. For example, when the target phrase is "identity card number", the corresponding named entity is a specific 18-character string of digits or letters; when the target phrase is "addressee", the corresponding named entity is specific address information, including region, street, house number, and the like.

This is because, in scenarios such as insurance telemarketing, telemarketers must state the actual names and addresses when confirming the applicant, beneficiary, address, and so on with the customer. Taking telephone car insurance marketing as an example, after the car insurance has been sold, the salesperson needs to confirm information such as name and address with the applicant. If the phrase "address" alone were used to judge whether the information was confirmed, the salesperson might in practice merely ask the applicant "The address has not changed?", in which case the address has not actually been confirmed. Moreover, such open-ended information cannot be matched against a fixed sentence. A feasible approach is therefore to identify whether the utterance contains a valid address, which has to be determined with a named entity recognition model.

Specifically, before this step is executed, the target phrases that require named entity detection, the identification format of the corresponding named entity information, and other such content need to be set in the detection configuration file. During actual execution, the named entity recognition model is called to judge whether a named entity name corresponding to the target phrase, such as a person name, a place name, or an organization name, exists in the text to be detected. If it exists, the text to be detected is determined to meet the requirement; otherwise, the text to be detected is determined not to meet the requirement.

It should be noted that the named entity recognition model, like the word vector model described above, is a model obtained by training a neural network; it is used for recognizing whether the content of a text contains a valid corresponding named entity. It is trained by training an existing neural network model with labeled samples, and the specific manner is not limited in this embodiment.

Further, in a possible embodiment of the present invention, this step may be applied to the text to be detected that was determined to meet the requirement in step 205, as a further named entity detection: the text to be detected is determined to meet the requirement when the named entity name corresponding to the target phrase exists in it, and otherwise is determined not to meet the requirement.
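The configuration-driven check in step 206 can be illustrated with a small Python sketch. The embodiment calls a trained named entity recognition model; the sketch below deliberately substitutes a regular expression as a stand-in for that model, and only for the "identity card number" example, where the expected format (18 characters, digits with an optional trailing letter) can be written as a pattern. The names `NER_CONFIG` and `has_named_entity` and the exact pattern are assumptions for illustration.

```python
import re

# Hypothetical detection configuration: target phrase -> expected entity format.
# The regex is a stand-in for a trained named entity recognition model.
NER_CONFIG = {
    # 17 digits followed by a digit or the letter X, not embedded in a longer number.
    "identity card number": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
}

def has_named_entity(text, target_phrase, config=NER_CONFIG):
    """True if the text contains an entity in the format configured for
    `target_phrase`. Raises KeyError if no check is configured."""
    pattern = config[target_phrase]
    return pattern.search(text) is not None

confirmed = "your identity card number is 11010519900101123X, correct?"
not_confirmed = "your identity card number has not changed, right?"
print(has_named_entity(confirmed, "identity card number"))      # True
print(has_named_entity(not_confirmed, "identity card number"))  # False
```

The second utterance mentions the target phrase but never states the number itself, which is exactly the case that phrase matching alone would accept and the named entity check rejects.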

As described above in detail with reference to fig. 2, the text detection method provided in the embodiment of the present invention can also improve the accuracy of quality inspection when applied to voice data. For a quality inspection item (target phrase) set by a user, matching detection with co-occurrence words and the word vector model can accurately identify whether the voice data contains the target phrase or content semantically similar to it. The invention further detects the content of the text to be detected by using a named entity recognition model, so as to determine whether a corresponding valid named entity exists within a certain fixed range after the detected target phrase. For voice data, this makes it possible to determine whether the necessary named entity data is present, and thereby to check whether the wording of customer service staff or consultants follows the prescribed script.

Further, as an implementation of the text detection method, an embodiment of the present invention provides a text detection apparatus, which is mainly used for detecting whether a text contains content matched with a target phrase, and classifying the text according to a determination result. For convenience of reading, details in the foregoing method embodiments are not described in detail again in this apparatus embodiment, but it should be clear that the apparatus in this embodiment can correspondingly implement all the contents in the foregoing method embodiments. As shown in fig. 3, the apparatus specifically includes:

a text acquiring unit 31, configured to acquire a text to be detected;

a co-occurrence word determining unit 32, configured to determine co-occurrence words of the target phrase;

a paragraph determining unit 33, configured to determine, by using the co-occurrence words obtained by the co-occurrence word determining unit 32, a paragraph to be detected in the text to be detected acquired by the text acquiring unit 31;

a content matching unit 34, configured to detect, by using a machine learning model, whether the paragraph to be detected determined by the paragraph determining unit 33 contains content that matches the target phrase;

and the text detection unit 35 is configured to determine whether the text to be detected meets the requirements according to the matching result obtained by the content matching unit 34 and the detection rule corresponding to the target phrase.

Further, as shown in fig. 4, the paragraph determination unit 33 includes:

a searching module 331, configured to search the text to be detected for each co-occurrence word of the target phrase;

a first determining module 332, configured to determine, when the searching module 331 finds the co-occurrence word, one paragraph to be detected according to the position of the co-occurrence word in the text to be detected and a preset paragraph length;

a second determining module 333, configured to determine the entire text to be detected as a paragraph to be detected if the target phrase has no co-occurrence word, or if the target phrase has a co-occurrence word but the searching module 331 does not find it in the text to be detected.

Further, as shown in fig. 4, the first determining module 332 includes:

a first determining sub-module 3321, configured to take, from the text to be detected, a preset length of characters before and after the found co-occurrence word, to obtain a paragraph to be detected;

a second determining sub-module 3322, configured to select, from the text to be detected, a paragraph whose length equals the preset length and that includes the found co-occurrence word as a paragraph to be detected.
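The behaviour of the paragraph determining unit can be illustrated with a minimal sketch. It follows the first sub-module's strategy (a preset number of characters before and after each hit) and falls back to the whole text when no co-occurrence word is available or found. The function name `find_paragraphs` and its parameters are hypothetical and are not taken from the embodiment.

```python
def find_paragraphs(text, cooccurrence_words, preset_length):
    """Return the paragraphs to be detected around each co-occurrence word hit."""
    paragraphs = []
    for word in cooccurrence_words:
        start = text.find(word)
        while start != -1:
            # Keep preset_length characters before and after the found word.
            left = max(0, start - preset_length)
            right = min(len(text), start + len(word) + preset_length)
            paragraphs.append(text[left:right])
            start = text.find(word, start + 1)
    # No co-occurrence words, or none found: the entire text is the paragraph.
    return paragraphs or [text]
```

Overlapping hits produce overlapping paragraphs in this sketch; a real implementation might merge them.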

Further, as shown in fig. 4, the content matching unit 34 includes:

a word vector detection module 341, configured to detect, by using a word vector model, whether the similarity between the word vector of a phrase in the paragraph to be detected and the word vector of the target phrase reaches a preset similarity value; the word vector model is a model for identifying synonyms, obtained through neural network training;

a determining module 342, configured to determine that the text to be detected contains content matched with the target phrase when the word vector detection module 341 determines that the similarity reaches the preset similarity value.
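As a hedged illustration of the similarity check, the sketch below compares word vectors with cosine similarity against a preset threshold. The embodiment does not name a similarity measure, so cosine similarity is an assumption, and the `vectors` lookup table merely stands in for a trained word vector model. All names are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def paragraph_matches(paragraph_phrases, target_phrase, vectors, threshold):
    """True if any phrase in the paragraph is similar enough to the target phrase."""
    target_vec = vectors.get(target_phrase)
    if target_vec is None:
        return False
    return any(
        phrase in vectors
        and cosine_similarity(vectors[phrase], target_vec) >= threshold
        for phrase in paragraph_phrases
    )
```

In practice the vectors would come from the trained word vector model rather than a hand-written dictionary.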

Further, the co-occurrence word determining unit 32 is specifically configured to obtain the corresponding co-occurrence word from a preset co-occurrence word table according to the target phrase, where the preset co-occurrence word table is obtained based on statistics of a text set in the same technical field as the text to be detected.

Further, as shown in fig. 4, the apparatus further includes:

a named entity detecting unit 36, configured to, when a named entity corresponding to the target phrase needs to be identified, call a named entity recognition model to determine whether a named entity corresponding to the target phrase exists in the text to be detected; if it exists, the text detection unit 35 determines that the text to be detected meets the requirements; if not, the text detection unit 35 determines that the text to be detected does not meet the requirements.
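The named entity check can be sketched as follows, assuming the entity must appear within a fixed number of characters after the target phrase. The recognizer is passed in as a callable. The regular-expression recognizer shown here is only a stand-in for a trained named entity recognition model, and every name in the sketch is hypothetical.

```python
import re

def has_named_entity(text, target_phrase, recognizer, window=30):
    """Check for a named entity within `window` characters after the target phrase."""
    pos = text.find(target_phrase)
    if pos == -1:
        return False
    start = pos + len(target_phrase)
    return bool(recognizer(text[start:start + window]))

def order_number_recognizer(span):
    """Stand-in recognizer: treats any run of five or more digits as an entity."""
    return re.findall(r"\d{5,}", span)
```

For example, a quality inspection item such as "order number" would pass only if a number actually follows the phrase in the transcribed speech.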

Further, the text acquiring unit 31 is specifically configured to convert the voice to be quality-checked into a text, so as to obtain the text to be detected.

In addition to the text detection method and device for detecting the target phrase described above, the embodiment of the present invention may further include detection of a target sentence: detecting whether the text to be detected contains sentences matched with the target sentence, and determining whether the text meets the requirements according to the matching result. A sentence is generally composed of a plurality of phrases, so detecting a target sentence can capture the semantics of the sentence more effectively. The specific steps are as follows:

firstly, acquiring a text to be detected.

And secondly, determining a sentence set to be detected in the text to be detected.

The sentences in the sentence set to be detected are determined based on the target sentence set by a user. One way is to screen the text to be detected based on the length of the target sentence, finding sentences of matching length and adding them to the sentence set to be detected. Another way is to screen by using the symbolic words in the target sentence, determining the sentences that contain these words as sentences to be detected and adding them to the sentence set to be detected.

The main purpose of this step is to reduce the amount of calculation required for sentence matching in the subsequent process by filtering out sentences in the text to be detected whose semantics obviously differ from the target sentence, so as to improve the detection efficiency. Especially for long texts to be detected, matching every sentence against the target sentence one by one would consume a large amount of computing resources, making the matching time too long and degrading the user experience.

Thirdly, the semantic similarity between each sentence in the sentence set to be detected and the target sentence is detected one by one by using a semantic similarity model.

The semantic similarity model is a data model obtained by training a neural network on a large number of labeled semantic pairs, and it can identify the semantic similarity between two sentences. A labeled semantic pair is a pair of sentences labeled through manual or automated processing. In this embodiment, the label on a pair may be "semantically similar" or "semantically dissimilar", or it may be the semantic similarity value of the two sentences. In this step, the specific way of training the semantic similarity model based on the neural network algorithm is not limited.

The sentence to be detected and the target sentence are input into the semantic similarity model, the semantic similarity model correspondingly outputs a detection result, and the detection result can be a judgment result of 'sentence approximation' or 'sentence non-approximation', or a calculation result of similarity values of the two sentences.

And fourthly, judging whether the text to be detected has sentences similar to the target sentences in semantics or not according to the detection result.

Specifically, for a text to be detected, after all sentences to be detected have been checked in the third step, if there is a sentence semantically similar to the target sentence, it may be determined that the text contains the target sentence; if there is no such sentence, it is determined that the text does not contain the target sentence. Of course, in practical applications, a more specific judgment rule may also be set for this step. For example, a threshold on the number of sentences may be set, and the target sentence is determined to exist in the text to be detected only when the number of semantically similar sentences reaches the threshold.
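The judgment rule with a sentence-count threshold can be sketched as below, assuming the semantic similarity model outputs a numeric score per sentence. The function name, the default similarity threshold of 0.8, and the default count of 1 are illustrative assumptions rather than values from the embodiment.

```python
def contains_target_sentence(scores, similarity_threshold=0.8, min_count=1):
    """Decide whether the text contains the target sentence.

    scores: similarity values of each sentence to be detected against the
    target sentence, as output by a semantic similarity model.
    """
    similar = sum(1 for score in scores if score >= similarity_threshold)
    return similar >= min_count
```

With `min_count=1` this reduces to the basic rule: one semantically similar sentence is enough.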

And fifthly, determining whether the text to be detected meets the requirements or not according to the judgment result and the detection rule corresponding to the target sentence.

After determining whether the text to be detected contains sentences semantically similar to the target sentence, the text to be detected is classified based on the detection rule of the target sentence. For example, the rule may specify that a text containing the target sentence meets the requirements, or that a text containing the target sentence does not meet the requirements.

In practical application, this step is used to classify a large number of texts to be detected, i.e., to extract the texts that meet or do not meet the requirements. For a single target sentence, the detection rule may take one of two forms: the text to be detected meets the requirements when a semantically similar sentence is found, or the text to be detected meets the requirements when no match is found. In general, a user may set a plurality of target sentences to detect a text to be detected. In that case, whether the text meets the requirements must also be determined according to a preset judgment rule. For example, the text is determined to meet the requirements when it successfully matches any one of the target sentences, or only when it successfully matches all of the target sentences.
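A minimal sketch of these detection rules follows. Each target sentence carries a flag stating whether its presence or its absence satisfies the rule, and the per-sentence outcomes are combined with an "any" or "all" rule. The function names and the `(found, expect_present)` tuple layout are hypothetical.

```python
def sentence_rule_satisfied(found, expect_present=True):
    """A rule is satisfied when the match result equals what the rule expects."""
    return found == expect_present

def text_meets_requirements(results, mode="any"):
    """Combine per-target-sentence outcomes.

    results: list of (found, expect_present) pairs, one per target sentence.
    mode: "any" (one satisfied rule suffices) or "all" (every rule must hold).
    """
    satisfied = [sentence_rule_satisfied(found, expect) for found, expect in results]
    if mode == "any":
        return any(satisfied)
    if mode == "all":
        return all(satisfied)
    raise ValueError("mode must be 'any' or 'all'")
```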

It can be seen from the above embodiments that, in the text detection, the embodiments of the present invention mainly screen out the sentences to be detected from the text, then detect the semantics of the sentences to be detected and the semantics of the target sentences one by using the semantic similarity model, determine whether the text to be detected contains the sentences having semantics similar to the target sentences, and further determine whether the text to be detected meets the requirements according to the detection rules of the target sentences, thereby implementing the classification detection of the text. In the detection process, because the semantic similarity model is obtained based on neural network training, the text can be detected from semantic dimensionality, and the method is more consistent with judgment of human thinking compared with the existing method for identifying and detecting words based on target sentences, so that the accuracy of the text detection method provided by the embodiment of the invention is higher.

In addition, in another embodiment of the present invention, the text detection method may be further applied to the field of voice quality inspection, where the text to be detected is obtained by converting the voice to be inspected into text. The industries concerned are mainly those built around voice services, such as human customer service by telephone and in internet e-commerce. A large amount of voice data is generated in these industries every day, and in order to objectively evaluate the service level of the human customer service, it is necessary to detect the voice data and determine whether it contains a symbolic sentence, i.e., a target sentence, that can be used for evaluation. The service process of the customer service staff is then evaluated based on the target sentence. The text detection process shown in fig. 1 can achieve the purpose of evaluating the speech of the customer service staff according to the evaluation criteria, i.e., the target sentence. In this application scenario, the speech needs to be converted into text, but the specific manner of conversion is not limited in this embodiment; any existing speech recognition program can be used to convert the speech data into text and generate the text to be detected.

Further, it can be seen from the above description of the embodiment of detecting the target sentence that how the sentences to be detected are determined from the text to be detected, and how the sentence set to be detected is obtained, directly affects the final detection result. Therefore, the second step, determining the sentence set to be detected in the text to be detected, is described in detail below. Its specific implementation schemes include the following:

In the first scheme, the text to be detected is traversed word by word, and paragraphs whose length is consistent with that of the target sentence are intercepted and put into the sentence set to be detected.

This scheme is simple to implement, since the text to be detected is intercepted only according to the length of the target sentence. However, an intercepted sentence may not be a complete sentence in the text, so its meaning may be ambiguous and the detection result may deviate to some extent. Moreover, the number of sentences obtained in this way is large, which does not help the overall efficiency of text detection. On the other hand, because so many sentences are produced, they cover the text to be detected more comprehensively, which can improve the text detection accuracy to a certain extent.
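The first scheme amounts to a sliding window over the text. The sketch below assumes the traversal advances one character at a time and that length is measured in characters; both are assumptions, since the embodiment could equally count segmented words. The function name is hypothetical.

```python
def sliding_candidates(text, target_sentence):
    """Intercept every span of the text whose length equals the target's length."""
    n = len(target_sentence)
    if n == 0 or len(text) < n:
        return []
    # One candidate per start position, so a text of length m yields m - n + 1 spans.
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

The candidate count grows linearly with the text length, which is the efficiency cost noted above.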

In the second scheme, the target sentence is segmented and stop words are removed to obtain one or more target words, which form a positioning word set. Each positioning word in the set is then searched for in the text to be detected, and if it is found, a paragraph that contains the positioning word and whose length is consistent with that of the target sentence is intercepted from the text to be detected as a sentence to be detected.

This scheme is the preferred approach in the embodiment of the invention.

In this scheme, the purpose of removing stop words is to obtain more accurate positioning words. The target sentence may contain words that are of little use for positioning, such as "you" and "I". Removing them reduces the number of words in the positioning word set, so that searching the text to be detected with the positioning words saves a large amount of computing resources and improves the detection efficiency.

When the positioning words are determined, synonyms (or near-synonyms) of some of the target words can further be acquired. The synonyms are added to expand the semantics of the target words in the target sentence, so that the screening range of the sentences to be detected is widened and sentences with similar semantics are not missed. The synonyms can be obtained by using a word vector model, which is a machine learning model for identifying synonyms obtained through neural network training. The word vector model identifies whether the semantics of different words are similar by calculating the word vectors of the input words, so that synonyms of the target words can be calculated and added to the positioning word set. In this embodiment, the number of positioning words in the positioning word set may be limited by setting a number interval. When the number of positioning words is lower than the minimum value of the interval, synonyms may be obtained by using the word vector model to supplement the positioning words; when the number is higher than the maximum value of the interval, synonyms of some target words may be removed accordingly. This ensures that the positioning word set holds a suitable number of positioning words and improves the probability of matching valid sentences to be detected from the text to be detected. It avoids falling back to full-text detection because too few positioning words fail to match any sentence to be detected, and it avoids matching too many sentences to be detected because there are too many positioning words, which would increase the amount of calculation and reduce detection efficiency. Setting the number interval therefore effectively improves the accuracy of the matched sentences to be detected while keeping the detection process efficient.
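The construction of the positioning word set can be sketched as follows. Stop words are removed, synonyms are added when the set falls below the minimum of the number interval, and synonyms (never the original target words) are trimmed when the set exceeds the maximum. The `synonyms` dictionary stands in for lookups against a word vector model, and all names are hypothetical.

```python
def build_positioning_words(target_words, stop_words, synonyms, min_count, max_count):
    """Build a positioning word set whose size is kept inside [min_count, max_count]."""
    # Remove stop words and duplicates while preserving order.
    words = [w for w in dict.fromkeys(target_words) if w not in stop_words]

    # Collect candidate synonyms of the remaining target words.
    extra = []
    for word in words:
        for syn in synonyms.get(word, []):
            if syn not in words and syn not in extra:
                extra.append(syn)

    positioning = list(words)
    if len(positioning) < min_count:
        # Too few positioning words: supplement with synonyms.
        positioning.extend(extra)
    if len(positioning) > max_count:
        # Too many: drop synonyms from the end, but keep every target word.
        positioning = positioning[:max(len(words), max_count)]
    return positioning
```

Which synonyms to drop first is a design choice; this sketch simply discards the ones added last.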

In addition, in this scheme, another condition for determining the sentence to be detected according to the positioning word is that the length of the paragraph intercepted from the text to be detected is consistent with the length of the target sentence. This is because, in most cases, two sentences that express the same meaning use similar words and substantially the same number of words, so sentences with the same semantics have substantially the same length. Further, the position of the positioning word in the intercepted sentence should be the same as the position of the positioning word in the target sentence.

For example, assume the target sentence is ABC. When the positioning word C is searched for in the text to be detected, the sentence ADBCE is found. ADBCE cannot be matched against the target sentence directly and must first be intercepted; according to the interception rule, the sentence to be detected is DBC. When the positioning word F, a synonym of C, is searched for and the sentence RABFD is found, the sentence to be detected is intercepted as ABF.
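The interception rule in this example can be reproduced with a short sketch: the window has the same length as the target sentence and is placed so that the positioning word sits at the same offset as the corresponding target word. Lengths and offsets are counted in characters here, which is an assumption, and the function name and parameters are hypothetical.

```python
def intercept_aligned(text, target_sentence, target_word, positioning_word=None):
    """Intercept spans aligned on the positioning word.

    target_word: the word as it appears in the target sentence.
    positioning_word: the word searched for in the text (the target word
    itself, or one of its synonyms); defaults to target_word.
    """
    positioning_word = positioning_word or target_word
    offset = target_sentence.find(target_word)
    if offset == -1:
        return []
    n = len(target_sentence)
    candidates = []
    hit = text.find(positioning_word)
    while hit != -1:
        start, end = hit - offset, hit - offset + n
        # Keep only windows that lie entirely inside the text.
        if start >= 0 and end <= len(text):
            candidates.append(text[start:end])
        hit = text.find(positioning_word, hit + 1)
    return candidates
```

Calling `intercept_aligned("ADBCE", "ABC", "C")` yields `["DBC"]`, and `intercept_aligned("RABFD", "ABC", "C", "F")` yields `["ABF"]`, matching the example.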

The above describes in detail how to determine the sentences to be detected in the text to be detected and generate the sentence set to be detected. Combined with the other steps of detecting the target sentence, it can be seen that, when performing text detection, the embodiment of the present invention selects some of the sentences in the text to perform semantic similarity detection against the target sentence, and finally determines whether the text to be detected meets the requirements according to the detection result and the detection rule of the target sentence.

Further, as an implementation of the method for detecting a text for a target sentence, an embodiment of the present invention further provides a text detection device, which is mainly used for detecting whether a text contains content matched with the target sentence, and classifying the text according to a determination result. For convenience of reading, details in the foregoing method embodiments are not described in detail again in this apparatus embodiment, but it should be clear that the apparatus in this embodiment can correspondingly implement all the contents in the foregoing method embodiments. The device specifically includes:

the text acquisition unit is used for acquiring a text to be detected;

a sentence determining unit, configured to determine a set of sentences to be detected in the text to be detected obtained by the text obtaining unit;

the similarity detection unit is used for detecting the semantic similarity between the sentences in the sentence set to be detected determined by the sentence determination unit and the target sentences one by utilizing a semantic similarity model;

a sentence detection unit, configured to determine whether a sentence with a semantic similar to the target sentence exists in the text to be detected according to the detection result obtained by the similarity detection unit;

and the text detection unit is used for determining whether the text to be detected meets the requirements or not according to the judgment result obtained by the sentence detection unit and the detection rule corresponding to the target sentence.

Further, the apparatus further comprises:

a sample acquisition unit for acquiring a plurality of labeled sentence pairs; wherein the labels are semantically similar or semantically dissimilar;

and a model training unit, configured to train the semantic similarity model based on the large number of labeled sentence pairs acquired by the sample acquisition unit, so that the similarity detection unit 23 can detect the semantic similarity between the sentences in the sentence set to be detected and the target sentence by applying the trained semantic similarity model.

Further, the sentence determination unit includes:

and the sentence interception first module is used for traversing the text to be detected word by word, intercepting paragraphs with the length consistent with that of the target sentence, and putting the paragraphs into a sentence set to be detected.

Further, the sentence determination unit further includes:

the positioning word acquisition module is used for segmenting the target sentence, removing stop words to obtain one or more target words, and forming a positioning word set by the one or more target words;

and a sentence interception second module, configured to search the text to be detected for each positioning word in the positioning word set acquired by the positioning word acquisition module, and if the positioning word is found, intercept from the text to be detected a paragraph that contains the positioning word and whose length is consistent with that of the target sentence as a sentence to be detected.

Further, the positioning word obtaining module is further configured to determine synonyms of at least some target words in the one or more target words; and forming a positioning word set by the one or more target words and the synonyms.

Further, the operation of the sentence interception second module in intercepting a paragraph containing the positioning word and having a length consistent with that of the target sentence from the text to be detected specifically includes:

and enabling the position of the positioning word in the intercepted paragraph to be consistent with the position of the positioning word or the synonym of the positioning word in the target sentence.

Further, the operation of the positioning word obtaining module in determining synonyms of at least some target words in the one or more target words specifically includes:

obtaining synonyms of at least some of the target words by using a word vector model, wherein the word vector model is a machine learning model for identifying synonyms, obtained through neural network training.

Further, the text acquisition unit is specifically configured to convert the voice to be quality-checked into a text, so as to obtain the text to be detected.

Further, an embodiment of the present invention also provides a storage medium, where a computer program is stored on the storage medium, where the computer program is executed by one or more computing devices to implement the text detection method described above.

In addition, embodiments of the present invention also provide a system including one or more computing devices and one or more storage devices, on which a computer program is recorded, where the computer program, when executed by the one or more computing devices, causes the one or more computing devices to implement the text detection method described above.

In summary, the text detection method and apparatus provided in the embodiments of the present invention detect a text to be detected or a text obtained by converting speech data based on a data model trained by big data statistics and a neural network algorithm, determine whether the content of the text to be detected contains a content similar to the semantic of a target phrase set by a user, implement detection of the text to be detected from semantic dimensions, and improve the accuracy of text detection.

In the foregoing embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

It will be appreciated that the relevant features of the method and apparatus described above are referred to one another. In addition, "first", "second", and the like in the above embodiments are for distinguishing the embodiments, and do not represent merits of the embodiments.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

The algorithms and displays presented herein are not inherently related to any particular computer, virtual machine, or other apparatus. Various general purpose systems may also be used with the teachings herein. The required structure for constructing such a system will be apparent from the description above. Moreover, the present invention is not directed to any particular programming language. It is appreciated that a variety of programming languages may be used to implement the teachings of the present invention as described herein, and any descriptions of specific languages are provided above to disclose the best mode of the invention.

In addition, the memory may include volatile memory in a computer readable medium, Random Access Memory (RAM) and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.

As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.

Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media do not include transitory media such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.

The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A text detection method, wherein the method comprises:

acquiring a text to be detected;

determining co-occurrence words of the target phrase, wherein the co-occurrence words refer to words or phrases which are used together with the target phrase or commonly occur with the target phrase in the text;

2. The method of claim 1, wherein determining paragraphs to be detected in the text to be detected using the co-occurrence words comprises:

3. The method of claim 2, wherein the determining a paragraph to be detected according to the position of the co-occurrence word in the text to be detected and a preset paragraph length comprises:

or,

4. The method of claim 1, wherein the detecting whether the text passage to be detected contains content matching the target phrase by using a machine learning model comprises:

5. The method of claim 1, wherein the determining co-occurring words of a target phrase comprises:

6. The method according to any one of claims 1-5, wherein the method further comprises:

7. The method according to any one of claims 1-5, wherein the obtaining text to be detected comprises:

8. A text detection apparatus, wherein the apparatus comprises:

the text acquisition unit is used for acquiring a text to be detected;

a co-occurrence word determining unit, configured to determine a co-occurrence word of the target phrase, where the co-occurrence word is a word or a phrase that is used in combination with the target phrase or that co-occurs with the target phrase in the text;

9. The apparatus of claim 8, wherein the paragraph determination unit comprises:

10. The apparatus of claim 9, wherein the first determining means comprises:

11. The apparatus of claim 8, wherein the content matching unit comprises:

12. The apparatus according to claim 8, wherein the co-occurrence word determining unit is specifically configured to obtain the corresponding co-occurrence word from a preset co-occurrence word list according to the target phrase, where the preset co-occurrence word list is obtained based on statistics of a text set in the same technical field as the text to be detected.

13. The apparatus of any of claims 8-12, wherein the apparatus further comprises:

14. The apparatus according to any one of claims 8 to 12, wherein the text acquisition unit is specifically configured to convert the speech to be quality-checked into a text to obtain the text to be detected.

15. A computer-readable storage medium, having a computer program stored thereon, wherein the computer program, when executed by one or more computing devices, implements the method of any of claims 1-7.

16. A system comprising one or more computing devices and one or more storage devices having a computer program recorded thereon, which, when executed by the one or more computing devices, causes the one or more computing devices to carry out the method of any one of claims 1-7.