CN109271489A

CN109271489A - Text detection method and device

Info

Publication number: CN109271489A
Application number: CN201811247368.2A
Authority: CN
Inventors: 杨俊�; 陈立
Original assignee: 4Paradigm Beijing Technology Co Ltd
Current assignee: 4Paradigm Beijing Technology Co Ltd
Priority date: 2018-10-25
Filing date: 2018-10-25
Publication date: 2019-01-25
Anticipated expiration: 2038-10-25
Also published as: CN109271489B

Abstract

The invention discloses a text detection method and device, relating to the technical field of text detection, with the main purpose of improving the accuracy of text detection. The main technical solution of the invention is: obtain a text to be detected; determine the co-occurrence words of an object phrase; determine paragraphs to be detected in the text to be detected using the co-occurrence words; use a machine learning model to detect whether a paragraph to be detected contains content that matches the object phrase; and determine whether the text to be detected meets the requirements according to the matching result and the detection rule corresponding to the object phrase. The invention is used to check whether a text conforms to a standard.

Description

Text detection method and device

Technical field

The present invention relates to the technical field of text detection, and in particular to a text detection method and device.

Background technique

With the rise and development of the Internet, the number of text documents on the Internet has grown ever more rapidly, and text detection technology has been widely applied in fields such as document retrieval and public opinion management. By detecting whether a text contains information such as a phrase or sentence set by the user, it determines whether the text contains the content the user needs.

At present, common text detection mostly matches against keywords set by the user, or uses natural language processing techniques to compute near-synonyms of multiple keywords and performs text detection with those words. In many cases, however, and especially where a word has multiple meanings, existing detection approaches struggle to determine the semantics of the user-set keyword, and searching and matching by simple near-synonyms works even less well. The detection results obtained by existing approaches therefore still suffer from low accuracy and a narrow range of application.

Summary of the invention

In view of the above problems, the invention proposes a text detection method and device whose main purpose is to improve the accuracy of text detection.

To achieve the above purpose, the present invention mainly provides the following technical solutions:

In one aspect, the present invention provides a text detection method, which specifically includes:

Obtaining a text to be detected;

Determining the co-occurrence words of an object phrase;

Determining paragraphs to be detected in the text to be detected using the co-occurrence words;

Using a machine learning model to detect whether a paragraph to be detected contains content that matches the object phrase;

Determining whether the text to be detected meets the requirements according to the matching result and the detection rule corresponding to the object phrase.

Preferably, determining paragraphs to be detected in the text to be detected using the co-occurrence words includes:

For each co-occurrence word of the object phrase, searching for the co-occurrence word in the text to be detected;

Each time the co-occurrence word is found, determining one paragraph to be detected according to the position of the co-occurrence word in the text to be detected and a preset paragraph length;

If the object phrase has no co-occurrence word, or the object phrase has co-occurrence words but none of them appears in the text to be detected, determining the entire text to be detected as the paragraph to be detected.

Preferably, determining one paragraph to be detected according to the position of the co-occurrence word in the text to be detected and the preset paragraph length includes:

In the text to be detected, taking text of a preset length both before and after the co-occurrence word found, to obtain one paragraph to be detected;

Alternatively,

In the text to be detected, taking a paragraph whose length equals the preset length and which contains the co-occurrence word found as the paragraph to be detected.

Preferably, using a machine learning model to detect whether the paragraph to be detected contains content that matches the object phrase includes:

Using a term vector model to detect whether the similarity between the term vector of a phrase in the paragraph to be detected and the term vector of the object phrase reaches a preset similarity value, wherein the term vector model is a model for identifying synonyms obtained by neural network training;

If it does, determining that the text to be detected contains content that matches the object phrase.

Preferably, determining the co-occurrence words of the object phrase includes:

Obtaining the corresponding co-occurrence words from a preset co-occurrence vocabulary according to the object phrase, the preset co-occurrence vocabulary being obtained by statistics over a text collection in the same technical field as the text to be detected.

Preferably, the method further includes:

When the named entity corresponding to the object phrase needs to be identified, calling a named entity recognition model to judge whether the text to be detected contains a named entity corresponding to the object phrase;

If it does, the text to be detected meets the requirements;

If it does not, the text to be detected does not meet the requirements.

Preferably, obtaining the text to be detected includes:

Converting speech to be quality-inspected into text to obtain the text to be detected.

In another aspect, the present invention provides a text detection device, which specifically includes:

A text acquiring unit, for obtaining a text to be detected;

A co-occurrence word determination unit, for determining the co-occurrence words of an object phrase;

A paragraph determination unit, for determining paragraphs to be detected in the text to be detected obtained by the text acquiring unit, using the co-occurrence words obtained by the co-occurrence word determination unit;

A content matching unit, for using a machine learning model to detect whether a paragraph to be detected determined by the paragraph determination unit contains content that matches the object phrase;

A text detection unit, for determining whether the text to be detected meets the requirements according to the matching result obtained by the content matching unit and the detection rule corresponding to the object phrase.

Preferably, the paragraph determination unit includes:

A searching module, for searching, for each co-occurrence word of the object phrase, for the co-occurrence word in the text to be detected;

A first determining module, for determining, each time the searching module finds the co-occurrence word, one paragraph to be detected according to the position of the co-occurrence word in the text to be detected and a preset paragraph length;

A second determining module, for determining the entire text to be detected as the paragraph to be detected if the object phrase has no co-occurrence word, or if the object phrase has co-occurrence words but none of them appears in the text to be detected.

Preferably, the first determining module includes:

A first determination submodule, for taking, in the text to be detected, text of a preset length both before and after the co-occurrence word found, to obtain one paragraph to be detected;

A second determination submodule, for taking, in the text to be detected, a paragraph whose length equals the preset length and which contains the co-occurrence word found as the paragraph to be detected.

Preferably, the content matching unit includes:

A term vector detection module, for using a term vector model to detect whether the similarity between the term vector of a phrase in the paragraph to be detected and the term vector of the object phrase reaches a preset similarity value, wherein the term vector model is a model for identifying synonyms obtained by neural network training;

A determining module, for determining that the text to be detected contains content that matches the object phrase when the term vector detection module determines that the similarity reaches the preset similarity value.

Preferably, the co-occurrence word determination unit is specifically used for obtaining the corresponding co-occurrence words from a preset co-occurrence vocabulary according to the object phrase, the preset co-occurrence vocabulary being obtained by statistics over a text collection in the same technical field as the text to be detected.

Preferably, the device further includes:

A named entity detection unit, for calling a named entity recognition model, when the named entity corresponding to the object phrase needs to be identified, to judge whether the text to be detected contains a named entity corresponding to the object phrase; if it does, the text to be detected meets the requirements; if it does not, the text to be detected does not meet the requirements.

Preferably, the text acquiring unit is specifically used for converting speech to be quality-inspected into text to obtain the text to be detected.

In another aspect, the present invention provides a computer readable storage medium storing a computer program, wherein the computer program, when executed by one or more computing devices, implements the above text detection method.

In another aspect, the present invention provides a system including one or more computing devices and one or more storage devices, wherein a computer program is recorded on the one or more storage devices, and the computer program, when executed by the one or more computing devices, causes the one or more computing devices to implement the above text detection method.

With the above technical solution, the text detection method and device provided by the invention, when detecting whether a text to be detected contains an object phrase, use the co-occurrence words of the object phrase to determine paragraphs to be detected in the text. This locates the paragraphs in which a word matching the object phrase is likely to appear and acts as a preliminary filter over the sentences in the text to be detected. A machine learning model then checks the paragraphs to be detected one by one to judge whether they contain content that matches the object phrase. Because the machine learning model can make a more accurate judgment based on the semantics of the object phrase, the detection accuracy for the text to be detected is improved.

The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features and advantages of the invention are easier to understand, specific embodiments of the invention are set out below.

Brief description of the drawings

Various other advantages and benefits will become clear to those of ordinary skill in the art on reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered a limitation of the invention. Throughout the drawings, the same reference numbers denote the same parts. In the drawings:

Fig. 1 shows a flow chart of a text detection method proposed by an embodiment of the present invention;

Fig. 2 shows a flow chart of another text detection method proposed by an embodiment of the present invention;

Fig. 3 shows a block diagram of a text detection device proposed by an embodiment of the present invention;

Fig. 4 shows a block diagram of another text detection device proposed by an embodiment of the present invention.

Detailed description of the embodiments

Exemplary embodiments of the present invention are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the invention, it should be understood that the invention may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the invention can be understood thoroughly and its scope conveyed fully to those skilled in the art.

An embodiment of the invention provides a text detection method for detecting whether a text contains content that matches an object phrase and for determining from the judgment result whether the text meets the requirements. The specific steps of the method are shown in Fig. 1 and include:

101, text to be detected is obtained.

102, the co-occurrence word of object phrase is determined.

Here, a co-occurrence word is a word or phrase that is often used together with the object phrase, or that appears together with the object phrase in a text. The embodiment does not limit the number of co-occurrence words determined. They may be entered manually by the user, that is, the user specifies the co-occurrence words of the object phrase; or they may be obtained through data analysis and statistics. In general, co-occurrence words obtained through data analysis and statistics come from word-segmentation statistics over a large number of texts: for example, the words that appear adjacent to or near the object phrase are counted, and a word whose number of occurrences exceeds a preset value is determined to be a co-occurrence word of the object phrase.
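The statistical route described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the texts are already segmented into tokens, "nearby" is taken to mean within a fixed token window, and the function name `build_cooccurrence_words`, the `window` parameter, and the toy corpus are all hypothetical rather than taken from the embodiment.

```python
from collections import Counter

def build_cooccurrence_words(tokenized_texts, object_phrase, window=3, threshold=1):
    """Count tokens that appear within `window` tokens of the object phrase and
    keep those whose count exceeds `threshold` (the preset value, assumed here)."""
    counts = Counter()
    for tokens in tokenized_texts:
        for i, tok in enumerate(tokens):
            if tok != object_phrase:
                continue
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return sorted(w for w, c in counts.items() if c > threshold)

# Toy, already-segmented corpus (illustrative data only).
corpus = [
    ["please", "confirm", "your", "address", "now"],
    ["confirm", "the", "mailing", "address", "please"],
    ["your", "address", "is", "updated"],
]
cooc = build_cooccurrence_words(corpus, "address", window=3, threshold=1)
print(cooc)
```

A larger window or a lower threshold yields more co-occurrence words and therefore more candidate paragraphs in the later steps; the right trade-off depends on the corpus.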

103, paragraph to be detected is determined in text to be detected using co-occurrence word.

Here, the embodiment does not specifically limit whether a paragraph to be detected is a passage, a sentence, or a phrase of several words in the text to be detected.

The purpose of this step is to mark or filter out the paragraphs in the text to be detected that contain a co-occurrence word. Note that because there can be several co-occurrence words, a single paragraph to be detected may also contain more than one of them.

104, it whether is detected in paragraph to be detected using machine learning model containing the content to match with object phrase.

In the embodiment, the machine learning model is a model obtained by neural network training that can identify phrases or words whose semantics match those of the object phrase. In practical applications, the input of the machine learning model is a paragraph to be detected and the object phrase, and the output is the matching result, that is, whether the paragraph to be detected matches the object phrase. A match means the paragraph to be detected contains a word whose semantics are identical or close to those of the object phrase; a mismatch means that none of the paragraphs to be detected in the text contains such a word.

105, determine whether text to be detected meets the requirements according to matching result and the corresponding detected rule of object phrase.

Specifically, this step is used to classify a large number of texts to be detected, that is, to extract the texts that meet the requirements. In practical applications, when judging a text to be detected against a single object phrase, the detection rule may be either that the text meets the requirements when the matching result is a match, or that the text meets the requirements when the matching result is a mismatch. In general, several object phrases can be set for detecting a text, in which case a preset judgment rule is also needed to determine whether the text meets the requirements. For example, the text is determined to meet the requirements when it matches any one of the object phrases; or the text is determined to meet the requirements only when it matches all of the object phrases.
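One way to picture these rules is the sketch below. It assumes each object phrase carries a flag saying whether it must appear or must not appear, and that the per-phrase outcomes are combined with either an "any" or an "all" rule; the names `PhraseRule`, `must_appear`, and `mode` are hypothetical, and combining per-phrase outcomes this way is a slight generalization of the examples in the text.

```python
from dataclasses import dataclass

@dataclass
class PhraseRule:
    phrase: str
    must_appear: bool = True  # False: the text passes only if the phrase is absent

def phrase_ok(rule, matched):
    """A single phrase is satisfied when the match result agrees with its rule."""
    return matched == rule.must_appear

def text_meets_requirements(rules, match_results, mode="all"):
    """Combine the per-phrase outcomes with an 'any' or 'all' judgment rule."""
    outcomes = [phrase_ok(r, match_results[r.phrase]) for r in rules]
    if mode == "any":
        return any(outcomes)
    if mode == "all":
        return all(outcomes)
    raise ValueError("mode must be 'any' or 'all'")

rules = [
    PhraseRule("thanks"),                                # must appear
    PhraseRule("guaranteed returns", must_appear=False)  # must not appear
]
# Match results as they might come out of the matching step (illustrative).
results = {"thanks": True, "guaranteed returns": True}
print(text_meets_requirements(rules, results, mode="all"))
print(text_meets_requirements(rules, results, mode="any"))
```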

As the embodiment shown in Fig. 1 makes clear, when detecting a text the embodiment first uses the co-occurrence words corresponding to the object phrase to locate the positions in the text that need checking, then uses a machine learning model to perform semantic matching on the located paragraphs to be detected to judge whether they contain a word that matches the object phrase, and finally determines whether the text meets the requirements according to the detection rule of the object phrase. Because the co-occurrence words and the machine learning model used in this scheme draw on techniques such as big-data statistics and neural networks, the method can judge more accurately whether the semantics of the text match those of the object phrase, which improves the accuracy of text detection.

Further, in another embodiment of the invention, the text detection method is applied to speech detection. Here the text to be detected is the text obtained by converting the speech to be quality-inspected into text. This field mainly covers industries built on voice services, such as human customer service over the telephone and online e-commerce, which generate large amounts of voice data every day. To evaluate the service level of human customer service objectively, this voice data must be inspected to judge whether it contains meaningful utterances that can be used for evaluation, namely the object phrases. The service process of the customer service staff is then evaluated based on the object phrases. For this application scenario, the embodiment describes the detailed text detection process through the text detection method shown in Fig. 2:

201, text is converted to quality inspection voice, obtains text to be detected.

The specific way of converting speech to text is not limited; any existing speech recognition program can be used to convert the voice data into text and generate the text to be detected.

202, corresponding co-occurrence word is obtained from preset co-occurrence vocabulary according to object phrase.

Specifically, an object phrase is generally a specific condition set by the user for judging whether a text meets the requirements. For example, object phrases can be used to judge whether the wording of a customer service agent or consultant is professional and standard. In general, one or more object phrases can be set; when there are several, the corresponding co-occurrence words are obtained for each object phrase in turn to form a co-occurrence word set.

The preset co-occurrence vocabulary in this embodiment is obtained by statistics over a text collection in the same technical field as the text to be detected. The text collection for a technical field can be obtained by classifying the texts of a public data set according to preset technical fields. The user selects the technical field to which the text to be detected belongs, which determines the text collection of the same field; word-segmentation statistics are then performed on the texts in that collection to create the preset co-occurrence vocabulary for the field. The preset co-occurrence vocabulary records the segmented words in the collection that have co-occurrence words, and the number or frequency of co-occurrences between a co-occurrence word and a segmented word must exceed a preset value.

Note that the preset co-occurrence vocabulary changes as the content of the text collection changes: as new texts are added and content is updated, the co-occurrence words of each segmented word in the vocabulary change accordingly. The preset co-occurrence vocabulary in this step is therefore a data table that can be updated dynamically. When an object phrase is received for text detection, the object phrase is looked up in the preset co-occurrence vocabulary in real time; if it is present, the corresponding co-occurrence words are obtained, and if not, a prompt indicates that the object phrase has no co-occurrence word.
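A dynamically updated table of this kind might look like the following sketch. It assumes co-occurrence is counted per text (two words co-occur once if they appear in the same segmented text) and that an empty list stands in for the "no co-occurrence word" prompt; the class name `CooccurrenceTable` and its methods are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

class CooccurrenceTable:
    """Updatable co-occurrence vocabulary for one technical field (illustrative)."""

    def __init__(self, threshold=1):
        self.threshold = threshold          # preset value a count must exceed
        self.counts = defaultdict(Counter)  # word -> Counter of co-occurring words

    def add_text(self, tokens):
        """Update the counts with one newly added, already-segmented text."""
        for a, b in combinations(sorted(set(tokens)), 2):
            self.counts[a][b] += 1
            self.counts[b][a] += 1

    def lookup(self, object_phrase):
        """Return the co-occurrence words, or [] if the phrase has none yet."""
        neighbours = self.counts.get(object_phrase, {})
        return sorted(w for w, c in neighbours.items() if c > self.threshold)

table = CooccurrenceTable(threshold=1)
table.add_text(["confirm", "address"])
before = table.lookup("address")   # count is 1, not above the threshold
table.add_text(["confirm", "mailing", "address"])
after = table.lookup("address")    # "confirm" has now co-occurred twice
print(before, after)
```

Because the lookup reads the live counts, a phrase that had no co-occurrence word yesterday can gain one as soon as enough new texts have been added.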

203, paragraph to be detected is determined in text to be detected using co-occurrence word.

Specifically, because the object phrase may or may not have co-occurrence words in the preset co-occurrence vocabulary, this step determines the paragraphs to be detected separately for the two cases.

First, when the object phrase has corresponding co-occurrence words, each co-occurrence word is searched for in turn in the text to be detected. Each time a co-occurrence word is found, one paragraph to be detected is determined according to the position of the co-occurrence word in the text to be detected and the preset paragraph length. The preset paragraph length is usually measured in characters, and the words or phrases within this length are assumed to be likely to include the object phrase or a phrase with semantics close to it.

In the embodiment, the feasible ways of determining one paragraph to be detected according to the preset paragraph length include:

First, in text to be detected, the text of preset length is respectively taken before and after the co-occurrence word found, obtains one Paragraph to be detected.

Second, in text to be detected, length is taken to be equal to preset length and include the paragraph work of the co-occurrence word found For paragraph to be detected.

The two ways differ in the position of the co-occurrence word within the paragraph to be detected. In the first way the position is relatively fixed; in the second it is not, that is, every possible way of cutting out a paragraph that has the preset length and contains the co-occurrence word is extracted. For example, suppose the co-occurrence word is A and the text to be detected contains the passage BCADF. If the first way takes a preset length of 1 letter on each side, it yields the single paragraph to be detected CAD. With a preset length of 3 letters, the second way yields three paragraphs to be detected: BCA, CAD and ADF.
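The BCADF example can be reproduced with a short sketch. Lengths are measured in characters, as the embodiment suggests; the function names `centered_window` and `sliding_windows` are hypothetical.

```python
def centered_window(text, start, word_len, side_len):
    """First way: take side_len characters on each side of the found word."""
    lo = max(0, start - side_len)
    hi = min(len(text), start + word_len + side_len)
    return text[lo:hi]

def sliding_windows(text, start, word_len, total_len):
    """Second way: every substring of length total_len that contains the word."""
    end = start + word_len
    first = max(0, end - total_len)            # leftmost start that still covers the word
    last = min(start, len(text) - total_len)   # rightmost start that fits in the text
    return [text[i:i + total_len] for i in range(first, last + 1)]

text = "BCADF"
pos = text.find("A")
print(centered_window(text, pos, 1, 1))   # CAD
print(sliding_windows(text, pos, 1, 3))   # ['BCA', 'CAD', 'ADF']
```

The first way produces one paragraph per hit and is cheaper; the second produces several overlapping paragraphs per hit, which costs more matching work but does not assume the object phrase sits symmetrically around the co-occurrence word.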

Second, when the object phrase has no corresponding co-occurrence word, the entire text to be detected is determined as the paragraph to be detected.

In addition, when the object phrase has corresponding co-occurrence words but none of them is found in the text to be detected, the entire text to be detected is likewise determined as the paragraph to be detected.
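Putting the three cases of step 203 together gives roughly the following sketch. It uses the symmetric-window way of cutting paragraphs and plain substring search to locate co-occurrence words; the names `find_all` and `paragraphs_to_detect` and the default `side_len` are assumptions made for illustration.

```python
def find_all(text, word):
    """Yield the start index of every occurrence of word in text."""
    i = text.find(word)
    while i != -1:
        yield i
        i = text.find(word, i + 1)

def paragraphs_to_detect(text, cooccurrence_words, side_len=10):
    """Return one paragraph per co-occurrence hit, or the whole text as fallback."""
    paragraphs = []
    for word in cooccurrence_words:
        if not word:
            continue  # ignore empty entries
        for start in find_all(text, word):
            lo = max(0, start - side_len)
            hi = min(len(text), start + len(word) + side_len)
            paragraphs.append(text[lo:hi])
    # No co-occurrence words at all, or none found in the text: use the entire text.
    return paragraphs or [text]

text = "please confirm your address today"
print(paragraphs_to_detect(text, ["confirm"], side_len=3))
print(paragraphs_to_detect(text, []))
```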

204, it whether is detected in paragraph to be detected using machine learning model containing the content to match with object phrase.

In this embodiment, the machine learning model is a term vector model, which is a model for identifying synonyms obtained by neural network training. Training the term vector model requires segmenting a large amount of corpus data into words and building a neural network model, which is trained to obtain the term vectors. In actual use, the segmented words or phrases of the paragraph to be detected and the object phrase are input to the term vector model. The model outputs the term vector of each segmented word and compares it with the term vector of the object phrase, that is, it calculates whether the similarity between the term vectors reaches a preset similarity value. If it does, the text to be detected is determined to contain content that matches the object phrase, and the matching result output by the term vector model is a successful match. If every paragraph to be detected in the text fails to match the object phrase, the output matching result is that the text to be detected does not contain the object phrase.

In one embodiment of the invention, the content that matches a phrase in a quality inspection entry refers to synonymous phrases (also called near-synonymous phrases). For example, if the phrase in the quality inspection entry is "thanks" and the quality inspection rule requires that the phrase must appear, then "thanks" and its synonymous phrases, such as "thank you", are searched for in the paragraph to be detected.
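The similarity check in step 204 can be illustrated as below. The embodiment does not name a similarity measure, so cosine similarity is an assumption here, and the three-dimensional vectors are hand-written toy values standing in for the output of a trained term vector model; `paragraph_matches`, `TOY_VECTORS`, and the 0.8 threshold are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def paragraph_matches(tokens, object_phrase, vectors, threshold=0.8):
    """True if any token's vector is at least `threshold` similar to the object phrase."""
    target = vectors.get(object_phrase)
    if target is None:
        return False
    return any(
        tok in vectors and cosine(vectors[tok], target) >= threshold
        for tok in tokens
    )

# Toy vectors only; a real system would load them from a trained model.
TOY_VECTORS = {
    "thanks": [0.9, 0.1, 0.0],
    "thank_you": [0.85, 0.2, 0.05],
    "address": [0.0, 0.1, 0.95],
}
print(paragraph_matches(["thank_you", "bye"], "thanks", TOY_VECTORS))
print(paragraph_matches(["address"], "thanks", TOY_VECTORS))
```

Tokens missing from the vector table are simply skipped, so an out-of-vocabulary word can never produce a match in this sketch.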

205, determine whether text to be detected meets the requirements according to matching result and the corresponding detected rule of object phrase.

This step is identical as the step 105 in Fig. 1, and details are not described herein again.

206, named entity detection is performed on the text to be detected using a named entity recognition model.

Here, named entity detection is a further check on the specific content corresponding to the object phrase. For example, when the object phrase is "identification card number", the corresponding named entity is a specific 18-character string of digits or letters; when the object phrase is "mailing address", the corresponding named entity is specific address information, including the district, street, and house number.

This is because in scenarios such as insurance telemarketing, when the telemarketer needs to confirm the policyholder, the beneficiary, the address and so on with the customer, the customer must be made to state the actual name and address. Take telephone car insurance sales as an example. After a sale succeeds, the salesperson needs to confirm information such as name and address with the policyholder. If the check simply looks for the word "address" to judge whether this information was confirmed, in practice the salesperson may only ask the policyholder "Has your address changed?" without actually confirming the address. Such open-ended information cannot be matched with a fixed sentence, so a feasible approach is to check whether a valid address follows the identified phrase, which requires a named entity recognition model.

Specifically, before this step is executed, the detection configuration file must specify which object phrases require named entity detection and the recognition format of the corresponding named entity information. At execution time, the named entity recognition model is called to judge whether the text to be detected contains a named entity corresponding to the object phrase, such as a person name, place name, or organization name. If it does, the text to be detected is determined to meet the requirements; otherwise, it is determined not to meet the requirements.

Note that the named entity recognition model, like the term vector model above, is obtained by neural network training; it identifies whether the content of a text includes a valid corresponding named entity. It is trained by taking an existing neural network model and training it on labeled samples; this embodiment does not limit the specific way.
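The shape of this check can be sketched without a trained model. The example below deliberately swaps the neural named entity recognition model for a regular expression, which is only workable for rigidly formatted entities such as an 18-character ID number; the configuration dictionary `ENTITY_PATTERNS`, the function `entity_present`, the 40-character look-ahead window, and the sample number are all made up for illustration.

```python
import re

# Hypothetical detection configuration: object phrase -> expected entity format.
# A regex stands in for the named entity recognition model in this sketch.
ENTITY_PATTERNS = {
    "ID number": re.compile(r"(?<!\d)\d{17}[\dXx](?!\d)"),
}

def entity_present(text, object_phrase, window=40):
    """Check whether a well-formed entity follows the object phrase within `window` characters."""
    pattern = ENTITY_PATTERNS[object_phrase]
    pos = text.find(object_phrase)
    if pos == -1:
        return False
    after = pos + len(object_phrase)
    return pattern.search(text[after:after + window]) is not None

confirmed = "Your ID number is 11010519491231002X, correct?"   # made-up sample value
not_confirmed = "Is your ID number unchanged?"
print(entity_present(confirmed, "ID number"))
print(entity_present(not_confirmed, "ID number"))
```

Entities such as addresses and person names have no fixed format, which is why the embodiment relies on a trained model rather than patterns like this one.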

Further, in one feasible embodiment of the invention, this step may also be used to perform further named entity detection on the texts to be detected that were determined in step 205 to meet the requirements. When the text to be detected contains the named entity corresponding to the object phrase, the text is determined to meet the requirements; otherwise it is determined not to.

The above describes in detail, with reference to Fig. 2, how the text detection method provided by the embodiment can likewise improve the accuracy of quality inspection when applied to voice data. For a quality inspection entry (object phrase) set by the user, the combined use of co-occurrence words and the term vector model makes it possible to identify accurately whether the voice data contains the object phrase or content semantically close to it. The invention further uses a named entity recognition model to check the content of the text to be detected, determining whether a valid corresponding named entity exists within a certain fixed range after the detected object phrase. For voice data, this makes it possible to judge whether the necessary named entity data is present, and thus to check whether the scripts used by customer service staff or consultants conform to the standard.

Further, as an implementation of the above text detection method, an embodiment of the invention provides a text detection device, which is mainly used to detect whether a text contains content that matches an object phrase and to classify the text according to the judgment result. For ease of reading, this device embodiment does not repeat the details of the foregoing method embodiment one by one, but it should be clear that the device in this embodiment can implement everything in the foregoing method embodiment. The device is shown in Fig. 3 and specifically includes:

A text acquiring unit 31, for obtaining a text to be detected;

A co-occurrence word determination unit 32, for determining the co-occurrence words of an object phrase;

A paragraph determination unit 33, for determining paragraphs to be detected in the text to be detected obtained by the text acquiring unit 31, using the co-occurrence words obtained by the co-occurrence word determination unit 32;

A content matching unit 34, for using a machine learning model to detect whether a paragraph to be detected determined by the paragraph determination unit 33 contains content that matches the object phrase;

A text detection unit 35, for determining whether the text to be detected meets the requirements according to the matching result obtained by the content matching unit 34 and the detection rule corresponding to the object phrase.
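The way units 31 to 35 hand data to one another can be pictured as a small composition of callables. This is only a structural sketch: the class `TextDetector` and its parameter names are hypothetical, and the stand-ins wired in below (identity text acquisition, no co-occurrence words, substring matching in place of the machine learning model) exist purely so the example runs.

```python
from typing import Callable, List

class TextDetector:
    """Chains five pluggable callables that play the roles of units 31 to 35."""

    def __init__(
        self,
        acquire_text: Callable[[str], str],                    # unit 31
        cooccurrence_words: Callable[[str], List[str]],        # unit 32
        select_paragraphs: Callable[[str, List[str]], List[str]],  # unit 33
        content_matches: Callable[[str, str], bool],           # unit 34
        apply_rule: Callable[[str, bool], bool],               # unit 35
    ):
        self.acquire_text = acquire_text
        self.cooccurrence_words = cooccurrence_words
        self.select_paragraphs = select_paragraphs
        self.content_matches = content_matches
        self.apply_rule = apply_rule

    def detect(self, source: str, object_phrase: str) -> bool:
        text = self.acquire_text(source)
        words = self.cooccurrence_words(object_phrase)
        paragraphs = self.select_paragraphs(text, words)
        matched = any(self.content_matches(p, object_phrase) for p in paragraphs)
        return self.apply_rule(object_phrase, matched)

# Trivial stand-ins, for illustration only.
detector = TextDetector(
    acquire_text=lambda source: source,
    cooccurrence_words=lambda phrase: [],
    select_paragraphs=lambda text, words: [text],
    content_matches=lambda paragraph, phrase: phrase in paragraph,
    apply_rule=lambda phrase, matched: matched,
)
print(detector.detect("thanks for calling", "thanks"))
```

Keeping each unit behind a callable makes it straightforward to swap the substring stand-in for a term vector model, or the identity function for a speech-to-text step, without touching the control flow.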

Further, as shown in figure 4, the paragraph determination unit 33 includes:

A searching module 331, for searching, for each co-occurrence word of the object phrase, for the co-occurrence word in the text to be detected;

A first determining module 332, for determining, each time the searching module 331 finds the co-occurrence word, one paragraph to be detected according to the position of the co-occurrence word in the text to be detected and a preset paragraph length;

A second determining module 333, for determining the entire text to be detected as the paragraph to be detected if the object phrase has no co-occurrence word, or if the object phrase has co-occurrence words but the searching module 331 does not find any of them in the text to be detected.

Further, as shown in Fig. 4, the first determining module 332 includes:

A first determining submodule 3321, configured to take, in the text to be detected, text of a preset length both before and after the found co-occurrence word to obtain one paragraph to be detected;

A second determining submodule 3322, configured to take, in the text to be detected, a paragraph whose length equals the preset length and which contains the found co-occurrence word as the paragraph to be detected.
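
The paragraph determination performed by modules 331 to 333 can be illustrated with a minimal Python sketch. It assumes character-based lengths and takes the preset length on each side of the found co-occurrence word (the variant handled by submodule 3321). The function name `find_paragraphs_to_detect` and its parameters are hypothetical and do not come from the embodiment.

```python
def find_paragraphs_to_detect(text, cooccurrence_words, preset_length):
    """Return one paragraph per co-occurrence word hit (hypothetical helper)."""
    paragraphs = []
    for word in cooccurrence_words:
        start = text.find(word)
        while start != -1:
            # Take preset_length characters before and after the hit,
            # clipped to the boundaries of the text.
            left = max(0, start - preset_length)
            right = min(len(text), start + len(word) + preset_length)
            paragraphs.append(text[left:right])
            start = text.find(word, start + 1)
    # No co-occurrence words, or none found: fall back to the whole text.
    return paragraphs or [text]
```

Falling back to the whole text mirrors the behaviour of the second determining module 333, at the cost of a larger paragraph for the later matching step.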

Further, as shown in Fig. 4, the content matching unit 34 includes:

A term vector detection module 341, configured to use a term vector model to detect whether the similarity between the term vector of a phrase in the paragraph to be detected and the term vector of the object phrase reaches a preset similarity value, wherein the term vector model is a model for identifying synonyms obtained through neural network training;

A determining module 342, configured to determine that the text to be detected contains content matching the object phrase when the term vector detection module 341 determines that the similarity reaches the preset similarity value.
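
The threshold check performed by modules 341 and 342 might look like the sketch below. The embodiment does not state how term vector similarity is measured; cosine similarity is an assumption made here, and the `vectors` lookup table stands in for a trained term vector model. All names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def paragraph_matches(paragraph_phrases, object_phrase, vectors, threshold):
    """True if any phrase in the paragraph is close enough to the object phrase."""
    target_vec = vectors[object_phrase]
    for phrase in paragraph_phrases:
        vec = vectors.get(phrase)  # phrases unknown to the model are skipped
        if vec is not None and cosine_similarity(vec, target_vec) >= threshold:
            return True
    return False
```

In practice the preset similarity value would be tuned on labeled data; the sketch only shows where that threshold enters the decision.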

Further, the co-occurrence word determination unit 32 is specifically configured to obtain the corresponding co-occurrence words from a preset co-occurrence vocabulary according to the object phrase, the preset co-occurrence vocabulary being obtained by statistics over a text collection in the same technical field as the text to be detected.
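
One way such a preset co-occurrence vocabulary could be counted is sketched below. The embodiment only states that the vocabulary is obtained by statistics over a same-field text collection; document-level co-occurrence, whitespace tokenization, single-token object phrases and the `top_k` cut-off are all assumptions of this sketch.

```python
from collections import Counter


def build_cooccurrence_vocabulary(documents, object_phrases, top_k=3):
    """Map each object phrase to the words that most often share a document with it."""
    vocabulary = {}
    for phrase in object_phrases:
        counts = Counter()
        for doc in documents:
            tokens = doc.split()  # assumption: text is already word-segmented
            if phrase in tokens:
                # Count each distinct companion word once per document.
                counts.update(t for t in set(tokens) if t != phrase)
        vocabulary[phrase] = [word for word, _ in counts.most_common(top_k)]
    return vocabulary
```

The determination unit would then simply look up `vocabulary[object_phrase]` at detection time.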

Further, as shown in Fig. 4, the device further includes:

A named entity detection unit 36, configured to, when the named entity corresponding to the object phrase needs to be identified, call a named entity recognition model to judge whether a named entity corresponding to the object phrase exists in the text to be detected; if it exists, the text detection unit 35 determines that the text to be detected meets the requirements; if it does not exist, the text detection unit 35 determines that the text to be detected does not meet the requirements.

Further, the text acquiring unit 31 is specifically configured to convert the voice to be quality-inspected into text to obtain the text to be detected.

Based on the above text detection method and device for object phrases, the embodiment of the present invention may further include a method that detects target sentences, that is, detecting whether the text to be detected contains a sentence matching a target sentence and determining from the matching result whether the text meets the requirements. A sentence usually consists of multiple phrases, so detecting target sentences allows the detection to address the semantics of whole sentences more effectively. The specific steps of this method include:

First, the text to be detected is obtained.

Second, the set of sentences to be detected in the text to be detected is determined.

The sentences in the set of sentences to be detected are determined based on the target sentence set by the user. Specifically, the text to be detected may be screened by the length of the target sentence, so that sentences of matching length are found and added to the set; alternatively, it may be screened using the significant words in the target sentence, so that sentences containing these words are determined as sentences to be detected and added to the set.

The main purpose of this step is to reduce the amount of computation required for the subsequent sentence matching by filtering out sentences in the text to be detected whose semantics clearly differ from those of the target sentence, thereby improving detection efficiency. For longer texts in particular, matching every sentence against the target sentence would consume a large amount of computing resources, make the matching time too long, and degrade the user experience.

Third, a semantic similarity model is used to detect, one by one, the semantic similarity between each sentence in the set of sentences to be detected and the target sentence.

The semantic similarity model is a data model obtained by using a neural network to train on a large number of labeled semantic pairs, and it can identify the semantic similarity between two sentences. A labeled semantic pair is a pair of sentences obtained through manual or automated processing. In this embodiment, the label of a pair may be "semantically similar" or "semantically dissimilar", or it may be the semantic similarity value of the two sentences. This step does not limit the specific way in which the semantic similarity model is trained based on a neural network algorithm.

When a sentence to be detected and the target sentence are input into the semantic similarity model, the model outputs a corresponding detection result, which may be a judgment of "similar" or "not similar", or a computed similarity value for the two sentences.

Fourth, whether the text to be detected contains a sentence semantically similar to the target sentence is judged according to the detection results.

Specifically, for one text to be detected, after all sentences to be detected have gone through the detection of step 103, if any of them is semantically similar to the target sentence, it can be determined that the text contains the target sentence; if none of them is, it is determined that the text does not contain the target sentence. In practical applications, a more specific judgment rule may also be set for this step, such as a threshold on the number of sentences: when the number of semantically similar sentences reaches the threshold, it is determined that the target sentence is present in the text to be detected.
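
The judgment in the fourth step can be sketched as follows, assuming the semantic similarity model returns a numeric score per sentence. The score threshold of 0.8 and the parameter names are illustrative assumptions; with `min_count=1` the function reduces to the basic rule that a single similar sentence suffices.

```python
def contains_target_sentence(similarity_scores, score_threshold=0.8, min_count=1):
    """Decide whether a text contains the target sentence (hypothetical rule).

    similarity_scores: one score per sentence to be detected, as produced
    by some semantic similarity model that is not shown here.
    """
    similar = sum(1 for score in similarity_scores if score >= score_threshold)
    return similar >= min_count
```

Raising `min_count` corresponds to the sentence-quantity threshold mentioned in the text.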

Fifth, whether the text to be detected meets the requirements is determined according to the judgment result and the detection rule corresponding to the target sentence.

After it has been determined whether the text to be detected contains a sentence semantically similar to the target sentence, the text is classified based on the detection rule of the target sentence, for example that a text containing the target sentence meets the requirements, or that a text containing the target sentence does not meet the requirements.

In practical applications, this step is used to classify a large number of texts to be detected, that is, to extract the texts that meet, or fail to meet, the requirements. When a text is judged against a single target sentence, the detection rule may specify that the text meets the requirements when the judgment result is that the sentences are semantically similar, or alternatively that the text meets the requirements when the result is a mismatch. In the general case, the user may set multiple target sentences for detecting a text, and whether the text meets the requirements is then determined according to a preset judgment rule: for example, the text is determined to meet the requirements as soon as it matches any one target sentence, or only when it matches all of the target sentences.
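
The combination of per-sentence results under a preset judgment rule can be sketched as below. The `mode` and `compliant_when_matched` parameters are hypothetical names for the two choices the text describes: whether any or all target sentences must match, and whether a match means the text meets or fails the requirements.

```python
def text_meets_requirements(match_results, mode="any", compliant_when_matched=True):
    """Apply a preset judgment rule to per-target-sentence match results.

    match_results: list of booleans, one per target sentence.
    """
    if mode == "any":
        matched = any(match_results)
    elif mode == "all":
        matched = all(match_results)
    else:
        raise ValueError("mode must be 'any' or 'all'")
    # Some rules treat a match as a violation rather than as compliance.
    return matched if compliant_when_matched else not matched
```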

As can be seen from the above embodiment, when performing text detection the embodiment of the present invention first screens the sentences in the text to obtain sentences to be detected, then uses the semantic similarity model to compare the semantics of each sentence to be detected with those of the target sentence, determines whether the text to be detected contains a sentence semantically similar to the target sentence, and finally judges according to the detection rule of the target sentence whether the text meets the requirements, thereby classifying the text. Because the semantic similarity model is obtained through neural network training, the text can be detected along the semantic dimension. Compared with existing recognition based on the segmented words of the target sentence, this is closer to human judgment, so the text detection method of the embodiment of the present invention is more accurate.

In addition, in another embodiment of the present invention, the text detection method may also be applied to the field of voice quality inspection, where the text to be detected is the text to be quality-inspected obtained by converting the voice to be quality-inspected into text. The industries concerned are mainly those built on voice services, such as telephone-based human customer service and online e-commerce. These industries generate large amounts of voice data every day, and to evaluate the service level of human customer service objectively, this voice data needs to be examined to judge whether it contains significant sentences that can be used for evaluation, namely target sentences. The service process of the customer service staff is then evaluated based on the target sentences. The text detection process shown in Fig. 1 above can thus be used to evaluate the voice of customer service staff against the evaluation criteria, that is, the target sentences. In this application scenario the voice needs to be converted into text; this embodiment does not limit the specific way in which this is done, and any existing speech recognition program may be used to convert the voice data into text and generate the text to be detected.

Further, the above description of the embodiment for target sentence detection shows that how the sentences to be detected are determined in the text to be detected, that is, how the set of sentences to be detected is obtained, directly affects the final detection result for the text. The specific implementations of the second step, "determining the set of sentences to be detected in the text to be detected", are therefore explained in detail below:

In the first scheme, the text to be detected is traversed character by character, and each segment whose length equals that of the target sentence is intercepted and put into the set of sentences to be detected.

This approach is relatively simple to implement, since the text to be detected is intercepted solely according to the length of the target sentence. However, a sentence to be detected obtained in this way may not be a complete sentence in the text and its meaning may be ambiguous, so the detection result may deviate to some extent. The number of sentences obtained in this way is also relatively large, which does not help the overall efficiency of text detection. On the other hand, precisely because they are so numerous, the sentences obtained cover the text to be detected more comprehensively, which can improve detection accuracy to some degree.
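
The first scheme amounts to a sliding window over the text. A minimal sketch, assuming lengths are counted in characters and using the hypothetical function name `sliding_window_candidates`:

```python
def sliding_window_candidates(text, target_sentence):
    """Return every segment of the text whose length equals the target sentence's."""
    n = len(target_sentence)
    if n == 0 or len(text) < n:
        return []
    # One candidate per starting position, so the set grows with the text length.
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

For a text of length m and a target of length n this yields m - n + 1 candidates, which illustrates why the scheme is comprehensive but costly for long texts.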

In the second scheme, the target sentence is segmented into words, one or more target words are obtained after stop words are removed, and the one or more target words form a positioning word set. For each positioning word in the set, the text to be detected is searched for that word; if it is found, a segment that contains the positioning word and whose length equals that of the target sentence is intercepted from the text to be detected as one sentence to be detected.

This is the preferred approach in the embodiment of the present invention.

In this approach, stop words are removed in order to obtain more precise positioning words. The target sentence may contain words with little positioning value, such as "you" and "I"; removing them reduces the number of words in the positioning word set, which in turn saves a large amount of computing resources when the positioning words are searched for in the text to be detected and improves detection efficiency.

When determining the positioning words, this approach may additionally obtain synonyms (or near-synonyms) of some of the target words. Synonyms are added to extend the semantics of the target words in the target sentence, which widens the screening range for sentences to be detected and prevents semantically similar sentences from being missed. The synonyms can be obtained with a term vector model, which is a machine learning model for identifying synonyms obtained through neural network training. The term vector model computes the term vectors of the input words to identify whether different words are semantically similar, so the synonyms of a target word can be computed and added to the positioning word set. In this embodiment, the number of positioning words in the set may be limited by a quantity range: when the number of positioning words is below the minimum of the range, the term vector model is used to obtain synonyms to make up the number, and when it is above the maximum, synonyms of some of the target words are removed. This ensures that the set holds a suitable number of positioning words and raises the probability of matching valid sentences to be detected in the text. It avoids both the case where too few positioning words match no sentence to be detected, which would force detection over the full text, and the case where too many positioning words match too many sentences to be detected, which would increase the amount of computation and reduce detection efficiency. Setting the quantity range therefore improves the precision of the matched sentences to be detected while keeping the detection process efficient.
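
The quantity range for the positioning word set can be sketched as follows. The `synonyms` dictionary stands in for the output of a term vector model, and the policy of trimming the most recently added synonyms first is an assumption of this sketch; the embodiment only says that synonyms of some target words are removed.

```python
def build_positioning_words(target_words, synonyms, min_count, max_count):
    """Build a positioning word set whose size respects [min_count, max_count].

    synonyms: dict mapping a target word to candidate synonyms (hypothetical
    stand-in for a term vector model).
    """
    # Target words always stay in the set; duplicates are dropped, order kept.
    words = list(dict.fromkeys(target_words))
    n_targets = len(words)
    if n_targets < min_count:
        # Too few positioning words: supplement with synonyms.
        for word in words[:n_targets]:
            for synonym in synonyms.get(word, []):
                if synonym not in words:
                    words.append(synonym)
    # Too many: drop surplus synonyms, never the target words themselves.
    return words[:max(max_count, n_targets)]
```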

In addition, in this approach, another condition for determining a sentence to be detected from a positioning word is that the length of the segment intercepted from the text to be detected equals the length of the target sentence. This is because, in most cases, two sentences that express the same meaning tend to use similar words in roughly the same number, so semantically identical sentences are generally about the same length. Based on this reasoning, when determining a sentence to be detected the embodiment of the present invention checks not only that the sentence contains a positioning word but also that its length is roughly the same as that of the target sentence. Further, the position of the positioning word within the intercepted sentence should be the same as its position within the target sentence.

As an example, assume that the target sentence is ABC and that C is a positioning word. When C is searched for in the text to be detected, a sentence ADBCE is found. This sentence cannot be matched against the target sentence directly and must first be intercepted; according to the interception rule above, the sentence to be detected that is finally intercepted is DBC. When the search uses the positioning word F, where F is a synonym of C, and a sentence RABFD is found, the sentence to be detected that is finally intercepted is ABF.
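
The interception rule in this example can be reproduced with a short sketch. It assumes character-based positions and, for a synonym, that the synonym has the same length as the target word it replaces; the function name and the `anchor_word` parameter are hypothetical.

```python
def intercept_candidates(text, target_sentence, positioning_word, anchor_word=None):
    """Intercept segments aligned so the positioning word sits where it does in the target.

    anchor_word: the word in the target sentence whose position is used;
    defaults to positioning_word, and is the original target word when
    positioning_word is a synonym.
    """
    anchor = anchor_word or positioning_word
    offset = target_sentence.find(anchor)
    if offset == -1:
        return []
    n = len(target_sentence)
    candidates = []
    hit = text.find(positioning_word)
    while hit != -1:
        start = hit - offset
        # Keep only segments that fit entirely inside the text.
        if start >= 0 and start + n <= len(text):
            candidates.append(text[start:start + n])
        hit = text.find(positioning_word, hit + 1)
    return candidates


print(intercept_candidates("ADBCE", "ABC", "C"))                   # ['DBC']
print(intercept_candidates("RABFD", "ABC", "F", anchor_word="C"))  # ['ABF']
```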

The above describes in detail how the sentences to be detected are determined in the text to be detected to generate the set of sentences to be detected. Combined with the other steps of target sentence detection, it shows that when performing text detection the embodiment of the present invention selects some of the sentences as sentences to be detected, tests their semantic similarity to the target sentence, and finally determines whether the text to be detected meets the requirements according to the detection results and the detection rule of the target sentence.

Further, as an implementation of the above text detection method for target sentences, an embodiment of the present invention also provides a text detection device, which is mainly used to detect whether a text contains content matching a target sentence and to classify the text according to the judgment result. For ease of reading, this device embodiment does not repeat the details of the foregoing method embodiment one by one, but it should be understood that the device in this embodiment can correspondingly implement all the content of the foregoing method embodiment. The device specifically includes:

A text acquiring unit, configured to obtain a text to be detected;

A sentence determination unit, configured to determine the set of sentences to be detected in the text to be detected obtained by the text acquiring unit;

A similarity detection unit, configured to use a semantic similarity model to detect, one by one, the semantic similarity between the target sentence and each sentence in the set of sentences to be detected determined by the sentence determination unit;

A sentence detection unit, configured to judge, according to the detection results obtained by the similarity detection unit, whether the text to be detected contains a sentence semantically similar to the target sentence;

A text detection unit, configured to determine whether the text to be detected meets the requirements according to the judgment result obtained by the sentence detection unit and the detection rule corresponding to the target sentence.

Further, the device further includes:

A sample acquisition unit, configured to obtain a large number of labeled sentence pairs, wherein the label is semantically similar or semantically dissimilar;

A model training unit, configured to train the semantic similarity model based on the large number of labeled sentence pairs obtained by the sample acquisition unit, so that the similarity detection unit 23 can use the trained semantic similarity model to detect the semantic similarity between the sentences in the set of sentences to be detected and the target sentence.

Further, the sentence determination unit includes:

A first sentence interception module, configured to traverse the text to be detected character by character, intercept each segment whose length equals that of the target sentence, and put it into the set of sentences to be detected.

Further, the sentence determination unit further includes:

A positioning word acquisition module, configured to segment the target sentence into words, obtain one or more target words after removing stop words, and form a positioning word set from the one or more target words;

A second sentence interception module, configured to, for each positioning word in the positioning word set obtained by the positioning word acquisition module, search the text to be detected for the positioning word and, if it is found, intercept from the text to be detected a segment that contains the positioning word and whose length equals that of the target sentence as one sentence to be detected.

Further, the positioning word acquisition module is also configured to determine synonyms of at least some of the one or more target words, and to form the positioning word set from the one or more target words and the synonyms.

Further, the operation performed by the second sentence interception module of intercepting, from the text to be detected, a segment that contains the positioning word and whose length equals that of the target sentence specifically includes:

Making the position of the positioning word in the intercepted segment consistent with the position, in the target sentence, of the positioning word or of the synonym of the positioning word.

Further, the operation performed by the positioning word acquisition module of determining synonyms of at least some of the one or more target words specifically includes:

Obtaining the synonyms of the at least some target words using a term vector model, the term vector model being a machine learning model for identifying synonyms obtained through neural network training.

Further, the text acquiring unit is specifically configured to convert the voice to be quality-inspected into text to obtain the text to be detected.

Further, an embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by one or more computing devices, implements the above text detection method.

In addition, an embodiment of the present invention also provides a system comprising one or more computing devices and one or more storage devices, wherein a computer program is recorded on the one or more storage devices, and the computer program, when executed by the one or more computing devices, causes the one or more computing devices to implement the above text detection method.

In conclusion, the text detection method and device proposed by the embodiment of the present invention use data models obtained from big-data statistics and neural network training to detect a text to be detected, or a text converted from voice data, and judge whether its content contains content semantically similar to the object phrase set by the user. The text to be detected is thus detected along the semantic dimension, which improves the accuracy of text detection.

In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.

It can be understood that the related features of the above method and device may refer to one another. In addition, "first", "second" and the like in the above embodiments are used to distinguish the embodiments and do not indicate that one embodiment is superior to another.

Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices and units described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.

The algorithms and displays provided herein are not inherently related to any particular computer, virtual system or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such a system is apparent. In addition, the present invention is not directed to any particular programming language. It should be understood that the content of the invention described herein can be implemented in various programming languages, and that the above description of a specific language is given to disclose the best mode of the invention.

In addition, the memory may include forms of computer-readable media such as volatile memory, random access memory (RAM) and/or non-volatile memory, for example read-only memory (ROM) or flash memory (flash RAM), and the memory includes at least one storage chip.

Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage and the like) containing computer-usable program code.

The present application is described with reference to flowcharts and/or block diagrams of the method, device (system) and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces and memory.

The memory may include forms of computer-readable media such as volatile memory, random access memory (RAM) and/or non-volatile memory, for example read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.

It should also be noted that the terms "include", "comprise" or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, commodity or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, commodity or device that includes the element.

The above are only embodiments of the present application and are not intended to limit it. For those skilled in the art, the present application may have various modifications and changes. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present application shall fall within the scope of the claims of the present application.

Claims

1. A text detection method, wherein the method comprises:

Obtaining a text to be detected;

Determining co-occurrence words of an object phrase;

Determining a paragraph to be detected in the text to be detected using the co-occurrence words;

Using a machine learning model to detect whether the paragraph to be detected contains content matching the object phrase;

Determining whether the text to be detected meets the requirements according to the matching result and the detection rule corresponding to the object phrase.

2. The method according to claim 1, wherein determining a paragraph to be detected in the text to be detected using the co-occurrence words comprises:

Each time the co-occurrence word is found, determining one paragraph to be detected according to the position of the co-occurrence word in the text to be detected and a preset paragraph length;

If the object phrase has no co-occurrence word, or the object phrase has co-occurrence words but none of them is present in the text to be detected, determining the entire text to be detected as the paragraph to be detected.

3. The method according to claim 2, wherein determining one paragraph to be detected according to the position of the co-occurrence word in the text to be detected and the preset paragraph length comprises:

In the text to be detected, taking text of a preset length both before and after the found co-occurrence word to obtain one paragraph to be detected;

Or,

In the text to be detected, taking a paragraph whose length equals the preset length and which contains the found co-occurrence word as the paragraph to be detected.

4. according to the method described in claim 1, wherein, the utilization machine learning model detects the text fragment to be detected In whether containing the content that matches with the object phrase include:

Utilize the phase of the term vector and the term vector of the object phrase of phrase in paragraph to be detected described in term vector model inspection Whether reach preset similarity value like degree；Wherein, the term vector model be obtained based on neural metwork training it is same for identification The model of adopted word；

5. The method according to claim 1, wherein determining the co-occurrence word of the object phrase comprises:

Obtaining the corresponding co-occurrence word from a preset co-occurrence vocabulary according to the object phrase, wherein the preset co-occurrence vocabulary is compiled statistically from a collection of texts in the same technical field as the text to be detected.
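
One possible way to compile such a co-occurrence vocabulary is sketched below. The claim only states that the vocabulary is obtained statistically from a text collection, so the details here are assumptions: counting at sentence level, whitespace tokenization (Chinese text would need a word segmenter), the `top_k` cut-off, and the function name `build_cooccurrence_vocabulary`.

```python
from collections import Counter

def build_cooccurrence_vocabulary(corpus, object_phrases, top_k=3,
                                  stopwords=frozenset()):
    """Map each object phrase to the words that most often share a sentence with it.

    corpus: iterable of sentences from the same technical field as the
            text to be detected.
    """
    vocabulary = {}
    for phrase in object_phrases:
        counts = Counter()
        for sentence in corpus:
            if phrase not in sentence:
                continue
            # Remove the phrase itself, then count each remaining word
            # at most once per sentence.
            remainder = sentence.replace(phrase, " ")
            counts.update(set(remainder.split()) - stopwords)
        vocabulary[phrase] = [word for word, _ in counts.most_common(top_k)]
    return vocabulary
```

At detection time the lookup is then a plain dictionary access, for example `vocabulary.get(object_phrase, [])`; an empty result corresponds to the case in which the object phrase has no co-occurrence word.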

6. The method according to any one of claims 1-5, wherein the method further comprises:

When the named entity corresponding to the object phrase needs to be identified, calling a named entity recognition model to judge whether a named entity name corresponding to the object phrase exists in the text to be detected;

If it exists, the text to be detected meets the requirements;

If it does not exist, the text to be detected does not meet the requirements.
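
The decision of claim 6 can be sketched as follows. The claim does not specify the named entity recognition model, so the sketch accepts any callable `recognize` that returns `(entity_text, entity_type)` pairs. The gazetteer-based `toy_recognizer`, the entity type labels, and the idea that an object phrase maps to an entity type are all hypothetical and serve only to make the example runnable.

```python
def toy_recognizer(text):
    """Hypothetical stand-in for a named entity recognition model."""
    gazetteer = {"Acme Corp": "ORG", "Beijing": "LOC"}
    return [(name, label) for name, label in gazetteer.items() if name in text]

def meets_entity_requirement(text, entity_type, recognize):
    """True if the text contains a named entity of the type tied to the object phrase."""
    return any(label == entity_type for _, label in recognize(text))
```

For example, if the object phrase were "company name" and were mapped to the type "ORG", a text mentioning "Acme Corp" would meet the requirements under this toy recognizer, and a text without any organization name would not.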

7. The method according to any one of claims 1-5, wherein obtaining the text to be detected comprises:

8. A text detection device, wherein the device comprises:

A text acquiring unit, configured to obtain the text to be detected;

A co-occurrence word determination unit, configured to determine the co-occurrence word of the object phrase;

A paragraph determination unit, configured to determine, using the co-occurrence word obtained by the co-occurrence word determination unit, the paragraph to be detected in the text to be detected obtained by the text acquiring unit;

A content matching unit, configured to detect, using a machine learning model, whether the paragraph to be detected determined by the paragraph determination unit contains content that matches the object phrase;

A text detection unit, configured to determine whether the text to be detected meets the requirements according to the matching result obtained by the content matching unit and the detection rule corresponding to the object phrase.

9. A computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, and the method according to any one of claims 1-7 is realized when the computer program is executed by one or more computing devices.

10. A system comprising one or more computing devices and one or more storage devices, wherein a computer program is recorded on the one or more storage devices, and the computer program, when executed by the one or more computing devices, causes the one or more computing devices to realize the method according to any one of claims 1-7.