CN109460461A - Text matching method and system based on a text similarity model

Info

Publication number
CN109460461A
Authority
CN
China
Prior art keywords
text
default
similarity
string
sentence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811344782.5A
Other languages
Chinese (zh)
Inventor
朱钦佩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
AI Speech Ltd
Original Assignee
AI Speech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by AI Speech Ltd
Priority to CN201811344782.5A
Publication of CN109460461A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems

Abstract

An embodiment of the present invention provides a text matching method based on a text similarity model. The method comprises: receiving text information and determining a feature vector of the text information, where the feature vector includes at least a text string, text pinyin, and word vectors; inputting the feature vector into the text similarity model; obtaining the feature similarity output by the text similarity model; and determining, according to the feature similarity, at least one preset sentence that reaches a preset feature threshold as the matched text of the text information. Embodiments of the present invention also provide a text matching system based on the text similarity model, and a training method and training system for the text similarity model. By using a text similarity model that considers feature vectors of several dimensions, the embodiments determine the feature similarity between the user's input sentence and each preset sentence in the text similarity model, and thereby determine a more accurate matched text.

Description

Text matching method and system based on a text similarity model
Technical field
The present invention relates to the field of natural language processing, and in particular to a text matching method and system based on a text similarity model.
Background art
Text similarity computation is a basic problem in natural language processing, and many fields rely on text similarity algorithms. In everyday use, because of colloquial phrasing, input-method behavior, typing slips and so on, what a user writes is not as standard as a document, yet it still implies the information the user wants. Capturing these weak signals accurately requires a text similarity algorithm. For example, a user enters "where to put up a bridge on the Yangtze River" when the user really means to ask "where is the Yangtze River Bridge". How to find "Yangtze River Bridge" in a preset corpus from "put up a bridge on the Yangtze River" is an important application scenario for text similarity algorithms. As another example, a user says "navigate to Beiyi Liuyuan" (a colloquial short form), and the problem is how to find "Peking University Sixth Hospital" in the preset corpus from that short form. To handle these problems, existing approaches generally express text similarity as the number of similar characters between two strings, or use a statistical model that gathers similarity statistics from the multiple utterances a user makes within one dialogue, or rely on manual collection.
In the course of implementing the present invention, the inventor found at least the following problems in the related art:
Counting the similar characters between two strings solves part of the problem, but it struggles to recognize similar text produced by spelling mistakes. For example, it rates "Chiba Ramen" (qian ye la mian) as more similar to "Ajisen Ramen" (wei qian la mian) than the mistyped "Dangerous Ramen" (wei xian la mian) is to "Ajisen Ramen" (wei qian la mian). Statistical models, in turn, usually depend on session sampling tools (such as input methods and search engines) and have narrow coverage, while manual collection is costly.
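The ramen example can be checked with a short sketch. The Chinese strings below are reconstructed from the romanizations given in the text, and `difflib.SequenceMatcher` is only an assumed stand-in for a pinyin similarity measure; the patent does not name one.

```python
from difflib import SequenceMatcher
from typing import List


def char_overlap(a: str, b: str) -> int:
    """Count the distinct characters that two strings share."""
    return len(set(a) & set(b))


def pinyin_ratio(a: List[str], b: List[str]) -> float:
    """Similarity in [0, 1] between two pinyin sequences, compared as plain text."""
    return SequenceMatcher(None, " ".join(a), " ".join(b)).ratio()


# (characters, pinyin) pairs; the characters are an assumption based on the pinyin.
target = ("味千拉面", ["wei", "qian", "la", "mian"])  # "Ajisen Ramen"
chiba = ("千叶拉面", ["qian", "ye", "la", "mian"])    # "Chiba Ramen"
typo = ("危险拉面", ["wei", "xian", "la", "mian"])    # mistyped "Dangerous Ramen"

overlap_chiba = char_overlap(chiba[0], target[0])  # shares three characters
overlap_typo = char_overlap(typo[0], target[0])    # shares two characters
pinyin_chiba = pinyin_ratio(chiba[1], target[1])
pinyin_typo = pinyin_ratio(typo[1], target[1])
```

Character counting prefers "Chiba Ramen", while the pinyin comparison prefers the mistyped name, which is the gap the invention sets out to close.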
Summary of the invention
The embodiments aim to solve at least the following problems in the prior art: similarity is inaccurate when only the number of similar characters between strings is considered, statistical methods have narrow coverage, and manual collection is costly.
In a first aspect, an embodiment of the present invention provides a training method for a text similarity model, comprising:
receiving a dictionary training set, performing word segmentation on each preset sentence in the dictionary training set, and determining the text string of the preset sentence;
determining, according to the text string of each preset sentence, the word vectors corresponding to the text string and the text pinyin corresponding to the text string;
determining, according to the text string, text pinyin, and word vectors corresponding to each preset sentence, the feature vector corresponding to each preset sentence, and training the text similarity model.
In a second aspect, an embodiment of the present invention provides a text matching method based on a text similarity model, comprising:
receiving text information and determining the feature vector of the text information, where the feature vector includes at least: a text string, text pinyin, and word vectors;
inputting the feature vector into the text similarity model;
obtaining the feature similarity output by the text similarity model;
determining, according to the feature similarity, at least one preset sentence that reaches a preset feature threshold as the matched text of the text information.
In a third aspect, an embodiment of the present invention provides a training system for a text similarity model, comprising:
a text string determination program module, configured to receive a dictionary training set, perform word segmentation on each preset sentence in the dictionary training set, and determine the text string of the preset sentence;
a word vector and text pinyin determination program module, configured to determine, according to the text string of each preset sentence, the word vectors corresponding to the text string and the text pinyin corresponding to the text string;
a text similarity model training program module, configured to determine, according to the text string, text pinyin, and word vectors corresponding to each preset sentence, the feature vector corresponding to each preset sentence, and to train the text similarity model.
In a fourth aspect, an embodiment of the present invention provides a text matching system based on a text similarity model, comprising:
a feature vector determination program module, configured to receive text information and determine the feature vector of the text information, where the feature vector includes at least: a text string, text pinyin, and word vectors;
a feature vector input program module, configured to input the feature vector into the text similarity model;
a feature similarity acquisition program module, configured to obtain the feature similarity output by the text similarity model;
a text matching program module, configured to determine, according to the feature similarity, at least one preset sentence that reaches a preset feature threshold as the matched text of the text information.
In a fifth aspect, an electronic device is provided, comprising: at least one processor, and a memory communicatively connected to the at least one processor, where the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the steps of the training method for a text similarity model and of the text matching method based on a text similarity model of any embodiment of the present invention.
In a sixth aspect, an embodiment of the present invention provides a storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the steps of the training method for a text similarity model and of the text matching method based on a text similarity model of any embodiment of the present invention.
The beneficial effect of the embodiments is as follows: the text similarity model is trained on several feature vectors determined for each word, so the model parameters are richer, more features are involved, and the resulting text similarity is more accurate. By using a text similarity model that considers feature vectors of several dimensions, the feature similarity between the user's input sentence and each preset sentence in the model is determined, and a more accurate matched text is then found. The preset dictionary is comparatively easy to collect, so the cost is comparatively low.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed to describe the embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flow chart of a training method for a text similarity model provided by an embodiment of the present invention;
Fig. 2 is a flow chart of a text matching method based on a text similarity model provided by an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a training system for a text similarity model provided by an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a text matching system based on a text similarity model provided by an embodiment of the present invention.
Detailed description of the embodiments
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present invention.
Fig. 1 shows a flow chart of a training method for a text similarity model provided by an embodiment of the present invention. The method includes the following steps:
S11: receive a dictionary training set, perform word segmentation on each preset sentence in the dictionary training set, and determine the text string of the preset sentence;
S12: according to the text string of each preset sentence, determine the word vectors corresponding to the text string and the text pinyin corresponding to the text string;
S13: according to the text string, text pinyin, and word vectors corresponding to each preset sentence, determine the feature vector corresponding to each preset sentence, and train the text similarity model.
In this embodiment, the method no longer compares only the number of directly similar characters between text strings. It introduces new parameters and weighs several aspects together, so the text similarity model it uses needs further training.
For step S11, a dictionary training set is received. The dictionary training set contains words that a large number of users may use in daily life, for example "First Affiliated Hospital of Peking University", "Second Affiliated Hospital of Peking University", "Third Affiliated Hospital of Peking University", "Fourth Affiliated Hospital of Peking University", "KFC", "McDonald's", "Ajisen Ramen", "Pepper Workshop", "Friendship Bridge", "Shahe Bridge", "Yongding River Bridge", "Zhenyang Bridge", "Yangtze River Bridge", "Caobai River Bridge", and so on. After the dictionary training set is received, word segmentation is performed on each preset sentence in it to determine the text string of the preset sentence. For example, the text string of "Yangtze River Bridge" is s1 = Yangtze River_Bridge, where the underscore marks the word boundary. The word segmentation system may split off a single character or a whole word.
For step S12, the word vectors and text pinyin corresponding to the text string are determined according to the text string of each preset sentence. After step S11 has determined the text string s1 = Yangtze River_Bridge, its text pinyin p1 and word vectors w1 are determined from the text string, giving p1 = chang jiang | da qiao and w1 = (0.323, 0.123, ...)(0.564, 0.348, ...). When the text string contains Chinese characters, they are mapped to the corresponding text pinyin; when the text string contains English characters, the text pinyin of those characters is the English characters themselves.
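Steps S11 and S12 can be sketched as follows, assuming the sentence has already been segmented. The character-to-pinyin table, the word vector table, and the names `to_pinyin` and `extract_features` are hypothetical; the Chinese characters are reconstructed from the pinyin in the example, and the vectors are cut to the two components the text shows before its ellipsis.

```python
# Hypothetical character-to-pinyin table; a real system would need a full dictionary.
PINYIN = {"长": "chang", "江": "jiang", "大": "da", "桥": "qiao"}

# Placeholder word vectors, truncated to the two components the text shows.
WORD_VECTORS = {"长江": (0.323, 0.123), "大桥": (0.564, 0.348)}


def to_pinyin(word):
    """Map characters found in the table to pinyin; runs of other
    characters (for example English letters) are kept unchanged."""
    parts, other = [], ""
    for ch in word:
        if ch in PINYIN:
            if other:
                parts.append(other)
                other = ""
            parts.append(PINYIN[ch])
        else:
            other += ch
    if other:
        parts.append(other)
    return " ".join(parts)


def extract_features(words):
    """Build the text string, text pinyin, and word vectors of a segmented sentence."""
    return {
        "string": "_".join(words),
        "pinyin": " | ".join(to_pinyin(w) for w in words),
        "vectors": [WORD_VECTORS[w] for w in words],
    }


features = extract_features(["长江", "大桥"])
```

Here `features["pinyin"]` has the same shape as p1 in the text, and an English word such as "KFC" passes through `to_pinyin` unchanged.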
For step S13, the feature vector corresponding to each preset sentence is determined according to the text string, text pinyin, and word vectors corresponding to that sentence. The feature vector covers the text string feature, the text pinyin feature, and the word vector feature of the preset sentence, and the text similarity model is then trained with these feature vectors.
As this embodiment shows, the text similarity model is trained on several feature vectors determined for each word, so the model parameters are richer, more features are involved, and the resulting text similarity is more accurate.
Fig. 2 shows a flow chart of a text matching method based on a text similarity model provided by an embodiment of the present invention. The method includes the following steps:
S21: receive text information and determine the feature vector of the text information, where the feature vector includes at least: a text string, text pinyin, and word vectors;
S22: input the feature vector into the text similarity model;
S23: obtain the feature similarity output by the text similarity model;
S24: determine, according to the feature similarity, at least one preset sentence that reaches a preset feature threshold as the matched text of the text information.
In this embodiment, the text similarity model trained as described in claim 1 is put to practical use.
For step S21, text information is received. The text information may come from voice input, in which case the device performs speech recognition to obtain it, or the user may enter it through the input method of the device. For example, a user enters text through an input method and, because of a slip of the hand, carelessness, or some other reason, types "Yangtze River bridging". The feature vector of the entered "Yangtze River bridging" is then determined, including the text string, text pinyin, and word vectors: text string s2 = Yangtze River_bridging, text pinyin p2 = chang jiang | da qiao, word vectors w2 = (0.1234, 0.2133, ...)(0.823, 0.234, ...).
For step S22, the feature vector determined in step S21 is input into the text similarity model and compared, feature by feature, with the preset sentences in the text similarity model.
For step S23, after step S22, the feature similarity output by the text similarity model is obtained. The feature similarity covers the similarity between the words entered by the user and each preset sentence in the text similarity model.
For step S24, according to the feature similarity determined in step S23, at least one preset sentence that reaches the preset threshold is determined as the matched text of the text information.
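Steps S21 to S24 can be sketched end to end as below. The patent does not disclose the internals of the trained model, so `feature_similarity` is only an assumed stand-in: an equally weighted mean of a string ratio, a pinyin ratio, and a cosine over one placeholder vector per sentence. The Chinese strings are reconstructed from the example, and all vector values are invented for illustration.

```python
import math
from difflib import SequenceMatcher


def ratio(a, b):
    """String similarity in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def feature_similarity(query, preset, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Assumed stand-in for the trained model: weighted mean of three similarities."""
    scores = (
        ratio(query["string"], preset["string"]),
        ratio(query["pinyin"], preset["pinyin"]),
        cosine(query["vector"], preset["vector"]),
    )
    return sum(w * s for w, s in zip(weights, scores))


def match(query, presets, threshold):
    """Return (score, text) pairs that reach the threshold, best first."""
    scored = [(feature_similarity(query, p), p["string"]) for p in presets]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


query = {"string": "长江搭桥", "pinyin": "chang jiang da qiao", "vector": (0.9, 0.1)}
presets = [
    {"string": "长江大桥", "pinyin": "chang jiang da qiao", "vector": (1.0, 0.0)},
    {"string": "沙河大桥", "pinyin": "sha he da qiao", "vector": (0.2, 0.98)},
]
matches = match(query, presets, threshold=0.8)
```

With these placeholder values only the Yangtze River Bridge entry reaches the threshold, because its pinyin is identical to the mistyped input even though one character differs.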
As this embodiment shows, a text similarity model that considers feature vectors of several dimensions is used to determine the feature similarity between the user's input sentence and each preset sentence in the model, and a more accurate matched text is then determined. The preset dictionary is comparatively easy to collect, so the cost is comparatively low.
As an implementation, in this embodiment the preset feature threshold includes a preset text threshold, and obtaining the feature similarity output by the text similarity model comprises:
when the feature vector includes at least the text string, determining the text similarity between the text information and each preset sentence according to the text string of the text information and the text string of each preset sentence in the text similarity model;
taking the preset sentences whose text similarity exceeds the preset text threshold as a matched string set;
determining, according to at least the text string, text pinyin, and word vectors in the feature vector, the feature similarity between the text information and the preset sentences in the matched string set.
In this embodiment, the preset feature threshold includes a preset text threshold. When the feature vector includes at least the text string, the text similarity between the text information and each preset sentence is determined from the text string of the text information entered by the user and the text string of each preset sentence in the text similarity model. In other words, one of the several features is used first for a similarity comparison, which yields a smaller matched string set of preset sentences that exceed the preset text threshold.
After the matched string set is determined, the several features are used together to determine the feature similarity between the text information entered by the user and the preset sentences in the matched string set.
As this embodiment shows, a single feature is first used to screen the preset sentences of the text similarity model. Once a comparatively small matched string set has been filtered out, the corresponding matched text is determined from the several features, which makes determining the matched text more efficient.
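The two-stage idea can be sketched as a coarse string filter followed by full scoring of the survivors. `full_score` is a hypothetical placeholder for the multi-feature model that also records how often it runs, `SequenceMatcher` is an assumed text similarity measure, and the Chinese dictionary entries are reconstructed from the translated names.

```python
from difflib import SequenceMatcher


def text_similarity(a, b):
    """Cheap single-feature similarity on the raw strings."""
    return SequenceMatcher(None, a, b).ratio()


full_score_calls = []


def full_score(query, preset):
    """Placeholder for the multi-feature model; records each preset it scores."""
    full_score_calls.append(preset)
    return text_similarity(query, preset)


def two_stage_match(query, presets, text_threshold):
    """Filter by text similarity first, then run the full model on the survivors."""
    candidates = [p for p in presets if text_similarity(query, p) > text_threshold]
    ranked = sorted(((full_score(query, p), p) for p in candidates), reverse=True)
    return candidates, ranked


presets = ["长江大桥", "沙河大桥", "肯德基", "麦当劳"]
candidates, ranked = two_stage_match("长江搭桥", presets, text_threshold=0.2)
```

Only two of the four presets reach the expensive scoring step, which is where the efficiency gain described above comes from.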
As an implementation, in this embodiment the preset feature threshold includes a preset pinyin threshold, and obtaining the feature similarity output by the text similarity model comprises:
when the feature vector includes at least the text pinyin, determining the pinyin similarity between the text information and each preset sentence according to the text pinyin of the text information and the text pinyin of each preset sentence in the text similarity model;
taking the preset sentences whose pinyin similarity exceeds the preset pinyin threshold as a matched pinyin set;
determining, according to at least the text string, text pinyin, and word vectors in the feature vector, the feature similarity between the text information and the preset sentences in the matched pinyin set.
In this embodiment, the preset feature threshold includes a preset pinyin threshold. When the feature vector includes at least the text pinyin, the pinyin similarity between the text information and each preset sentence is determined from the text pinyin of the text information entered by the user and the text pinyin of each preset sentence in the text similarity model. Likewise, one of the several features is used first for a similarity comparison, which yields a smaller matched pinyin set of preset sentences that exceed the preset pinyin threshold.
After the matched pinyin set is determined, the several features are used together to determine the feature similarity between the text information entered by the user and the preset sentences in the matched pinyin set.
As this embodiment shows, a single feature is first used to screen the preset sentences of the text similarity model. Once a comparatively small matched pinyin set has been filtered out, the corresponding matched text is determined from the several features, which makes determining the matched text more efficient.
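One possible pinyin similarity for this screening step is an edit distance counted in whole syllables. This is an assumption for illustration, since the patent does not say how pinyin similarity is computed; the function names and the threshold value are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (here: lists of syllables)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or keep
        prev = curr
    return prev[-1]


def pinyin_similarity(a, b):
    """1.0 for identical syllable sequences, 0.0 when every syllable differs."""
    longest = max(len(a), len(b))
    return 1.0 - edit_distance(a, b) / longest if longest else 1.0


def pinyin_prefilter(query, presets, pinyin_threshold):
    """Keep the presets whose pinyin similarity exceeds the threshold."""
    return [name for name, syllables in presets.items()
            if pinyin_similarity(query, syllables) > pinyin_threshold]


presets = {
    "Ajisen Ramen": ["wei", "qian", "la", "mian"],
    "Chiba Ramen": ["qian", "ye", "la", "mian"],
    "Yangtze River Bridge": ["chang", "jiang", "da", "qiao"],
}
query = ["wei", "xian", "la", "mian"]  # the mistyped "Dangerous Ramen"
matched_pinyin_set = pinyin_prefilter(query, presets, pinyin_threshold=0.6)
```

The mistyped query differs from "Ajisen Ramen" by one syllable out of four, so only that preset survives the filter.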
As an implementation, in this embodiment the preset feature threshold includes a preset vector threshold, and obtaining the feature similarity output by the text similarity model comprises:
when the feature vector includes at least the word vectors, determining the vector similarity between the text information and each preset sentence according to the word vectors of the text information and the word vectors of each preset sentence in the text similarity model;
taking the preset sentences whose vector similarity exceeds the preset vector threshold as a matched vector set;
determining, according to at least the text string, text pinyin, and word vectors in the feature vector, the feature similarity between the text information and the preset sentences in the matched vector set.
In this embodiment, the preset feature threshold includes a preset vector threshold. When the feature vector includes at least the word vectors, the vector similarity between the text information and each preset sentence is determined from the word vectors of the text information entered by the user and the word vectors of each preset sentence in the text similarity model. Likewise, one of the several features is used first for a similarity comparison, which yields a smaller matched vector set of preset sentences that exceed the preset vector threshold.
After the matched vector set is determined, the several features are used together to determine the feature similarity between the text information entered by the user and the preset sentences in the matched vector set.
As this embodiment shows, a single feature is first used to screen the preset sentences of the text similarity model. Once a comparatively small matched vector set has been filtered out, the corresponding matched text is determined from the several features, which makes determining the matched text more efficient.
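A minimal sketch of the vector screening step, assuming each sentence is reduced to the average of its word vectors and compared by cosine similarity. Both choices are assumptions, since the patent does not specify how vector similarity is computed, and the toy vectors and preset names are invented.

```python
import math


def sentence_vector(word_vectors):
    """Average the word vectors of one sentence into a single vector."""
    n = len(word_vectors)
    return [sum(column) / n for column in zip(*word_vectors)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def vector_prefilter(query_vectors, presets, vector_threshold):
    """Keep the presets whose vector similarity exceeds the threshold."""
    q = sentence_vector(query_vectors)
    return [name for name, vectors in presets.items()
            if cosine(q, sentence_vector(vectors)) > vector_threshold]


# Toy two-dimensional word vectors, invented for illustration only.
presets = {
    "A": [(1.0, 0.0), (0.0, 1.0)],
    "B": [(1.0, 0.0), (1.0, 0.0)],
    "C": [(-1.0, 0.0), (0.0, -1.0)],
}
matched_vector_set = vector_prefilter([(1.0, 0.0), (0.0, 1.0)], presets,
                                      vector_threshold=0.9)
```

Preset "A" points in the same direction as the query and is kept, while "B" and "C" fall below the threshold.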
As an implementation, in this embodiment, determining, according to the feature similarity, at least one preset sentence that reaches the preset feature threshold as the matched text of the text information comprises:
when, in order of similarity from high to low, only one preset sentence exceeding the preset feature threshold is determined as the matched text of the text information, taking that preset sentence as the matched text of the text information; or
when, in order of similarity from high to low, at least two preset sentences exceeding the preset feature threshold are determined as the matched text of the text information, sending the at least two preset sentences to the user;
receiving the preset sentence selected by the user;
taking the selected preset sentence as the matched text of the text information.
In this embodiment, the preset sentences that reach the preset feature threshold can be determined, in order of similarity from high to low, as the matched text of the text information. When only one matched text is determined, for example when the text information entered by the user is "Yangtze River bridging" and the single matched text found in order of similarity from high to low is "Yangtze River Bridge", "Yangtze River Bridge" is taken as the matched text of the entered "Yangtze River bridging".
When at least two matched texts are determined, for example when the text information entered by the user is "BJ Univ Hospital" and the matched texts found by similarity are "Peking University First Hospital", "Peking University Second Hospital", "Peking University Third Hospital", and so on, they are fed back to the user and the preset sentence selected by the user is received. If the user selects "Peking University Third Hospital", that selected preset sentence is taken as the matched text of the text information.
As this embodiment shows, determining a specified number of matched texts gives the user more ways to match, widens the matching range, and also improves the user experience.
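The selection logic can be sketched as below. The function name `resolve_match`, the `choose` callback that stands in for asking the user, and the scores are hypothetical; returning `None` when nothing passes the threshold is an assumption, since the text does not cover that case.

```python
def resolve_match(scored, threshold, choose):
    """Pick the matched text from (score, text) pairs.

    One hit above the threshold is returned directly; with several hits the
    candidates are handed, best first, to `choose`, which returns the user's pick.
    """
    hits = [text for score, text in sorted(scored, reverse=True) if score > threshold]
    if not hits:
        return None  # assumption: the source does not describe this case
    if len(hits) == 1:
        return hits[0]
    return choose(hits)


# Made-up scores for the two examples in the text.
single = resolve_match(
    [(0.95, "Yangtze River Bridge"), (0.30, "Shahe Bridge")],
    threshold=0.8,
    choose=lambda hits: hits[0],
)

offered = []


def user_picks_third(hits):
    """Simulate a user who is shown the list and selects the third entry."""
    offered.extend(hits)
    return hits[2]


several = resolve_match(
    [(0.91, "Peking University First Hospital"),
     (0.90, "Peking University Second Hospital"),
     (0.89, "Peking University Third Hospital"),
     (0.10, "KFC")],
    threshold=0.8,
    choose=user_picks_third,
)
```

Passing the user interaction in as a callback keeps the ranking logic separate from however the device actually presents the choices.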
Fig. 3 shows a schematic structural diagram of a training system for a text similarity model provided by an embodiment of the present invention. The system can execute the training method for a text similarity model described in any embodiment above and is configured in a terminal.
The training system for a text similarity model provided by this embodiment includes: a text string determination program module 11, a word vector and text pinyin determination program module 12, and a text similarity model training program module 13.
The text string determination program module 11 is configured to receive a dictionary training set, perform word segmentation on each preset sentence in the dictionary training set, and determine the text string of the preset sentence. The word vector and text pinyin determination program module 12 is configured to determine, according to the text string of each preset sentence, the word vectors corresponding to the text string and the text pinyin corresponding to the text string. The text similarity model training program module 13 is configured to determine, according to the text string, text pinyin, and word vectors corresponding to each preset sentence, the feature vector corresponding to each preset sentence, and to train the text similarity model.
Fig. 4 shows a schematic structural diagram of a text matching system based on a text similarity model provided by an embodiment of the present invention. The system can execute the text matching method based on a text similarity model described in any embodiment above and is configured in a terminal.
The text matching system based on a text similarity model provided by this embodiment includes: a feature vector determination program module 21, a feature vector input program module 22, a feature similarity acquisition program module 23, and a text matching program module 24.
The feature vector determination program module 21 is configured to receive text information and determine the feature vector of the text information, where the feature vector includes at least: a text string, text pinyin, and word vectors. The feature vector input program module 22 is configured to input the feature vector into the text similarity model. The feature similarity acquisition program module 23 is configured to obtain the feature similarity output by the text similarity model. The text matching program module 24 is configured to determine, according to the feature similarity, at least one preset sentence that reaches a preset feature threshold as the matched text of the text information.
Further, the default characteristic threshold value includes pre-set text threshold value, and the characteristic similarity obtains program module For:
When described eigenvector include at least text-string when, according to the text-string of the text information with it is described The text-string of each default sentence determines the text of the text information and each default sentence in text similarity model Similarity;
The default sentence that the text similarity is more than pre-set text threshold value is determined as matched character string set;
According at least to text-string, the text phonetic, term vector in described eigenvector, determine the text information with The characteristic similarity of default sentence in the matched character string set.
Further, the default characteristic threshold value includes default phonetic threshold value, and the characteristic similarity obtains program module For:
When described eigenvector includes at least text phonetic, according to the text phonetic of the text information and the text The text phonetic of each default sentence determines the pinyin similarity of the text information and each default sentence in similarity model;
The pinyin similarity is determined to be more than to preset the default sentence of phonetic threshold value as matching phonetic set;
According at least to text-string, the text phonetic, term vector in described eigenvector, determine the text information with The characteristic similarity of default sentence in the matching phonetic set.
Further, the default characteristic threshold value includes default vector threshold, and the characteristic similarity obtains program module For:
It is similar to the text according to the term vector of the text information when described eigenvector includes at least term vector The term vector of each default sentence determines the vector similarity of the text information and each default sentence in degree model;
The vector similarity is determined to be more than to preset the default sentence of vector threshold as matching vector set;
According at least to text-string, the text phonetic, term vector in described eigenvector, determine the text information with The characteristic similarity of default sentence in the matching vector set.
Further, the text matching program module is configured to:
When, in descending order of similarity, only one preset sentence exceeding the preset feature threshold is determined as the matched text of the text information, take that preset sentence as the matched text of the text information; or
When, in descending order of similarity, at least two preset sentences exceeding the preset feature threshold are determined as the matched text of the text information, send the at least two preset sentences to the user;
Receive the preset sentence selected by the user;
Take the selected preset sentence as the matched text of the text information.
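The selection logic of the text matching program module can be sketched as below. The callback `ask_user` is a hypothetical stand-in for sending the candidates to the user and receiving the selection, and returning `None` when nothing exceeds the threshold is an assumption, since the description does not cover that case.

```python
from typing import Callable, Optional


def select_match(
    scores: dict[str, float],
    feature_threshold: float,
    ask_user: Callable[[list[str]], str],
) -> Optional[str]:
    """Pick the matched text from per-sentence feature similarities."""
    # Rank preset sentences from highest to lowest similarity.
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    candidates = [sentence for sentence, score in ranked if score > feature_threshold]
    if not candidates:
        return None  # assumed behaviour; not specified in the description
    if len(candidates) == 1:
        return candidates[0]
    # Two or more candidates: let the user choose among them.
    return ask_user(candidates)


if __name__ == "__main__":
    pick_first = lambda options: options[0]
    print(select_match({"navigate home": 0.92, "play music": 0.31}, 0.5, pick_first))
```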
An embodiment of the present invention further provides a non-volatile computer storage medium storing computer-executable instructions that can execute the training method of the text similarity model in any of the above method embodiments;
As an implementation, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
Receive a dictionary training set and perform word segmentation on each preset sentence in the dictionary training set to determine the text string of the preset sentence;
According to the text string of each preset sentence, determine the word vector corresponding to the text string and the text pinyin corresponding to the text string;
According to the text string, text pinyin, and word vector corresponding to each preset sentence, determine the feature vector corresponding to each preset sentence, and train the text similarity model.
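The three training steps above can be sketched as a feature-building pass over the training set. The description does not say what "training" involves beyond determining the feature vectors, so this sketch simply stores the features for each preset sentence. The class `SentenceFeatures` and the caller-supplied callables `segment`, `pinyin_of`, and `vector_of` are hypothetical placeholders for a real word segmenter, pinyin converter, and word-vector lookup.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SentenceFeatures:
    text_string: list[str]            # tokens produced by word segmentation
    text_pinyin: list[str]            # pinyin for each token
    word_vectors: list[list[float]]   # one vector per token


def build_features(sentence: str,
                   segment: Callable[[str], list[str]],
                   pinyin_of: Callable[[str], str],
                   vector_of: Callable[[str], list[float]]) -> SentenceFeatures:
    """Segment one preset sentence, then derive pinyin and word vectors from its tokens."""
    tokens = segment(sentence)
    return SentenceFeatures(tokens,
                            [pinyin_of(t) for t in tokens],
                            [vector_of(t) for t in tokens])


def build_model(training_set: list[str], segment, pinyin_of, vector_of) -> dict[str, SentenceFeatures]:
    """Map every preset sentence in the training set to its feature record."""
    return {s: build_features(s, segment, pinyin_of, vector_of) for s in training_set}


if __name__ == "__main__":
    # Placeholder callables: whitespace split, lowercase as "pinyin", token length as a 1-d vector.
    model = build_model(["Play music"], str.split, str.lower, lambda t: [float(len(t))])
    print(model["Play music"])
```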
An embodiment of the present invention further provides a non-volatile computer storage medium storing computer-executable instructions that can execute the text matching method based on the text similarity model in any of the above method embodiments;
As an implementation, the non-volatile computer storage medium of the present invention stores computer-executable instructions configured to:
Receive text information and determine the feature vector of the text information, wherein the feature vector includes at least: a text string, text pinyin, and a word vector;
Input the feature vector into the text similarity model;
Obtain the feature similarity output by the text similarity model;
Determine, according to the feature similarity, at least one preset sentence that reaches the preset feature threshold as the matched text of the text information.
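How the three per-feature similarities are merged into one feature similarity is not spelled out in this passage. As one possible reading, the sketch below combines them with a weighted sum and keeps the preset sentences that reach the threshold; the equal default weights and the names `feature_similarity` and `rank_matches` are assumptions, not the method of the patent.

```python
def feature_similarity(text_sim: float, pinyin_sim: float, vector_sim: float,
                       weights: tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Weighted sum of the three similarities (assumed combination rule)."""
    w_text, w_pinyin, w_vector = weights
    return w_text * text_sim + w_pinyin * pinyin_sim + w_vector * vector_sim


def rank_matches(per_sentence: dict[str, tuple[float, float, float]],
                 feature_threshold: float) -> list[tuple[str, float]]:
    """Score every preset sentence and return those reaching the threshold, best first."""
    scored = {s: feature_similarity(*sims) for s, sims in per_sentence.items()}
    kept = [(s, score) for s, score in scored.items() if score >= feature_threshold]
    return sorted(kept, key=lambda kv: kv[1], reverse=True)


if __name__ == "__main__":
    sims = {
        "navigate home": (0.5, 1.0, 0.9),   # (text, pinyin, vector) similarities
        "play music": (0.0, 0.0, 0.0),
    }
    print(rank_matches(sims, 0.7))
```

In practice the weights would be tuned on held-out data, and a learned model could replace the weighted sum entirely.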
As a non-volatile computer-readable storage medium, it can be used to store non-volatile software programs, non-volatile computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for testing software in the embodiments of the present invention. One or more program instructions are stored in the non-volatile computer-readable storage medium and, when executed by a processor, perform the training method of the text similarity model and the text matching method based on the text similarity model in any of the above method embodiments.
The non-volatile computer-readable storage medium may include a program storage area and a data storage area, where the program storage area can store an operating system and the application programs required by at least one function, and the data storage area can store data created through use of the test software device, and so on. In addition, the non-volatile computer-readable storage medium may include high-speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some embodiments, the non-volatile computer-readable storage medium optionally includes memory located remotely from the processor, and such remote memory can be connected to the test software device through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
An embodiment of the present invention further provides an electronic device comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the steps of the training method of the text similarity model and the text matching method based on the text similarity model of any embodiment of the present invention.
The client of the embodiments of the present application exists in a variety of forms, including but not limited to:
(1) Mobile communication devices: these devices are characterized by mobile communication functions, and their main purpose is to provide voice and data communication. Such terminals include smartphones (such as the iPhone), multimedia phones, feature phones, and low-end phones.
(2) Ultra-mobile personal computer devices: these devices belong to the category of personal computers, have computing and processing functions, and generally also have mobile Internet access. Such terminals include PDA, MID, and UMPC devices, such as the iPad.
(3) Portable entertainment devices: these devices can display and play multimedia content. Such devices include audio and video players (such as the iPod), handheld devices, e-book readers, smart toys, and portable in-vehicle navigation devices.
(4) Other electronic devices with data processing functions.
Herein, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include" and "comprise" cover not only the listed elements but also other elements not explicitly listed, as well as elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes the element.
The apparatus embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement it without creative effort.
From the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus a necessary general-purpose hardware platform, and of course also by hardware. Based on this understanding, the above technical solution, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a computer-readable storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in each embodiment or in certain parts of an embodiment.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of the technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A training method for a text similarity model, comprising:
Receiving a dictionary training set and performing word segmentation on each preset sentence in the dictionary training set to determine the text string of the preset sentence;
According to the text string of each preset sentence, determining the word vector corresponding to the text string and the text pinyin corresponding to the text string;
According to the text string, text pinyin, and word vector corresponding to each preset sentence, determining the feature vector corresponding to each preset sentence, and training the text similarity model.
2. A text matching method based on the text similarity model according to claim 1, comprising:
Receiving text information and determining the feature vector of the text information, wherein the feature vector includes at least: a text string, text pinyin, and a word vector;
Inputting the feature vector into the text similarity model;
Obtaining the feature similarity output by the text similarity model;
Determining, according to the feature similarity, at least one preset sentence that reaches the preset feature threshold as the matched text of the text information.
3. The method according to claim 2, wherein the preset feature threshold includes a preset text threshold, and obtaining the feature similarity output by the text similarity model comprises:
When the feature vector includes at least a text string, determining the text similarity between the text information and each preset sentence according to the text string of the text information and the text string of each preset sentence in the text similarity model;
Determining the preset sentences whose text similarity exceeds the preset text threshold as a matched string set;
Determining the feature similarity between the text information and the preset sentences in the matched string set according to at least the text string, text pinyin, and word vector in the feature vector.
4. The method according to claim 2, wherein the preset feature threshold includes a preset pinyin threshold, and obtaining the feature similarity output by the text similarity model comprises:
When the feature vector includes at least text pinyin, determining the pinyin similarity between the text information and each preset sentence according to the text pinyin of the text information and the text pinyin of each preset sentence in the text similarity model;
Determining the preset sentences whose pinyin similarity exceeds the preset pinyin threshold as a matched pinyin set;
Determining the feature similarity between the text information and the preset sentences in the matched pinyin set according to at least the text string, text pinyin, and word vector in the feature vector.
5. The method according to claim 2, wherein the preset feature threshold includes a preset vector threshold, and obtaining the feature similarity output by the text similarity model comprises:
When the feature vector includes at least a word vector, determining the vector similarity between the text information and each preset sentence according to the word vector of the text information and the word vector of each preset sentence in the text similarity model;
Determining the preset sentences whose vector similarity exceeds the preset vector threshold as a matched vector set;
Determining the feature similarity between the text information and the preset sentences in the matched vector set according to at least the text string, text pinyin, and word vector in the feature vector.
6. The method according to claim 2, wherein determining, according to the feature similarity, at least one preset sentence that reaches the preset feature threshold as the matched text of the text information comprises:
When, in descending order of similarity, only one preset sentence exceeding the preset feature threshold is determined as the matched text of the text information, taking that preset sentence as the matched text of the text information; or
When, in descending order of similarity, at least two preset sentences exceeding the preset feature threshold are determined as the matched text of the text information, sending the at least two preset sentences to the user;
Receiving the preset sentence selected by the user;
Taking the selected preset sentence as the matched text of the text information.
7. A training system for a text similarity model, comprising:
A text string determining program module, configured to receive a dictionary training set and perform word segmentation on each preset sentence in the dictionary training set to determine the text string of the preset sentence;
A word vector and text pinyin determining program module, configured to determine, according to the text string of each preset sentence, the word vector corresponding to the text string and the text pinyin corresponding to the text string;
A text similarity model training program module, configured to determine the feature vector corresponding to each preset sentence according to the text string, text pinyin, and word vector corresponding to each preset sentence, and to train the text similarity model.
8. A text matching system based on the text similarity model according to claim 7, comprising:
A feature vector determining program module, configured to receive text information and determine the feature vector of the text information, wherein the feature vector includes at least: a text string, text pinyin, and a word vector;
A feature vector input program module, configured to input the feature vector into the text similarity model;
A feature similarity obtaining program module, configured to obtain the feature similarity output by the text similarity model;
A text matching program module, configured to determine, according to the feature similarity, at least one preset sentence that reaches the preset feature threshold as the matched text of the text information.
9. An electronic device comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the steps of the method of any one of claims 1-6.
10. A storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the steps of the method of any one of claims 1-6.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811344782.5A CN109460461A (en) 2018-11-13 2018-11-13 Text matching technique and system based on text similarity model

Publications (1)

Publication Number Publication Date
CN109460461A true CN109460461A (en) 2019-03-12

Family

ID=65610191

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811344782.5A Pending CN109460461A (en) 2018-11-13 2018-11-13 Text matching technique and system based on text similarity model

Country Status (1)

Country Link
CN (1) CN109460461A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8996515B2 (en) * 2008-06-24 2015-03-31 Microsoft Corporation Consistent phrase relevance measures
CN103605694A (en) * 2013-11-04 2014-02-26 北京奇虎科技有限公司 Device and method for detecting similar texts
CN104102626A (en) * 2014-07-07 2014-10-15 厦门推特信息科技有限公司 Method for computing semantic similarities among short texts
CN104239512A (en) * 2014-09-16 2014-12-24 电子科技大学 Text recommendation method
CN104699763A (en) * 2015-02-11 2015-06-10 中国科学院新疆理化技术研究所 Text similarity measuring system based on multi-feature fusion
CN106095928A (en) * 2016-06-12 2016-11-09 国家计算机网络与信息安全管理中心 A kind of event type recognition methods and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
梁敬东 等: "基于word2vec和LSTM的句子相似度计算及其", 《南京农业大学学报》 *

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110245606B (en) * 2019-06-13 2021-07-20 广东小天才科技有限公司 Text recognition method, device, equipment and storage medium
CN110245606A (en) * 2019-06-13 2019-09-17 广东小天才科技有限公司 A kind of text recognition method, device, equipment and storage medium
CN110413988A (en) * 2019-06-17 2019-11-05 平安科技(深圳)有限公司 Method, apparatus, server and the storage medium of text information matching measurement
CN110413988B (en) * 2019-06-17 2023-01-31 平安科技(深圳)有限公司 Text information matching measurement method, device, server and storage medium
CN110390015A (en) * 2019-07-23 2019-10-29 中国工商银行股份有限公司 A kind of data information processing method, apparatus and system
CN110516125A (en) * 2019-08-28 2019-11-29 拉扎斯网络科技(上海)有限公司 Identify method, apparatus, equipment and the readable storage medium storing program for executing of unusual character string
CN110516125B (en) * 2019-08-28 2020-05-08 拉扎斯网络科技(上海)有限公司 Method, device and equipment for identifying abnormal character string and readable storage medium
CN110717158A (en) * 2019-09-06 2020-01-21 平安普惠企业管理有限公司 Information verification method, device, equipment and computer readable storage medium
CN110717158B (en) * 2019-09-06 2024-03-01 冉维印 Information verification method, device, equipment and computer readable storage medium
CN111009244A (en) * 2019-12-06 2020-04-14 贵州电网有限责任公司 Voice recognition method and system
CN111159338A (en) * 2019-12-23 2020-05-15 北京达佳互联信息技术有限公司 Malicious text detection method and device, electronic equipment and storage medium
CN111159339A (en) * 2019-12-24 2020-05-15 北京亚信数据有限公司 Text matching processing method and device
WO2022001888A1 (en) * 2020-06-29 2022-01-06 北京字节跳动网络技术有限公司 Information generation method and device based on word vector generation model
CN111753551B (en) * 2020-06-29 2022-06-14 北京字节跳动网络技术有限公司 Information generation method and device based on word vector generation model
CN111753551A (en) * 2020-06-29 2020-10-09 北京字节跳动网络技术有限公司 Information generation method and device based on word vector generation model
CN112000767A (en) * 2020-07-31 2020-11-27 深思考人工智能科技(上海)有限公司 Text-based information extraction method and electronic equipment
WO2022095370A1 (en) * 2020-11-06 2022-05-12 平安科技(深圳)有限公司 Text matching method and apparatus, terminal device, and storage medium
CN113932518A (en) * 2021-06-02 2022-01-14 海信(山东)冰箱有限公司 Refrigerator and food material management method thereof
CN113932518B (en) * 2021-06-02 2023-08-18 海信冰箱有限公司 Refrigerator and food material management method thereof

Similar Documents

Publication Publication Date Title
CN109460461A (en) Text matching technique and system based on text similarity model
CN105976812B (en) A kind of audio recognition method and its equipment
US10043520B2 (en) Multilevel speech recognition for candidate application group using first and second speech commands
US20170164049A1 (en) Recommending method and device thereof
CN107526846B (en) Method, device, server and medium for generating and sorting channel sorting model
CN112037792B (en) Voice recognition method and device, electronic equipment and storage medium
CN103699530A (en) Method and equipment for inputting texts in target application according to voice input information
CN104361896B (en) Voice quality assessment equipment, method and system
CN113407850B (en) Method and device for determining and acquiring virtual image and electronic equipment
US20170171471A1 (en) Method and device for generating multimedia picture and an electronic device
CN104866308A (en) Scenario image generation method and apparatus
CN103235773B (en) The tag extraction method and device of text based on keyword
CN110517692A (en) Hot word audio recognition method and device
US20230029687A1 (en) Dialog method and system, electronic device and storage medium
CN111028828A (en) Voice interaction method based on screen drawing, screen drawing and storage medium
CN105354318A (en) File searching method and device
CN109410935A (en) A kind of destination searching method and device based on speech recognition
CN107112007A (en) Speech recognition equipment and audio recognition method
CN111859970B (en) Method, apparatus, device and medium for processing information
CN110570838B (en) Voice stream processing method and device
JP7372402B2 (en) Speech synthesis method, device, electronic device and storage medium
CN111680514A (en) Information processing and model training method, device, equipment and storage medium
CN111477212A (en) Content recognition, model training and data processing method, system and equipment
CN109147819A (en) Audio-frequency information processing method, device and storage medium
CN114708859A (en) Voice command word recognition training method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant after: Sipic Technology Co.,Ltd.

Address before: 215123 building 14, Tengfei Innovation Park, 388 Xinping street, Suzhou Industrial Park, Suzhou City, Jiangsu Province

Applicant before: AI SPEECH Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20190312