CN105786793B - Method and apparatus for parsing the semantics of spoken-language text information - Google Patents

Publication number
CN105786793B
Authority
CN
China
Prior art keywords
feature
field
text information
regular expression
weighted value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510977813.0A
Other languages
Chinese (zh)
Other versions
CN105786793A (en
Inventor
陈由之
时培轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201510977813.0A
Publication of CN105786793A
Application granted
Publication of CN105786793B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application discloses a method and apparatus for parsing the semantics of spoken-language text information. One specific embodiment of the method includes: performing word segmentation on received spoken-language text information to extract features; determining the associated domain of the spoken-language text information from the nouns among the extracted features; in response to an extracted feature matching a preset feature of the associated domain in a preset database, taking the weight of the preset feature in the associated domain as the weight of the extracted feature in the associated domain, where the preset database may include, but is not limited to, the weights of preset features in multiple domains, and the multiple domains may include, but are not limited to, the associated domain; determining, based on the weights of the extracted features in the associated domain, the scores of the regular expressions of the text information in the associated domain; ranking the scores and obtaining a preset number of regular expressions according to the ranking result; and using the obtained regular expressions as the parsed text of the spoken-language text information. This embodiment improves the accuracy of the semantic parsing result.

Description

Method and apparatus for parsing the semantics of spoken-language text information
Technical field
This application relates to the field of computer technology, specifically to the technical field of speech recognition, and more particularly to a method and apparatus for parsing the semantics of spoken-language text information.
Background
Spoken-language semantic parsing means understanding the information carried by a spoken voice signal. After semantic parsing has been performed on the voice signal input by a user, retrieval can be carried out according to the parsed text of the spoken-language text information, which increases the speed of information retrieval and improves the ability to update information.
Currently used spoken-language semantic parsing methods first recognize the spoken voice signal as spoken-language text information and then parse that text information by rule matching to obtain its parsed text.
However, when current methods use rule matching to parse a given piece of spoken-language text information, they tend to produce multiple parsed texts and cannot determine which of them comes closest to the intent the user wants to express.
Summary of the invention
The purpose of this application is to propose an improved method and apparatus for parsing the semantics of spoken-language text information, so as to solve the technical problem mentioned in the Background section above.
In a first aspect, this application provides a method for parsing the semantics of spoken-language text information, the method comprising: performing word segmentation on received spoken-language text information to extract features; determining the associated domain of the spoken-language text information from the nouns among the extracted features; in response to an extracted feature matching a preset feature of the associated domain in a preset database, taking the weight of the preset feature in the associated domain as the weight of the extracted feature in the associated domain, where the preset database includes the weights of preset features in multiple domains, and the multiple domains include the associated domain; determining, based on the weights of the extracted features in the associated domain, the scores of the regular expressions of the text information in the associated domain; ranking the scores and obtaining a preset number of regular expressions according to the ranking result; and using the obtained regular expressions as the parsed text of the spoken-language text information.
In some embodiments, the weights of a preset feature in the multiple domains are determined by the following processing: dividing the number of occurrences of the preset feature in each of the multiple domains by the total word count of the text information samples in which the preset feature appears, to obtain the frequency with which the preset feature occurs in each domain; dividing the number of text information samples in which the preset feature appears by the total number of text information samples, to obtain the inverse document frequency of the preset feature, where the text information samples in which the preset feature appears and the total text information samples are obtained from historical data of spoken-language text information whose semantics have already been parsed; and multiplying the frequency with which the preset feature occurs in each domain by the inverse document frequency of the preset feature to obtain the weight of the preset feature in each domain, and obtaining, from the weights of the preset feature in each domain, the weights of the preset feature in the multiple domains.
In some embodiments, determining, based on the weights of the extracted features in the associated domain, the score of a regular expression of the text information in the associated domain comprises: in the associated domain, adding up the weights of those extracted features that hit the regular expression, to obtain the score of the regular expression of the text information in the associated domain.
In some embodiments, taking the weight of the preset feature in the associated domain as the weight of the extracted feature in the associated domain, in response to the extracted feature matching a preset feature of the associated domain in the preset database, comprises: filtering out, from the extracted features, the features that hit a preset filter vocabulary, to obtain filtered features; and, in response to a filtered feature matching a preset feature of the associated domain in the preset database, taking the weight of the preset feature in the associated domain as the weight of the filtered feature in the associated domain. Correspondingly, adding up, in the associated domain, the weights of those extracted features that hit the regular expression to obtain the score of the regular expression of the text information in the associated domain comprises: in the associated domain, adding up the weights of those filtered features that hit the regular expression, to obtain the score of the regular expression of the text information.
In some embodiments, determining, based on the weights of the extracted features in the associated domain, the score of a regular expression of the text information in the associated domain further comprises obtaining the regular expressions of the text information in the associated domain by the following steps: identifying the type labels of entity information from the extracted features; and, in response to the identified type labels matching the preset type labels carried by a regular expression of the associated domain in the preset database, using the regular expression carrying those preset type labels as a regular expression of the text information in the associated domain, where the preset database includes regular expressions carrying preset type labels in the multiple domains.
In some embodiments, identifying the type labels of entity information from the extracted features comprises: identifying, from the extracted features, the verbs and nouns of the entity information and the positional relationship between the verbs and the nouns. Correspondingly, using the regular expression carrying preset type labels as a regular expression of the text information in the associated domain, in response to the identified type labels matching the preset type labels carried by a regular expression of the associated domain in the preset database, comprises: in response to the identified verbs, nouns and verb-noun positional relationship matching the preset verbs, nouns and verb-noun positional relationship carried by a regular expression of the associated domain in the preset database, using the regular expression carrying the preset verbs, nouns and verb-noun positional relationship as a regular expression of the text information in the associated domain.
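The selection of candidate regular expressions by type label can be illustrated with a minimal Python sketch. The structure of `REGEX_DB`, the label names, and the rule that a regular expression qualifies when all of its preset labels are among the identified labels are assumptions made for illustration; the text does not specify how the match is evaluated.

```python
# Hypothetical preset database: per domain, regular expressions with preset type labels.
REGEX_DB = {
    "map": [
        {"pattern": r"go (?P<place>.+)", "labels": {"place"}},
        {"pattern": r"from (?P<origin>.+) to (?P<place>.+)", "labels": {"origin", "place"}},
    ],
    "music": [
        {"pattern": r"play (?P<song>.+)", "labels": {"song"}},
    ],
}


def candidate_expressions(identified_labels, domain, regex_db=REGEX_DB):
    """Return the patterns of the domain whose preset labels were all identified.

    Assumption: "matching" means the preset label set is a subset of the
    labels identified in the extracted features.
    """
    identified = set(identified_labels)
    return [
        entry["pattern"]
        for entry in regex_db.get(domain, [])
        if entry["labels"] <= identified
    ]


print(candidate_expressions(["place"], "map"))
```

With only a `place` label identified, the two-slot route pattern is excluded because its `origin` label was not found.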
In a second aspect, this application provides an apparatus for parsing the semantics of spoken-language text information, the apparatus comprising: a feature extraction module, configured to perform word segmentation on received spoken-language text information to extract features; a domain determination module, configured to determine the associated domain of the spoken-language text information from the nouns among the extracted features; a weight determination module, configured to, in response to an extracted feature matching a preset feature of the associated domain in a preset database, take the weight of the preset feature in the associated domain as the weight of the extracted feature in the associated domain, where the preset database includes the weights of preset features in multiple domains, and the multiple domains include the associated domain; a score determination module, configured to determine, based on the weights of the extracted features in the associated domain, the scores of the regular expressions of the text information in the associated domain; an expression acquisition module, configured to rank the scores and obtain a preset number of regular expressions according to the ranking result; and a parsed-text module, configured to use the obtained regular expressions as the parsed text of the spoken-language text information.
In some embodiments, the weights of the preset feature in the multiple domains used by the weight determination module are determined by the following modules: an occurrence frequency acquisition module, configured to divide the number of occurrences of the preset feature in each of the multiple domains by the total word count of the text information samples in which the preset feature appears, to obtain the frequency with which the preset feature occurs in each domain; an inverse document frequency acquisition module, configured to divide the number of text information samples in which the preset feature appears by the total number of text information samples, to obtain the inverse document frequency of the preset feature, where the text information samples in which the preset feature appears and the total text information samples are obtained from historical data of spoken-language text information whose semantics have already been parsed; and a weight acquisition module, configured to multiply the frequency with which the preset feature occurs in each domain by the inverse document frequency of the preset feature to obtain the weight of the preset feature in each domain, and to obtain, from the weights of the preset feature in each domain, the weights of the preset feature in the multiple domains.
In some embodiments, the score determination module includes: an addition submodule, configured to add up, in the associated domain, the weights of those extracted features that hit the regular expression, to obtain the score of the regular expression of the text information in the associated domain.
In some embodiments, the weight determination module includes: a feature filtering submodule, configured to filter out, from the extracted features, the features that hit a preset filter vocabulary, to obtain filtered features; and a weight determination submodule, configured to, in response to a filtered feature matching a preset feature of the associated domain in the preset database, take the weight of the preset feature in the associated domain as the weight of the filtered feature in the associated domain. Correspondingly, the addition submodule is configured to add up, in the associated domain, the weights of those filtered features that hit the regular expression, to obtain the score of the regular expression of the text information.
In some embodiments, the score determination module further includes an expression determination module, which comprises: a type label identification module, configured to identify the type labels of entity information from the extracted features; and an expression matching module, configured to, in response to the identified type labels matching the preset type labels carried by a regular expression of the associated domain in the preset database, use the regular expression carrying those preset type labels as a regular expression of the text information in the associated domain, where the preset database includes regular expressions carrying preset type labels in the multiple domains.
In some embodiments, the type label identification module is further configured to identify, from the extracted features, the verbs and nouns of the entity information and the positional relationship between the verbs and the nouns; and the expression matching module is further configured to, in response to the identified verbs, nouns and verb-noun positional relationship matching the preset verbs, nouns and verb-noun positional relationship carried by a regular expression of the associated domain in the preset database, use the regular expression carrying the preset verbs, nouns and verb-noun positional relationship as a regular expression of the text information in the associated domain.
The method and apparatus for parsing the semantics of spoken-language text information provided by this application perform word segmentation on received spoken-language text information to extract features; then determine the associated domain of the spoken-language text information from the nouns among the extracted features; then, in response to an extracted feature matching a preset feature of the associated domain in a preset database, take the weight of the preset feature in the associated domain as the weight of the extracted feature in the associated domain; then determine, based on the weights of the extracted features in the associated domain, the scores of the regular expressions of the text information in the associated domain; then rank the scores and obtain a preset number of regular expressions based on the ranking result; and finally use the obtained regular expressions as the parsed text of the spoken-language text information. In this method, the weight of an extracted feature in the associated domain represents the importance of that feature in the associated domain, and the parsed text of the spoken-language text information is obtained according to that importance, which improves the accuracy of the semantic parsing result.
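The final ranking step summarized above can be sketched in a few lines of Python. The expression names and score values are made up for illustration, and the use of `heapq.nlargest` is an implementation choice of this sketch rather than something the text prescribes.

```python
import heapq


def top_expressions(scores, preset_number):
    """Rank regular expressions by score and keep the preset number of best ones.

    `scores` maps a regular-expression identifier to its score in the
    associated domain; identifiers and values here are hypothetical.
    """
    best = heapq.nlargest(preset_number, scores.items(), key=lambda item: item[1])
    return [name for name, _ in best]


example_scores = {"navigate_to_place": 1.3, "search_poi": 0.9, "play_song": 0.0}
print(top_expressions(example_scores, 2))
```

Keeping more than one expression leaves room for a later stage to choose among the highest-scoring parses.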
Description of the drawings
Other features, objects and advantages of this application will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
Fig. 1 is a diagram of an exemplary system architecture to which this application can be applied;
Fig. 2 is a schematic flowchart of one embodiment of the method for parsing the semantics of spoken-language text information according to this application;
Fig. 3 is an exemplary structural diagram of one embodiment of the apparatus for parsing the semantics of spoken-language text information according to this application;
Fig. 4 is a schematic structural diagram of a computer system suitable for implementing the terminal device or server of the embodiments of this application.
Detailed description of the embodiments
This application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are used only to explain the relevant invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the invention.
It should be noted that, where no conflict arises, the embodiments of this application and the features in those embodiments may be combined with one another. This application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the method or apparatus for parsing the semantics of spoken-language text information of this application can be applied.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102 and 103, a network 104 and a server 105. The network 104 serves as the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links or fiber-optic cables.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104, for example to receive or send messages. Various client applications that support spoken-language speech recognition may be installed on the terminal devices 101, 102, 103, such as web browsers, shopping applications, search applications, instant messaging tools, email clients and social platform software.
The terminal devices 101, 102, 103 may be various electronic devices that have a display screen and support spoken-language speech recognition, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers and desktop computers.
The server 105 may be a server that provides various services, for example a background server that supports the client applications with spoken-language speech recognition on the terminal devices 101, 102, 103. The background server may analyze and otherwise process received data such as spoken voice signals, and feed the processing result (for example, a spoken-language semantic parsing result) back to the terminal device.
It should be noted that the method for parsing the semantics of spoken-language text information provided by the embodiments of this application is generally executed by the server 105; correspondingly, the apparatus for parsing the semantics of spoken-language text information is generally located in the server 105.
It should be understood that the numbers of terminal devices, networks and servers in Fig. 1 are merely illustrative. There may be any number of terminal devices, networks and servers, as the implementation requires.
With continued reference to Fig. 2, Fig. 2 shows a flow 200 of one embodiment of the method for parsing the semantics of spoken-language text information according to this application. The method comprises the following steps:
Step 201: perform word segmentation on received spoken-language text information to extract features.
In this embodiment, if the electronic device that receives the user's spoken voice signal has sufficient data processing capability itself, the method for parsing the semantics of spoken-language text information can run directly on that electronic device (for example, the terminal device or server shown in Fig. 1). If the electronic device that receives the user's spoken voice signal (for example, the terminal device shown in Fig. 1) does not have sufficient data processing capability, the received spoken voice signal can be transmitted to an electronic device with stronger processing capability (for example, the server shown in Fig. 1), which recognizes the spoken voice signal as spoken-language text information and then runs the method for parsing the semantics of spoken-language text information. The spoken-language text information is obtained by recognizing the spoken voice signal. The method for recognizing the spoken voice signal may be any existing or future method for recognizing spoken voice signals; this application places no limitation on it. The wireless connection used for transmission may include, but is not limited to, a 3G/4G connection, a WiFi connection, a Bluetooth connection, a WiMAX connection, a Zigbee connection, a UWB (ultra wideband) connection, and other wireless connections that are currently known or developed in the future.
In this embodiment, performing word segmentation on the received spoken-language text information means cutting the spoken-language text information into multiple individual words. The segmentation method may be any currently known or future word segmentation method; this application places no limitation on it. For example, existing segmentation algorithms fall into three broad categories: methods based on string matching, methods based on understanding, and methods based on statistics. Depending on whether segmentation is combined with part-of-speech tagging, they can also be divided into pure segmentation methods and integrated methods that combine segmentation with tagging. Taking a string-matching method as an example, the spoken-language text information can be matched, according to a certain strategy, against the entries of a sufficiently large machine dictionary; if a character string is found in the dictionary, the match succeeds, and multiple individual words are obtained in this way.
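As one concrete instance of dictionary-based string matching, the following Python sketch uses forward maximum matching. The choice of this particular strategy, the tiny dictionary, and the Chinese rendering of the sample sentence (a guess at the original behind the translated example "I want to go to the Baidu Building") are all assumptions made for illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment `text` by greedily taking the longest dictionary entry at each position.

    Characters that start no dictionary entry are emitted as single-character words.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical machine dictionary.
DICTIONARY = {"我", "要", "去", "百度", "大厦", "百度大厦"}
print(forward_max_match("我要去百度大厦", DICTIONARY))
```

Because the longest entry wins, "百度大厦" is kept as one word instead of being split into "百度" and "大厦".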
After the received spoken-language text information has been segmented, feature extraction can be performed on the resulting individual words to obtain the extracted features. Here, a feature is a basic unit used to represent text, and features generally have the following properties: a feature genuinely identifies the content of the text; a feature can distinguish the target text from other texts; the number of features is not too large; and features are relatively easy to separate. In Chinese text, characters, words or phrases can be used as the features that represent the text. Words are more expressive than characters, and words are much easier to segment than phrases. Therefore, most Chinese text classification systems currently use words as features, called feature words. These feature words serve as an intermediate representation of a document and are used to compute the similarity between documents and between a document and the user's target. If all words were used as features, the dimension of the feature vector would be too large and the amount of computation too great, so feature extraction is needed. The main function of feature extraction is to reduce the number of words to be processed as far as possible without damaging the core information of the text, thereby reducing the dimension of the vector space, simplifying computation, and improving the speed and efficiency of text processing. It should be understood that features may be extracted using any existing or future feature extraction method; this application places no limitation on this. Taking the prior art as an example, feature extraction can be done in at least the following ways: transforming the original features into fewer new features by mapping or transformation; picking out the most representative features from the original features; selecting the most influential features according to expert knowledge; and selecting features mathematically, finding the features that carry the most classification information. In a specific application, the score of each feature can be calculated with some feature evaluation function, the features can then be ranked by score, and the several highest-scoring features can be chosen as the extracted features.
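The last approach, scoring each candidate with an evaluation function and keeping the best ones, might look like the Python sketch below. The evaluation function is left abstract in the text, so the lookup table of scores used here is purely hypothetical.

```python
def select_top_features(words, score_fn, k):
    """Score each distinct word with `score_fn` and keep the k highest-scoring ones."""
    distinct = list(dict.fromkeys(words))  # drop duplicates, keep first-seen order
    ranked = sorted(distinct, key=score_fn, reverse=True)
    return ranked[:k]


# Hypothetical scores standing in for a real feature evaluation function.
EXAMPLE_SCORES = {"I": 0.1, "want": 0.2, "go": 0.5, "Baidu Building": 0.9}

features = select_top_features(
    ["I", "want", "go", "Baidu Building"],
    lambda word: EXAMPLE_SCORES.get(word, 0.0),
    k=2,
)
print(features)
```

Choosing `k` controls the dimension of the feature vector, which is the trade-off between information retained and computation that the paragraph above describes.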
As an example, the short sentence "I want to go to the Baidu Building" can be segmented to extract features. Segmentation yields the words "I", "want", "go" and "Baidu Building", giving the four features "I", "want", "go" and "Baidu Building".
Step 202: determine the associated domain of the spoken-language text information from the nouns among the extracted features.
In this embodiment, the associated domain of the spoken-language text information can be determined based on the nouns among the features extracted in step 201. For example, based on the noun "Baidu Building" above, the associated domain of "I want to go to the Baidu Building" can be determined to be the map domain.
In this embodiment, the associated domain of the spoken-language text information may include, but is not limited to, one or more of the following: the music domain, the map domain, the address book domain, the TV program domain, the movie information domain and the TV command domain.
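A minimal Python sketch of the domain determination follows. The text does not say how nouns are mapped to domains, so the noun-to-domain table and the domain names are assumptions; a real system would likely combine part-of-speech tagging with a much larger lexicon.

```python
# Hypothetical lookup from known nouns to the domain they indicate.
NOUN_TO_DOMAIN = {
    "Baidu Building": "map",
    "song": "music",
    "contact": "address_book",
}


def determine_domains(features, noun_to_domain=NOUN_TO_DOMAIN):
    """Return the associated domains indicated by the nouns among the features."""
    domains = []
    for feature in features:
        domain = noun_to_domain.get(feature)
        if domain is not None and domain not in domains:
            domains.append(domain)
    return domains


print(determine_domains(["I", "want", "go", "Baidu Building"]))
```

Returning a list reflects the statement that the associated domain may comprise one or more domains.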
Step 203: in response to an extracted feature matching a preset feature of the associated domain in the preset database, take the weight of the preset feature in the associated domain as the weight of the extracted feature in the associated domain.
In this embodiment, the preset database may include, but is not limited to, the weights of preset features in multiple domains, where the multiple domains may include, but are not limited to, the associated domain.
When matching the extracted features against the preset features in the preset database, the electronic device running the method can first locate the associated domain in the preset database, which narrows the matching range and improves matching efficiency. It then matches the extracted features one by one against the preset features of the associated domain in the preset database. If an extracted feature matches a preset feature of the associated domain in the preset database, the weight of that preset feature in the associated domain is taken as the weight of the extracted feature in the associated domain.
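The domain-first lookup described above can be sketched with a nested dictionary in Python. The layout of `PRESET_DB` and the weight values are assumptions made for illustration; the text does not specify how the preset database is stored.

```python
# Hypothetical preset database: domain -> preset feature -> weight in that domain.
PRESET_DB = {
    "map": {"go": 0.4, "Baidu Building": 0.9},
    "music": {"play": 0.7, "song": 0.8},
}


def lookup_weights(features, domain, preset_db=PRESET_DB):
    """Select the associated domain first, then match features one by one."""
    domain_weights = preset_db.get(domain, {})  # narrows the matching range
    return {
        feature: domain_weights[feature]
        for feature in features
        if feature in domain_weights
    }


print(lookup_weights(["I", "want", "go", "Baidu Building"], "map"))
```

Features with no counterpart in the associated domain simply receive no weight in this sketch.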
In some optional implementations of this embodiment, the weights of a preset feature in the multiple domains in the preset database are determined by the following processing: dividing the number of occurrences of the preset feature in each of the multiple domains by the total word count of the text information samples in which the preset feature appears, to obtain the frequency with which the preset feature occurs in each domain; dividing the number of text information samples in which the preset feature appears by the total number of text information samples, to obtain the inverse document frequency of the preset feature, where the text information samples in which the preset feature appears and the total text information samples are obtained from historical data of spoken-language text information whose semantics have already been parsed; and multiplying the frequency with which the preset feature occurs in each domain by the inverse document frequency of the preset feature to obtain the weight of the preset feature in each domain, and obtaining, from the weights of the preset feature in each domain, the weights of the preset feature in the multiple domains.
In this implementation, the weights of a preset feature in the multiple domains are determined by computing the term frequency-inverse document frequency (TF-IDF). TF is the frequency with which the preset feature occurs in the text information samples of each domain; it can be obtained by dividing the number of occurrences of the preset feature by the total word count of the text information samples in which the preset feature appears, so the more often the preset feature occurs in a document, the larger its TF value. IDF is the inverse document frequency; it is obtained from the ratio between the number of text information samples in which the preset feature appears and the total number of text information samples, and across the multiple domains, the fewer text information samples contain the preset feature, the larger its IDF value. The product of TF and IDF is the weight of the preset feature, that is, its weight in each of the multiple domains. For example, the weight of the preset feature "match" in the short sentence "Please help me check the schedule of the Warriors' match tomorrow" is greater than its weight in the short sentence "Remind me to watch the match tomorrow"; in other words, the weight of the feature "match" in the sports event domain is greater than its weight in the reminder domain.
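The weight computation can be sketched in Python as follows. Two points are assumptions of this sketch rather than statements of the text: the IDF is computed in the conventional form log(total samples / samples containing the feature), which agrees with the remark that rarer features get a larger IDF, and the sample sentences are invented stand-ins for historical parsed data.

```python
import math


def tf_idf_weights(feature, samples_by_domain):
    """Return {domain: weight} for one preset feature.

    `samples_by_domain` maps a domain to a list of samples, each a list of words.
    """
    all_samples = [s for samples in samples_by_domain.values() for s in samples]
    containing = [s for s in all_samples if feature in s]
    if not containing:
        return {}
    # Assumed conventional IDF: larger when fewer samples contain the feature.
    idf = math.log(len(all_samples) / len(containing))
    weights = {}
    for domain, samples in samples_by_domain.items():
        hits = [s for s in samples if feature in s]
        occurrences = sum(s.count(feature) for s in hits)
        total_words = sum(len(s) for s in hits)
        tf = occurrences / total_words if total_words else 0.0
        weights[domain] = tf * idf
    return weights


# Invented historical samples, already segmented into words.
SAMPLES = {
    "sports": [["check", "warriors", "match", "schedule"], ["match", "score"]],
    "reminder": [["remind", "me", "to", "watch", "the", "match", "tomorrow"],
                 ["remind", "me", "to", "call", "mom"]],
    "music": [["play", "a", "song"]],
}
weights = tf_idf_weights("match", SAMPLES)
print(weights["sports"] > weights["reminder"])
```

In this toy data "match" makes up a larger share of the words in the sports samples than in the reminder samples, so its sports weight comes out higher, mirroring the example in the text.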
In some optional implementations of this embodiment, in response to the extracted features matching preset features of the associated field in the preset database, determining the weighted values of the preset features in the associated field as the weighted values of the extracted features in the associated field may include, but is not limited to: filtering out, from the extracted features, the features that hit a preset filter vocabulary, to obtain filtered features; and, in response to the filtered features matching preset features of the associated field in the preset database, determining the weighted values of the preset features in the associated field as the weighted values of the filtered features in the associated field.
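The filter-then-look-up step can be illustrated as follows. This is a sketch only: `lookup_weights`, the nested-dictionary layout of the preset database, and the weight values are assumptions made for illustration.

```python
def lookup_weights(features, field, preset_weights, filter_words=frozenset()):
    """Drop features in the filter vocabulary, then copy preset weights.

    preset_weights: {field: {feature: weight}} (assumed database layout).
    """
    kept = [f for f in features if f not in filter_words]
    field_table = preset_weights.get(field, {})
    # Only features that match a preset feature of the field get a weight.
    return {f: field_table[f] for f in kept if f in field_table}

# Hypothetical preset database and filter vocabulary.
preset_weights = {"weather": {"weather": 0.5, "how": 0.25, "please": 0.01}}
result = lookup_weights(
    ["please", "weather", "how", "today"], "weather", preset_weights, {"please"}
)
```

Here "please" is removed by the filter vocabulary and "today" is dropped because it matches no preset feature of the field.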
Step 204: based on the weighted values of the extracted features in the associated field, determine the scores of the regular expressions of the text information in the associated field.
In this embodiment, after the weighted values of the preset features in the associated field have been determined, in response to the extracted features matching the preset features of the associated field in the preset database, as the weighted values of the extracted features in the associated field, the scores of the regular expressions of the text information in the associated field can be determined based on those weighted values.
In some optional implementations of this embodiment, determining the scores of the regular expressions of the text information in the associated field, based on the weighted values of the extracted features in the associated field, may include, but is not limited to: in the associated field, adding the weighted values of the extracted features that hit a regular expression, to obtain the score of that regular expression of the text information in the associated field.
In this implementation, the score of a regular expression rule is the sum of the weighted values of the extracted features that hit it, namely:

$$\mathrm{Weight}_{\mathrm{Rule}} = \sum_{i=1}^{n} \mathrm{Weight}_{\mathrm{Feature}\,i}$$

where $\mathrm{Weight}_{\mathrm{Rule}}$ denotes the weighted value of the regular expression rule, $\mathrm{Weight}_{\mathrm{Feature}\,i}$ denotes the weighted value of the i-th feature, i ranges from 1 to n, and n is the number of extracted features that hit the regular expression.
Taking the weather field as an example, the weighted values of the different features in the weather field are approximately as shown in the table:
The regular expressions of the weather field are as follows:
First regular expression: (weather) (how | good or not | how)?
Second regular expression: (temperature) (how many)? (degree)?
Then the short sentence "how is weather" matches the first regular expression, (weather) (how | good or not | how)?, so the score of that rule is:
0.0328802 + 0.00745463 = 0.04033483.
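The worked example above can be reproduced with a short Python sketch. Several points are assumptions made for illustration: the function name `score_rules`, the English stand-in patterns (the originals are Chinese), the rule that a feature "hits" a regular expression when it occurs inside the matched substring, and the assignment of the two weights to the features "weather" and "how" (the source table is not available here).

```python
import re

def score_rules(text, feature_weights, rules):
    """Score each matching rule as the sum of the weights of features that hit it."""
    scores = {}
    for name, pattern in rules.items():
        match = re.search(pattern, text)
        if match is None:
            continue  # rules that do not match the text receive no score
        span = match.group(0)
        # Assumption: a feature hits the rule if it appears in the matched text.
        scores[name] = sum(w for f, w in feature_weights.items() if f in span)
    return scores

# English stand-ins for the two weather rules.
rules = {
    "rule1": r"(weather) ?(how|good or not)?",
    "rule2": r"(temperature) ?(how many)? ?(degree)?",
}
# Assumed assignment of the two weights used in the worked example.
feature_weights = {"weather": 0.0328802, "how": 0.00745463}
scores = score_rules("weather how", feature_weights, rules)
```

Only the first rule matches the sentence, and its score is the sum of the two feature weights.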
In some optional implementations of this embodiment, corresponding to the above filtering out, from the extracted features, of the features that hit the preset filter vocabulary to obtain filtered features, and determining, in response to the filtered features matching preset features of the associated field in the preset database, the weighted values of the preset features in the associated field as the weighted values of the filtered features in the associated field, the above adding, in the associated field, of the weighted values of the extracted features that hit a regular expression to obtain the score of the regular expression of the text information in the associated field may include, but is not limited to: in the associated field, adding the weighted values of the filtered features that hit the regular expression, to obtain the score of the regular expression of the text information.
In some optional implementations of this embodiment, determining the scores of the regular expressions of the text information in the associated field, based on the weighted values of the extracted features in the associated field, may further include, but is not limited to, obtaining the regular expressions of the text information in the associated field by the following steps: identifying type labels of entity information from the extracted features; and, in response to an identified type label matching a preset type label of a regular expression of the associated field in the preset database, taking the regular expression having that preset type label as a regular expression of the text information in the associated field, where the preset database may include, but is not limited to, regular expressions having preset type labels in the multiple fields.
Identifying the type labels of entity information from the extracted features may include, but is not limited to: identifying, from the extracted features, the verbs and nouns of the entity information and the positional relationship between the verbs and the nouns. Correspondingly, taking the regular expression having the preset type label as a regular expression of the text information in the associated field, in response to the identified type label matching a preset type label of a regular expression of the associated field in the preset database, may include, but is not limited to: in response to the identified verbs, nouns, and positional relationship between the verbs and nouns matching the preset verbs, nouns, and positional relationship between verbs and nouns of a regular expression of the associated field in the preset database, taking the regular expression having those preset verbs, nouns, and positional relationship as a regular expression of the text information in the associated field.
Identifying the type labels of entity information from the extracted features can be implemented with recognition methods known in the prior art or developed in the future; this application does not limit this. For example, a conditional random field (CRF) algorithm can be used to identify the type labels of entity information from the extracted features.
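The label-based selection of candidate regular expressions might look like the sketch below. It does not implement a CRF: a plain dictionary lookup stands in for the entity recognizer, and the names `select_rules`, `entity_lexicon`, and `labelled_rules`, as well as the patterns and labels, are hypothetical.

```python
def select_rules(features, entity_lexicon, labelled_rules):
    """Keep the rules whose preset type label matches a recognized label.

    entity_lexicon: {feature: type_label}, a stand-in for a trained recognizer.
    labelled_rules: list of (pattern, preset_type_label) pairs for one field.
    """
    labels = {entity_lexicon[f] for f in features if f in entity_lexicon}
    return [pattern for pattern, label in labelled_rules if label in labels]

# Hypothetical lexicon and labelled rules for a weather field.
entity_lexicon = {"Beijing": "city", "tomorrow": "date"}
labelled_rules = [
    (r"(weather) in (\w+)", "city"),
    (r"(remind me) at (\d+)", "time"),
]
candidates = select_rules(["weather", "Beijing"], entity_lexicon, labelled_rules)
```

In a fuller system the lexicon lookup would be replaced by whatever recognizer is chosen, with the matching step left unchanged.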
Step 205: sort the scores, and obtain a preset quantity of regular expressions according to the sorting result.
In this embodiment, the scores of the regular expressions of the text information in the associated field, as determined in step 204, can be sorted. The preset quantity may be one or more, and the number of regular expressions to obtain can be set by the user or by the developers. For example, the preset quantity can be set to three, in which case, after the scores are sorted from high to low, the three highest-ranked regular expressions are obtained; the preset quantity can also be set to one, in which case, after the scores are sorted from high to low, only the highest-ranked regular expression is obtained.
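Sorting the scores and keeping a preset quantity of rules is a small operation; a minimal sketch follows, with `top_rules` and the score values being hypothetical.

```python
def top_rules(scores, preset_quantity=1):
    """Return the names of the highest-scoring rules, best first."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:preset_quantity]]

# Hypothetical rule scores.
scores = {"rule_a": 0.1, "rule_b": 0.4, "rule_c": 0.2}
best_three = top_rules(scores, 3)
best_one = top_rules(scores)
```

Because Python's sort is stable, rules with equal scores keep their original relative order.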
Step 206: take the obtained regular expressions as the parsed text of the spoken language text information.
In this embodiment, the preset quantity of regular expressions obtained according to the sorting result in step 205 can be taken as the parsed text of the spoken language text information. For example, the three highest-ranked regular expressions obtained above, or the single highest-ranked regular expression, are taken as the parsed text of the spoken language text information.
When multiple regular expressions are obtained, the parsed text of the obtained regular expressions can further be presented to the user for selection, which improves parsing accuracy and the user experience.
The method provided by the above embodiment of this application segments received spoken language text information into words to extract features; determines the associated field of the spoken language text information from the nouns in the extracted features; in response to the extracted features matching preset features of the associated field in the preset database, determines the weighted values of the preset features in the associated field as the weighted values of the extracted features in the associated field; determines, based on those weighted values, the scores of the regular expressions of the text information in the associated field; sorts the scores and obtains a preset quantity of regular expressions based on the sorting result; and finally takes the obtained regular expressions as the parsed text of the spoken language text information. This improves the accuracy of the semantic parsing result obtained.
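Chained together, the steps summarized above can be sketched end to end. Everything concrete here is an assumption for illustration: whitespace splitting stands in for word segmentation, `NOUN_FIELDS`, `PRESET_WEIGHTS`, and `FIELD_RULES` are hypothetical tables, the weights are made up, and a feature is taken to hit a rule when it occurs in the matched substring.

```python
import re

NOUN_FIELDS = {"weather": "weather", "temperature": "weather", "match": "sports"}
PRESET_WEIGHTS = {"weather": {"weather": 0.5, "how": 0.1, "temperature": 0.4}}
FIELD_RULES = {
    "weather": [
        r"(weather) ?(how|good or not)?",
        r"(temperature) ?(how many)? ?(degree)?",
    ]
}

def parse(text, preset_quantity=1):
    """Return the top-scoring rule patterns for the text, best first."""
    features = text.split()  # stand-in for word segmentation
    # Determine the associated field from the first known noun.
    field = next((NOUN_FIELDS[f] for f in features if f in NOUN_FIELDS), None)
    if field is None:
        return []
    table = PRESET_WEIGHTS.get(field, {})
    weights = {f: table[f] for f in features if f in table}
    scored = []
    for pattern in FIELD_RULES.get(field, []):
        match = re.search(pattern, text)
        if match:
            score = sum(w for f, w in weights.items() if f in match.group(0))
            scored.append((score, pattern))
    scored.sort(key=lambda item: item[0], reverse=True)
    return [pattern for _, pattern in scored[:preset_quantity]]

parsed = parse("weather how")
```

The tables are kept as module-level constants only to keep the sketch short; in practice they would come from the preset database.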
With further reference to Fig. 3, as an implementation of the methods shown in the above figures, this application provides an embodiment of a device for parsing the semantics of spoken language text information. This device embodiment corresponds to the method embodiment shown in Fig. 2, and the device can be applied in various electronic devices.
As shown in Fig. 3, the device 300 for parsing the semantics of spoken language text information described in this embodiment may include, but is not limited to: a feature extraction module 310, a field determination module 320, a weight determination module 330, a score determination module 340, an expression acquisition module 350, and a parsing text module 360.
Here, the feature extraction module 310 is configured to segment received spoken language text information into words to extract features; the field determination module 320 is configured to determine the associated field of the spoken language text information from the nouns in the extracted features; the weight determination module 330 is configured to, in response to the extracted features matching preset features of the associated field in the preset database, determine the weighted values of the preset features in the associated field as the weighted values of the extracted features in the associated field, where the preset database may include, but is not limited to, the weighted values of preset features in multiple fields, and the multiple fields may include, but are not limited to, the associated field; the score determination module 340 is configured to determine, based on the weighted values of the extracted features in the associated field, the scores of the regular expressions of the text information in the associated field; the expression acquisition module 350 is configured to sort the scores and obtain a preset quantity of regular expressions according to the sorting result; and the parsing text module 360 is configured to take the obtained regular expressions as the parsed text of the spoken language text information.
In some optional implementations of this embodiment, the weighted values of the preset features in multiple fields used by the weight determination module 330 are determined by the following modules (not shown in the figure): an occurrence frequency acquisition module, an inverse document frequency acquisition module, and a weighted value acquisition module. The occurrence frequency acquisition module is configured to divide the number of times a preset feature occurs in each of the multiple fields by the total word count of the text information samples in which the preset feature occurs, to obtain the frequency with which the preset feature occurs in each field. The inverse document frequency acquisition module is configured to divide the number of text information samples in which the preset feature occurs by the total number of text information samples, to obtain the inverse document frequency of the preset feature, where the text information samples in which the preset feature occurs and the total text information samples are obtained from historical data of spoken language text information whose semantics have already been parsed. The weighted value acquisition module is configured to multiply the frequency with which the preset feature occurs in each field by the inverse document frequency of the preset feature to obtain the weighted value of the preset feature in that field, and to obtain the weighted values of the preset feature in the multiple fields from its weighted values in the individual fields.
In some optional implementations of this embodiment, the score determination module 340 may include, but is not limited to (not shown in the figure): an addition submodule, configured to add, in the associated field, the weighted values of the extracted features that hit a regular expression, to obtain the score of that regular expression of the text information in the associated field.
In some optional implementations of this embodiment, the weight determination module 330 may include, but is not limited to (not shown in the figure): a feature filtering submodule, configured to filter out, from the extracted features, the features that hit a preset filter vocabulary, to obtain filtered features; and a weight determination submodule, configured to, in response to the filtered features matching preset features of the associated field in the preset database, determine the weighted values of the preset features in the associated field as the weighted values of the filtered features in the associated field. The addition submodule can be further configured to: in the associated field, add the weighted values of the filtered features that hit a regular expression, to obtain the score of the regular expression of the text information.
In some optional implementations of this embodiment, the score determination module 340 may further include, but is not limited to (not shown in the figure), an expression determination module, which may include, but is not limited to: a type label identification module, configured to identify type labels of entity information from the extracted features; and an expression matching module, configured to, in response to an identified type label matching a preset type label of a regular expression of the associated field in the preset database, take the regular expression having that preset type label as a regular expression of the text information in the associated field, where the preset database may include, but is not limited to, regular expressions having preset type labels in the multiple fields.
In some optional implementations of this embodiment, the type label identification module is further configured to identify, from the extracted features, the verbs and nouns in the entity information and the positional relationship between the verbs and the nouns; and the expression matching module is further configured to, in response to the identified verbs, nouns, and positional relationship between the verbs and nouns matching the preset verbs, nouns, and positional relationship between verbs and nouns of a regular expression of the associated field in the preset database, take the regular expression having those preset verbs, nouns, and positional relationship as a regular expression of the text information in the associated field.
Those skilled in the art will understand that the above device 300 for parsing the semantics of spoken language text information also includes other well-known structures, such as a processor and a memory.
It should be understood that the modules recorded in the device 300 correspond to the steps of the method described with reference to Fig. 2. Accordingly, the operations and features described above for the method for parsing the semantics of spoken language text information also apply to the device 300 and the modules it contains, and are not repeated here. The corresponding modules in the device 300 can cooperate with modules in a terminal device and/or a server to implement the solutions of the embodiments of this application.
Referring now to Fig. 4, it shows a schematic structural diagram of a computer system 400 suitable for implementing a terminal device or server of the embodiments of this application.
As shown in Fig. 4, the computer system 400 includes a central processing unit (CPU) 401, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 402 or a program loaded from a storage section 408 into a random access memory (RAM) 403. The RAM 403 also stores the various programs and data required for the operation of the system 400. The CPU 401, the ROM 402, and the RAM 403 are connected to one another through a bus 404. An input/output (I/O) interface 405 is also connected to the bus 404.
The following components are connected to the I/O interface 405: an input section 406 including a keyboard, a mouse, and the like; an output section 407 including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like, as well as a loudspeaker and the like; a storage section 408 including a hard disk and the like; and a communication section 409 including a network interface card such as a LAN card or a modem. The communication section 409 performs communication processing via a network such as the Internet. A drive 410 is also connected to the I/O interface 405 as needed. A removable medium 411, such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 410 as needed, so that a computer program read from it can be installed into the storage section 408 as needed.
In particular, according to embodiments of the present disclosure, the process described above with reference to the flowchart may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 409, and/or installed from the removable medium 411.
The flowcharts and block diagrams in the drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of this application. In this regard, each box in a flowchart or block diagram may represent a module, a program segment, or a part of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the boxes may occur in an order different from that marked in the drawings. For example, two boxes shown in succession may in fact be executed substantially in parallel, and they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each box in the block diagrams and/or flowcharts, and combinations of boxes in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The modules described in the embodiments of this application may be implemented in software or in hardware. The described modules may also be provided in a processor; for example, it may be described as: a processor including a feature extraction module, a field determination module, a weight determination module, a score determination module, an expression acquisition module, and a parsing text module. The names of these modules do not, under certain circumstances, limit the modules themselves; for example, the feature extraction module may also be described as "a module that segments received spoken language text information to extract features".
As another aspect, this application also provides a non-volatile computer storage medium. The non-volatile computer storage medium may be the one included in the device described in the above embodiments, or it may exist separately without being assembled into a terminal. The non-volatile computer storage medium stores one or more programs which, when executed by a device, cause the device to: segment received spoken language text information into words to extract features; determine the associated field of the spoken language text information from the nouns in the extracted features; in response to the extracted features matching preset features of the associated field in the preset database, determine the weighted values of the preset features in the associated field as the weighted values of the extracted features in the associated field, where the preset database may include, but is not limited to, the weighted values of preset features in multiple fields, and the multiple fields may include, but are not limited to, the associated field; determine, based on the weighted values of the extracted features in the associated field, the scores of the regular expressions of the text information in the associated field; sort the scores and obtain a preset quantity of regular expressions according to the sorting result; and take the obtained regular expressions as the parsed text of the spoken language text information.
The above description covers only the preferred embodiments of this application and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in this application is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in this application.

Claims (12)

1. A method for parsing the semantics of spoken language text information, comprising:
segmenting received spoken language text information into words to extract features;
determining an associated field of the spoken language text information from the nouns in the extracted features;
in response to the extracted features matching preset features of the associated field in a preset database, determining the weighted values of the preset features in the associated field as the weighted values of the extracted features in the associated field, wherein the preset database includes weighted values of preset features in multiple fields, and the multiple fields include the associated field;
determining, based on the weighted values of the extracted features in the associated field, scores of regular expressions of the text information in the associated field;
sorting the scores, and obtaining a preset quantity of regular expressions according to the sorting result; and
taking the obtained regular expressions as the parsed text of the spoken language text information.
2. The method according to claim 1, wherein the weighted values of the preset features in the multiple fields are determined by the following processing:
dividing the number of times a preset feature occurs in each of the multiple fields by the total word count of the text information samples in which the preset feature occurs, to obtain the frequency with which the preset feature occurs in each field;
dividing the number of text information samples in which the preset feature occurs by the total number of text information samples, to obtain the inverse document frequency of the preset feature, wherein the text information samples in which the preset feature occurs and the total text information samples are obtained from historical data of spoken language text information whose semantics have already been parsed; and
multiplying the frequency with which the preset feature occurs in each field by the inverse document frequency of the preset feature to obtain the weighted value of the preset feature in that field, and obtaining the weighted values of the preset feature in the multiple fields from its weighted values in the individual fields.
3. The method according to any one of claims 1 or 2, wherein determining, based on the weighted values of the extracted features in the associated field, the scores of the regular expressions of the text information in the associated field comprises:
in the associated field, adding the weighted values of the extracted features that hit a regular expression, to obtain the score of that regular expression of the text information in the associated field.
4. The method according to claim 3, wherein, in response to the extracted features matching preset features of the associated field in the preset database, determining the weighted values of the preset features in the associated field as the weighted values of the extracted features in the associated field comprises: filtering out, from the extracted features, the features that hit a preset filter vocabulary, to obtain filtered features; and, in response to the filtered features matching preset features of the associated field in the preset database, determining the weighted values of the preset features in the associated field as the weighted values of the filtered features in the associated field; and
in the associated field, adding the weighted values of the extracted features that hit a regular expression, to obtain the score of that regular expression of the text information in the associated field comprises: in the associated field, adding the weighted values of the filtered features that hit the regular expression, to obtain the score of the regular expression of the text information.
5. The method according to claim 4, wherein determining, based on the weighted values of the extracted features in the associated field, the scores of the regular expressions of the text information in the associated field further comprises:
obtaining the regular expressions of the text information in the associated field by the following steps:
identifying type labels of entity information from the extracted features; and
in response to an identified type label matching a preset type label of a regular expression of the associated field in the preset database, taking the regular expression having that preset type label as a regular expression of the text information in the associated field, wherein the preset database includes regular expressions having preset type labels in the multiple fields.
6. The method according to claim 5, wherein identifying the type labels of entity information from the extracted features comprises: identifying, from the extracted features, the verbs and nouns of the entity information and the positional relationship between the verbs and the nouns; and
in response to the identified type label matching a preset type label of a regular expression of the associated field in the preset database, taking the regular expression having that preset type label as a regular expression of the text information in the associated field comprises: in response to the identified verbs, nouns, and positional relationship between the verbs and nouns matching the preset verbs, nouns, and positional relationship between verbs and nouns of a regular expression of the associated field in the preset database, taking the regular expression having those preset verbs, nouns, and positional relationship as a regular expression of the text information in the associated field.
7. A device for parsing the semantics of spoken language text information, comprising:
a feature extraction module, configured to segment received spoken language text information into words to extract features;
a field determination module, configured to determine an associated field of the spoken language text information from the nouns in the extracted features;
a weight determination module, configured to, in response to the extracted features matching preset features of the associated field in a preset database, determine the weighted values of the preset features in the associated field as the weighted values of the extracted features in the associated field, wherein the preset database includes weighted values of preset features in multiple fields, and the multiple fields include the associated field;
a score determination module, configured to determine, based on the weighted values of the extracted features in the associated field, scores of regular expressions of the text information in the associated field;
an expression acquisition module, configured to sort the scores and obtain a preset quantity of regular expressions according to the sorting result; and
a parsing text module, configured to take the obtained regular expressions as the parsed text of the spoken language text information.
8. The device according to claim 7, wherein the weighted values of the preset features in the multiple fields used by the weight determination module are determined by the following modules:
an occurrence frequency acquisition module, configured to divide the number of times a preset feature occurs in each of the multiple fields by the total word count of the text information samples in which the preset feature occurs, to obtain the frequency with which the preset feature occurs in each field;
an inverse document frequency acquisition module, configured to divide the number of text information samples in which the preset feature occurs by the total number of text information samples, to obtain the inverse document frequency of the preset feature, wherein the text information samples in which the preset feature occurs and the total text information samples are obtained from historical data of spoken language text information whose semantics have already been parsed; and
a weighted value acquisition module, configured to multiply the frequency with which the preset feature occurs in each field by the inverse document frequency of the preset feature to obtain the weighted value of the preset feature in that field, and to obtain the weighted values of the preset feature in the multiple fields from its weighted values in the individual fields.
9. The device according to any one of claims 7 or 8, wherein the score determination module comprises:
an addition submodule, configured to add, in the associated field, the weighted values of the extracted features that hit a regular expression, to obtain the score of that regular expression of the text information in the associated field.
10. The device according to claim 9, wherein the weight determination module comprises: a feature filtering submodule, configured to filter out, from the extracted features, the features that hit a preset filter vocabulary, to obtain filtered features; and a weight determination submodule, configured to, in response to the filtered features matching preset features of the associated field in the preset database, determine the weighted values of the preset features in the associated field as the weighted values of the filtered features in the associated field; and
the addition submodule is configured to: in the associated field, add the weighted values of the filtered features that hit a regular expression, to obtain the score of the regular expression of the text information.
11. The device according to claim 10, characterized in that the score determining module further comprises:
An expression determining module, comprising:
A type label identification module, configured to identify type labels of entity information from the extracted features;
An expression matching module, configured to, in response to an identified type label matching a preset type label carried by a regular expression of the associated field in the preset database, take the regular expression carrying the preset type label as the regular expression of the text information in the associated field, wherein the preset database includes the regular expressions carrying preset type labels in the multiple fields.
12. The device according to claim 11, characterized in that the type label identification module is further configured to: identify, from the extracted features, the verbs and nouns in the entity information and the positional relationship between the verbs and the nouns; and
The expression matching module is further configured to: in response to the identified verbs, nouns, and verb-noun positional relationship matching the preset verbs, nouns, and verb-noun positional relationship carried by a regular expression of the associated field in the preset database, take the regular expression carrying the preset verbs, nouns, and verb-noun positional relationship as the regular expression of the text information in the associated field.
CN201510977813.0A 2015-12-23 2015-12-23 Parse the semantic method and apparatus of spoken language text information Active CN105786793B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510977813.0A CN105786793B (en) 2015-12-23 2015-12-23 Parse the semantic method and apparatus of spoken language text information

Publications (2)

Publication Number Publication Date
CN105786793A (en) 2016-07-20
CN105786793B (en) 2019-05-28

Family

ID=56390284

Family Applications (1)

Application Number Status Publication Number Priority Date Filing Date Title
CN201510977813.0A Active CN105786793B (en) 2015-12-23 2015-12-23 Parse the semantic method and apparatus of spoken language text information

Country Status (1)

Country Link
CN (1) CN105786793B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106547742B (en) * 2016-11-30 2019-05-03 百度在线网络技术(北京)有限公司 Semantic parsing result treating method and apparatus based on artificial intelligence
CN106649278B (en) * 2016-12-30 2019-11-15 三星电子(中国)研发中心 Extend the method and system of spoken dialogue system corpus
CN109388796B (en) * 2017-08-11 2023-04-18 北京国双科技有限公司 Method and device for pushing referee document
CN107705784B (en) * 2017-09-28 2020-09-29 百度在线网络技术(北京)有限公司 Text regularization model training method and device, and text regularization method and device
CN108197105B (en) * 2017-12-28 2021-08-24 Oppo广东移动通信有限公司 Natural language processing method, device, storage medium and electronic equipment
CN109388700A (en) * 2018-10-26 2019-02-26 广东小天才科技有限公司 A kind of intension recognizing method and system
CN109800423A (en) * 2018-12-21 2019-05-24 广州供电局有限公司 Method and apparatus are determined based on the power-off event of power failure plan sentence
CN111401057B (en) * 2018-12-29 2023-11-14 深圳Tcl新技术有限公司 Semantic analysis method, storage medium and terminal equipment
CN109783821B (en) * 2019-01-18 2023-06-27 广东小天才科技有限公司 Method and system for searching video of specific content
CN109766555B (en) * 2019-01-18 2023-06-27 广东小天才科技有限公司 Method and system for acquiring semantic slots of user sentences
CN109800430B (en) * 2019-01-18 2023-06-27 广东小天才科技有限公司 Semantic understanding method and system
CN112151019A (en) * 2019-06-26 2020-12-29 阿里巴巴集团控股有限公司 Text processing method and device and computing equipment
CN111680136B (en) * 2020-04-28 2023-08-25 平安科技(深圳)有限公司 Method and device for semantic matching of spoken language
CN113064981A (en) * 2021-03-26 2021-07-02 北京达佳互联信息技术有限公司 Group head portrait generation method, device, equipment and storage medium

Citations (3)

Publication number Priority date Publication date Assignee Title
CN103425635A (en) * 2012-05-15 2013-12-04 北京百度网讯科技有限公司 Method and device for recommending answers
CN105095186A (en) * 2015-07-28 2015-11-25 百度在线网络技术(北京)有限公司 Semantic analysis method and device
CN105138575A (en) * 2015-07-29 2015-12-09 百度在线网络技术(北京)有限公司 Analysis method and device of voice text string

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
KR20130068612A (en) * 2011-12-15 2013-06-26 한국전자통신연구원 Apparatus and method for normalizing text

Also Published As

Publication number Publication date
CN105786793A (en) 2016-07-20

Similar Documents

Publication Publication Date Title
CN105786793B (en) Parse the semantic method and apparatus of spoken language text information
CN105654950B (en) Adaptive voice feedback method and device
CN108241667B (en) Method and apparatus for pushed information
CN107766371B (en) Text information classification method and device
CN109117777A (en) The method and apparatus for generating information
CN110209812B (en) Text classification method and device
WO2017024553A1 (en) Information emotion analysis method and system
CN110349564A (en) Across the language voice recognition methods of one kind and device
CN112395420A (en) Video content retrieval method and device, computer equipment and storage medium
US20180329985A1 (en) Method and Apparatus for Compressing Topic Model
CN114549874A (en) Training method of multi-target image-text matching model, image-text retrieval method and device
US10915756B2 (en) Method and apparatus for determining (raw) video materials for news
CN111160007B (en) Search method and device based on BERT language model, computer equipment and storage medium
CN115982376B (en) Method and device for training model based on text, multimode data and knowledge
CN107943895A (en) Information-pushing method and device
US20220121668A1 (en) Method for recommending document, electronic device and storage medium
CN109284367A (en) Method and apparatus for handling text
CN110457694A (en) Message prompt method and device, scene type identification based reminding method and device
CN111861596A (en) Text classification method and device
CN109190123A (en) Method and apparatus for output information
CN106815224A (en) Service acquisition method and apparatus
CN114298007A (en) Text similarity determination method, device, equipment and medium
CN110245357A (en) Principal recognition methods and device
CN109213916A (en) Method and apparatus for generating information
CN109670111A (en) Method and apparatus for pushed information

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant