WO2014203328A1 - Voice data search system, voice data search method, and computer-readable storage medium - Google Patents

Voice data search system, voice data search method, and computer-readable storage medium

Info

Publication number
WO2014203328A1
WO2014203328A1 (PCT/JP2013/066690, JP2013066690W)
Authority
WO
WIPO (PCT)
Prior art keywords
data
search
information
voice
speech
Prior art date
Application number
PCT/JP2013/066690
Other languages
English (en)
Japanese (ja)
Inventor
龍 武田
直之 神田
藤田 雄介
康成 大淵
Original Assignee
株式会社日立製作所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 株式会社日立製作所 filed Critical 株式会社日立製作所
Priority to PCT/JP2013/066690 priority Critical patent/WO2014203328A1/fr
Publication of WO2014203328A1 publication Critical patent/WO2014203328A1/fr

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/54: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for retrieval
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00: Speech recognition
    • G10L15/08: Speech classification or search
    • G10L2015/088: Word spotting

Definitions

  • the present invention relates to a voice data search system, a voice data search method, and a computer-readable storage medium, for example, a technique for searching for a specific keyword from voice data.
  • In call centers, voice call data amounting to thousands of hours a day is recorded; specifically, operator and customer voices are often recorded in pairs. These recordings are kept for operator training and for confirming the content of received calls, and the voice database is used as necessary.
  • Customer voices contain information such as product names, product defects, and complaints, which needs to be listened to efficiently and compiled into reports.
  • Conventionally, information on the time when the voice was recorded is attached to the voice data, and the desired voice data is searched for based on that time information. A search based on time information requires knowing in advance when the desired utterance was made, so it is not suitable for finding a voice containing a specific utterance. To search for a voice containing a specific utterance, the conventional method requires listening to the voice data from beginning to end.
  • In the subword search method, one of the representative methods, speech data is first converted into a subword string by a subword recognition process.
  • A subword is a general term for a unit smaller than a word, such as a phoneme or a syllable.
  • The subword representation of the input keyword is compared with the subword recognition result of the speech data, and the distance between them is calculated according to some criterion. By sorting the search results in descending order using the calculated distance as a score, the times at which the keyword is spoken are detected in the voice data.
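  • As a rough illustration of this subword matching idea (not the patent's exact algorithm; the subword inventory, the sliding-window comparison, the distance metric, and all names below are assumptions), a keyword can be matched against subword recognition results roughly as follows:

```python
from difflib import SequenceMatcher

def subword_distance(a, b):
    """Approximate distance between two subword sequences (0 = identical).
    The patent does not specify the exact metric; a similarity ratio is used here."""
    return 1.0 - SequenceMatcher(None, a, b).ratio()

def search_keyword(keyword_subwords, recognition_results):
    """recognition_results: list of (file_name, start_time, subword_list).
    Returns candidates sorted with the best match (smallest distance) first."""
    n = len(keyword_subwords)
    hits = []
    for file_name, start_time, subwords in recognition_results:
        # slide a window of the keyword's length over the recognition result
        for i in range(max(len(subwords) - n + 1, 1)):
            d = subword_distance(keyword_subwords, subwords[i:i + n])
            hits.append((d, file_name, start_time, i))
    return sorted(hits, key=lambda h: h[0])

# hypothetical usage with phoneme-like subwords
results = [("xxx.wav-0ch", 12.5, ["k", "e", "e", "t", "a", "i"]),
           ("xxx.wav-1ch", 40.2, ["k", "e", "i", "t", "a", "i"])]
print(search_keyword(["k", "e", "i", "t", "a", "i"], results)[:2])
```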
  • Patent Document 1 (Japanese Patent Laid-Open No. 2004-133867) discloses increasing the evaluation value (score value) of a search result obtained with an input search keyword when the search result and the appearance location (time) of a search result for a co-occurrence keyword related to the input search keyword are close to each other.
  • In Patent Document 1, the keyword and its co-occurrence keyword are searched for only in the voice data of the speaker to be searched, and a score is assigned based on the search result.
  • However, subword recognition of the customer's voice is difficult because of noise and the diversity of speaker characteristics, so keyword misdetections increase. For this reason, techniques such as that of Patent Document 1 have the problem that unnecessary search results rise to the top and search accuracy decreases.
  • The present invention has been made in view of such a situation and provides a technique for realizing high-accuracy speech data retrieval.
  • According to the present invention, search accuracy can be improved by correcting the score using the status of the other speaker's speech sections before and after the speech section of a keyword search result, for example, related keywords and the silent section length.
  • (Brief description of drawings, partial:) FIG. 10 is a flowchart explaining processing by the related information data selection unit 1305. FIG. 14 is a diagram illustrating an example of the format of audio data stored in the storage device 1719. Other figures show the schematic structure of a general content cloud system and of the audio data search system.
  • The present invention uses the related keyword information and silent section length information contained in the operator's utterances when extracting a keyword from the customer's voice; that is, the situation of the other speaker's voice sections before and after the speech section of a keyword search result is used to correct the search score value of the search result obtained with the input search keyword, thereby improving search accuracy.
  • The present invention originates from the observation, made by call center practitioners, that when confirming whether customer voice data contains a complaint, for example, they pay attention to the operator's response status, for example, the silent section length and emotion information.
  • The embodiment of the present invention may be implemented by software running on a general-purpose computer, by dedicated hardware, or by a combination of software and hardware.
  • In the following, each piece of information of the present invention is described in a "table" format. However, the information does not necessarily have to be expressed as a table; it may be expressed by a data structure such as a list, a DB, or a queue. Therefore, "table", "list", "DB", "queue", and the like may simply be referred to as "information" to indicate that they do not depend on the data structure.
  • In the following, processing may be described with a "program" as the subject (operation subject). Since a program is executed by a processor and performs predefined processing while using a memory and a communication port (communication control device), the description may equally be made with the processor as the subject. Processing disclosed with a program as the subject may be processing performed by a computer such as a management server or an information processing apparatus. Part or all of a program may be realized by dedicated hardware, and programs may be modularized.
  • Various programs may be installed in each computer by a program distribution server or a storage medium.
  • the first embodiment relates to a stand-alone voice data retrieval apparatus.
  • FIG. 1 is a diagram showing a configuration of a speech data retrieval apparatus 1 according to the first embodiment of the present invention.
  • The speech data search apparatus 1 includes learning-labeled speech data (storage unit) 101, an acoustic model/language model learning unit 102, an acoustic model (storage unit) 103, a language model (storage unit) 104, search target data (storage unit) 105, an indexing/speech information extraction unit 106, index/speech information data (storage unit) 107, a dialogue order analysis unit 108, dialogue order data (storage unit) 109, a keyword input unit 110, a related information input unit 111, a candidate position evaluation unit 112, a search result integration unit 113, and a search result display unit 114.
  • The learning-labeled speech data (storage unit) 101 is learning data prepared in advance and stores speech waveforms of an unspecified number of speakers together with labeled text transcribing the utterance content. As long as transcribed text accompanies the voice data, it may be an audio track extracted from TV, a read-speech corpus, or ordinary conversation. An ID identifying the speaker and labels such as the presence or absence of noise may of course also be attached.
  • The acoustic model/language model learning unit 102 estimates the parameters of each statistical model using the learning-labeled speech data 101.
  • The problem of recognizing speech data can be reduced, for example, to a posterior probability maximization search problem.
  • a solution is obtained based on an acoustic model and a language model learned from a large amount of learning data.
  • Processing for estimating parameters of the acoustic model and the language model is performed using the learning-labeled speech data 101.
  • An HMM (Hidden Markov Model) may be adopted as the acoustic model, and an N-gram model may be adopted as the language model.
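  • For reference, the posterior probability maximization mentioned above is conventionally written with the standard Bayes decomposition (this equation is added here for clarity and is not quoted from the patent text):

$$
\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} \frac{P(X \mid W)\,P(W)}{P(X)} = \arg\max_{W} P(X \mid W)\,P(W)
$$

where X is the observed acoustic feature sequence, P(X | W) is given by the acoustic model (for example, an HMM) and P(W) by the language model (for example, an N-gram model).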
  • The acoustic model (storage unit) 103 stores the parameters of a statistical model expressing acoustic features (for example, the characteristics of the sound "a").
  • The language model (storage unit) 104 stores the parameters of a statistical model expressing linguistic features, that is, characteristics of the connections between words (for example, that the particle "ha" follows the word "dinner", or that the word "dinner" tends to be followed by "eat").
  • The search target data (storage unit) 105 stores the voice data to be searched, such as audio extracted from TV, conference audio, and audio recorded on a telephone line (for example, call records).
  • The audio data may be recorded in multiple files by type, may contain multiple channels, and may be given metadata information such as a speaker identification ID.
  • The indexing/speech information extraction unit 106 detects utterance sections in the search target data 105, performs subword recognition using the acoustic model 103 and the language model 104, generates index/speech information data containing the subword recognition results, a subword-based N-gram index, and other information, and stores it in the index/speech information data (storage unit) 107.
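  • A minimal sketch of how such a subword N-gram index might be built (the data layout, the value of N, and the names are assumptions; the patent only specifies that N-grams are paired with utterance IDs and appearance positions):

```python
from collections import defaultdict

def build_ngram_index(utterances, n=3):
    """utterances: dict mapping S-ID -> subword list (the recognition result).
    Returns a dict mapping each subword n-gram -> list of (S-ID, position) pairs."""
    index = defaultdict(list)
    for s_id, subwords in utterances.items():
        for pos in range(len(subwords) - n + 1):
            index[tuple(subwords[pos:pos + n])].append((s_id, pos))
    return index

# hypothetical subword recognition results
utterances = {0: ["k", "e", "i", "t", "a", "i"], 5: ["a", "k", "e", "i", "t", "a"]}
print(build_ngram_index(utterances)[("k", "e", "i")])  # -> [(0, 0), (5, 1)]
```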
  • The dialogue order analysis unit 108 reads the utterance section information, the audio file channel information, and the metadata information detected by the indexing/speech information extraction unit 106 from the index/speech information data (storage unit) 107, and uses these pieces of information to generate dialogue order data, which it stores in the dialogue order data (storage unit) 109. More specifically, referring to the metadata, it identifies whose utterance appears after the utterance of a given person and associates the index voice data with this dialogue order information. For example, if the stored data is call recording data, the two sides of a call may be recorded in different channels of the same audio file, or the conversations of multiple speakers may be recorded in separate files linked by metadata. Here, a set of files in which conversations held at the same time are recorded is first obtained based on the channel information and the metadata information. This is the preprocessing part of the speech data retrieval apparatus.
  • the keyword input unit 110 receives a search keyword input by the user, converts it to a subword string if necessary, and outputs the converted subword string to the candidate position evaluation unit 112.
  • The related information input unit 111 receives and analyzes data input by the user (related words and related information of the search keyword), and outputs to the candidate position evaluation unit 112 various parameters such as the related keywords to be used in the search, silent section information, and weights. The related words can include, for example, station names, departure times, and route names. The related information includes silent section length information and utterance length information (for example, the information that the customer's utterance time is more than twice the operator's utterance time).
  • The candidate position evaluation unit 112 receives the subword string of the search keyword output from the keyword input unit 110, the search-related keywords and silent section information output from the related information input unit 111 together with their parameters (hereinafter, related information), and the index/speech information data, and uses them to compute search candidate positions and their scores.
  • The search result integration unit 113 sorts the search candidates output by the candidate position evaluation unit 112 based on their scores and outputs them to the search result display unit 114 as the search results.
  • The search result display unit 114 formats the file name, time, score, and the like at which each search candidate appears, and transmits the search results output by the search result integration unit 113 to the output device.
  • the steps up to here are the part of the search process in the voice data search apparatus 1.
  • the sorting algorithm can use a well-known quick sort, radix sort, or the like.
  • The sorted search results include the file name, time, and score at which each search candidate is determined to have been uttered. These search results are sent to the search result display unit 114, but it is also possible to send them to another application instead.
  • The search result display unit 114 converts the search results, in descending order of score, into the display format of the display device and displays them on the display.
  • the voice data search device 1 has been described as a single device, but may be configured by a system including a terminal (browser) and a computer (server).
  • the terminal (browser) executes processing of the keyword input unit 110 and the search result display unit 114
  • the computer (server) executes processing of other processing units.
  • the search target data 105, the learning-labeled voice data 101, the acoustic model 103, the language model 104, the index / voice information data 107, and the dialogue order data 109 are stored and generated in the same apparatus.
  • Alternatively, the search target data 105 may be stored in external storage, and the index/voice information data 107, the dialogue order data 109, the acoustic model 103, and the language model 104 may be created in advance by another computer and copied to the computer that executes the search processing.
  • FIG. 2 is a flowchart for explaining audio data registration processing executed by the indexing / audio information extracting unit 106 in the present embodiment of the present invention.
  • The indexing/voice information extraction unit 106 selects all voices (data for each channel; in telephone conversation data, for example, the uplink channel carries the customer's utterances and the downlink channel the operator's utterances).
  • The audio data of the files of the search target data 105 are then divided into sections of appropriate length (step 202). For example, when the audio power stays at or below a predetermined threshold θp for at least a predetermined duration θt, the audio data may be divided at that position.
  • FIG. 3 shows audio data divided in this way. In FIG. 3, each audio section is given information indicating the original file together with start time (301) and end time (302) information.
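  • A minimal sketch of this power-based splitting criterion (the frame length, the power computation, and the handling of the thresholds θp and θt are assumptions; the patent only states the criterion itself):

```python
import numpy as np

def split_on_silence(samples, sr, theta_p=1e-3, theta_t=0.5, frame=0.02):
    """Split audio wherever the power stays <= theta_p for at least theta_t seconds.
    Returns a list of (start_sec, end_sec) voice sections."""
    hop = int(frame * sr)
    n_frames = max(len(samples) // hop, 1)
    power = np.array([np.mean(samples[i * hop:(i + 1) * hop] ** 2) for i in range(n_frames)])
    silent = power <= theta_p
    min_run = int(np.ceil(theta_t / frame))
    sections, start, i = [], 0, 0
    while i < n_frames:
        if silent[i]:
            j = i
            while j < n_frames and silent[j]:
                j += 1
            if j - i >= min_run:            # silence long enough: cut here
                if i > start:
                    sections.append((start * frame, i * frame))
                start = j
            i = j
        else:
            i += 1
    if start < n_frames:
        sections.append((start * frame, n_frames * frame))
    return sections
```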
  • For detecting utterance sections, a method using the number of zero crossings, a method using a GMM (Gaussian Mixture Model), a method using a voice recognition technique, or the like can be used.
  • In addition, voice information such as emotion and speech-rate information may be extracted. Since these can be realized by combining known techniques, details are omitted.
  • Next, the indexing/speech information extraction unit 106 performs subword conversion processing on all speech sections (step 203). Specifically, the audio data is converted into subword units. The converted subword string (subword recognition result), the times corresponding to the subword N-grams, speech section information, other voice information (such as voice time length), and metadata (speaker ID, operator ID, date, customer telephone number information, channel information of each speaker, etc.) are then stored in the index/voice information data 107 (step 204).
  • the audio data registration process may be performed only once during the initial operation. When this voice data registration process is completed, a keyword search becomes possible.
  • only the so-called 1-best recognition result is stored in the index table, but a plurality of speech recognition results may be output in the N-best format or the network format.
  • FIG. 4 is a diagram illustrating a configuration example of information stored in the index / audio information data (storage unit) 107.
  • ID 401 is the management number of the database and indicates the ID of the audio file.
  • File name-ch 402 is an audio file name and channel number.
  • For example, "xxx.wav 0ch" indicates the file name and the channel number of the operator's utterances, and "xxx.wav 1ch" indicates the file name and the channel number of the customer's utterances.
  • The N-gram index 403 is a column recording, for each subword N-gram of the audio file, pairs of the S-ID (an ID included in the subword recognition result) and the appearance position. From the N-gram index 403 in FIG. 4, it can be seen that the subword N-gram "w-En" appears at index position 0 of the subword sequence with S-ID 0 and at index position 11 of the subword sequence with S-ID 5.
  • The subword recognition result 404 is information including the S-ID, which is the ID of the subword sequence, and the subword string information.
  • For each voice file, the utterance section ID (S-ID) within the file, the subword recognition result of the section, and the utterance section and its length are recorded.
  • FIG. 6 is a diagram illustrating an example of a conversation on the same time.
  • a number 601 is assigned to each utterance section of each file.
  • Utterances existing around a given utterance section of a given file are linked. This may be done by focusing on each utterance section and listing the utterance sections of other audio files or channels that fall within an appropriate time range. For example, it can be seen that utterance sections 2 and 3 of xxx.wav ch1 exist in the vicinity (before and after) of utterance section 0 of xxx.wav ch0.
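  • As a rough sketch of this linking step (the time window and the data layout are assumptions), the utterance sections of the other channel that fall near each section can be collected like this:

```python
def link_nearby_sections(sections_a, sections_b, margin=2.0):
    """sections_a / sections_b: lists of (section_id, start_sec, end_sec) for the two
    channels of the same call. Returns {id in a: [ids in b within +/- margin seconds]}."""
    links = {}
    for sid_a, start_a, end_a in sections_a:
        lo, hi = start_a - margin, end_a + margin
        links[sid_a] = [sid_b for sid_b, start_b, end_b in sections_b
                        if start_b < hi and end_b > lo]  # the time ranges overlap
    return links

# hypothetical utterance sections: (id, start, end)
operator = [(0, 0.0, 3.5), (1, 8.0, 10.0)]
customer = [(2, 3.0, 5.0), (3, 5.2, 6.5)]
print(link_nearby_sections(operator, customer))  # -> {0: [2, 3], 1: [3]}
```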
  • The related information input unit 111 further determines whether or not the input is a subword string (step 1002).
  • If it is not, the related information input unit 111 converts the corresponding word into a subword string (step 1003).
  • The candidate position evaluation unit 112 determines whether or not silent section information is included in the input data from the related information input unit 111 (specifically, whether or not a flag indicating that silent section information is to be used is included in the input data) (step 1105). If silent section information is included, the process proceeds to step 1106; if not, the process proceeds to step 1107.
  • The candidate position evaluation unit 112 then corrects the search candidate score calculated in step 1102 using the various scores calculated in step 1104 and/or step 1107 (step 1107). Specifically, the corrected score of the candidate section is calculated according to Formula 1 (the formula itself is not reproduced in this text).
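  • The exact form of Formula 1 is not included in this text. Purely as a hedged illustration of the kind of correction described (the functional form and the symbols below are assumptions, not the patent's formula), such a correction can be written as a weighted sum:

$$
S_{\text{corrected}}(c) = S_{\text{keyword}}(c) + \lambda_{\text{rel}}\, S_{\text{related}}(c) + \lambda_{\text{sil}}\, S_{\text{silence}}(c)
$$

where S_keyword(c) is the score of candidate c from step 1102, S_related(c) and S_silence(c) are the related-keyword and silent-section scores computed from the surrounding sections of the other speaker, and the λ weights correspond to the weight parameters supplied through the related information input unit 111.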
  • FIG. 13 is a diagram illustrating a configuration example of the voice data search device 2 according to the second embodiment.
  • the blocks given the same reference numerals shown in FIG. 1 already described have the same functions, and their descriptions are omitted.
  • The voice data search device 2 corresponds to the speech data retrieval apparatus 1 according to the first embodiment with the addition of learning-data-labeled speech 1301, text data 1302, a related information data construction unit 1303, related information data (storage unit) 1304, and a related information data selection unit 1305.
  • The related information data construction unit 1303 uses the learning-data-labeled speech 1301 and the text data 1302 to analyze the relationships between co-occurring words, and between words and silent section lengths, and stores this information as the related information data 1304.
  • The related information data construction unit 1303 assigns attribute values to each word, for example information such as anger, emotion, product name, and part of speech, using an emotion word dictionary and a product name list (step 1401).
  • Next, for each word in the word dictionary, the related information data construction unit 1303 enumerates all of its appearance positions from the transcription labels attached to the learning-data-labeled speech 1301, counts all the utterance section lengths of the other speakers around the utterances containing the word, and calculates statistics such as the average and variance (step 1403).
  • The silent section length may be a value obtained by manually listening to the audio file, or a value detected automatically using a speech section detection technique. The appearance frequency of the silent section length, or its prior probability itself, corresponds to the weight parameter used in the score.
  • related words are managed by word IDs. These values are all generated by the related information data construction unit 1303.
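  • A small sketch of the statistics described above (the record layout and the choice of mean and variance follow the description; everything else is an assumption):

```python
from statistics import mean, pvariance

def silence_stats_per_word(occurrences):
    """occurrences: list of (word, silent_section_lengths_around_that_occurrence).
    Returns {word: (mean_length, variance)} pooled over all occurrences of the word."""
    pooled = {}
    for word, lengths in occurrences:
        pooled.setdefault(word, []).extend(lengths)
    return {w: (mean(v), pvariance(v)) for w, v in pooled.items() if v}

# hypothetical labeled occurrences
occ = [("complaint", [1.2, 0.9]), ("complaint", [1.5]), ("order", [0.2, 0.3])]
print(silence_stats_per_word(occ))
```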
  • Each entry records which word (phrase) ID is related to which other word (phrase) ID. That is, for example, when the customer utters "Do not play", the operator often utters "I am sorry", so the latter is registered as the related voice information of the former.
  • In step 1602, the related information data selection unit 1305 acquires information related to the input search keyword (related words, silent sections, and their parameters) from the related information data (storage unit) 1304.
  • If the keyword itself is not registered, the related information data selection unit 1305 selects a group of similar words using the subword distance and the attributes, and stores that information.
  • The parameters of the input keyword are then predicted and output.
  • Alternatively, the related speech information and parameters of the registered word with the smallest phoneme distance to the input keyword may be output. For example, if "I'm sorry" is registered in the related information data but the input word is a similar expression that is not itself registered, the related voice information for "I'm sorry" is output.
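  • A minimal sketch of this fallback selection by subword distance (the distance metric, the data layout, and the example entries are assumptions):

```python
from difflib import SequenceMatcher

def select_related_info(keyword_subwords, related_db):
    """related_db: dict mapping a registered word -> (subword_list, related_info).
    Returns the entry of the registered word closest to the input keyword."""
    def distance(a, b):
        return 1.0 - SequenceMatcher(None, a, b).ratio()
    word, (_, info) = min(related_db.items(),
                          key=lambda kv: distance(keyword_subwords, kv[1][0]))
    return word, info

# hypothetical registered entry and an unregistered input keyword
db = {"sumimasen": (list("sumimasen"), {"related": ["moushiwake arimasen"], "weight": 0.8})}
print(select_related_info(list("suimasen"), db))
```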
  • a third embodiment relates to a system that can be introduced into a call center by adding a telephone line call recording device to the voice data search device 1.
  • FIG. 17 is a diagram illustrating a configuration example of a speech data search system according to the third embodiment.
  • the voice data search system 3 according to the third embodiment corresponds to an example in which the voice data search device 1 according to the first embodiment is applied to a call center.
  • The call recording device 1704 has a general-purpose computer configuration including a CPU, a memory, and a control program. The call recording device 1704 acquires a voice signal containing only the customer's utterances from the PBX device 1703 or from the telephone 1702 used by the operator, and acquires a voice signal containing only the operator's utterances from the telephone 1702. It is also possible to acquire the operator-only voice signal by separately preparing a headset and a recording device. The call recording device 1704 then A/D-converts the customer-only voice signal and the operator-only voice signal into digital data such as the WAV format. The conversion to audio data may be performed by real-time processing. The resulting search target data 1706 is stored in the storage device 1719 together with the call management data 1705.
  • the storage device 1720 stores at least a language model 1707, an acoustic model 1708, index / voice information data 1709, and dialogue order data 2210 as data used in the search.
  • The search target data 1706 may be accessed at regular intervals so that only the difference (newly added) data is indexed and added to the index/voice information data 1709 (index table).
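  • A rough sketch of such periodic differential indexing (the polling loop, the bookkeeping of already indexed files, and the function names are assumptions):

```python
import time

def run_incremental_indexing(list_audio_files, index_audio, poll_seconds=600):
    """Index only audio files that have not been indexed yet, at regular intervals.
    list_audio_files() -> iterable of file paths; index_audio(path) -> None."""
    indexed = set()
    while True:
        for path in list_audio_files():
            if path not in indexed:
                index_audio(path)      # subword recognition + N-gram index update
                indexed.add(path)
        time.sleep(poll_seconds)
```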
  • the content cloud system targets data in any format such as audio data 1901, medical data 1901, and / or mail data 1901 as input.
  • the various data are, for example, call center call voice, mail data, document data, and the like, and may be structured or not.
  • Data input to the content cloud system is temporarily stored in the content storage 1902.
  • The content storage 1904 stores the information extracted by the ETL 1903 as well as the pre-processing data temporarily stored in the content storage 1902.
  • In the case of a text search, for example, the search engine 1905 searches the text based on the index created by the ETL 1903 and transmits the search results to the application program 1908.
  • a publicly known technique can be applied to the search engine and its algorithm.
  • the search engine may include a module that searches not only text but also data such as images and sounds.
  • In the multimedia server 1907, the pieces of metadata extracted by the ETL 1903 are associated with each other, and the metadata is structured in a graph format and stored.
  • For example, the original voice file, image data, related words, and the like are expressed in a network format around the voice recognition result "apple" stored in the content storage 1904.
  • When the multimedia server 1907 receives a request from the application 1908, it transmits the meta information corresponding to the request to the application 1908. For example, for a request about "apple", related meta information such as an image of an apple, an average market price, and an artist's song title is provided based on the constructed graph structure.
  • FIG. 20 is a diagram showing a schematic configuration of a voice data search system realized by incorporating the function of the voice data search device 1 into the content cloud system.
  • The various functions of the speech data retrieval apparatus 1 are modularized into an indexing module (the indexing/speech information extraction unit 106 and the dialogue order analysis unit 108) and a search module (the keyword input unit 110, the related information input unit 111, the candidate position evaluation unit 112, and the search result integration unit 113).
  • the acoustic model 103 and the language model 104 are created in advance by another computer and copied to the content cloud system.
  • the indexing module 2001 can be registered in the ETL 1903
  • the search module 2002 can be registered in the multimedia server 1907.
  • Upon a request, the search module 2002 uses the index/voice information data 2003 (corresponding to 107) and returns a list of the file names, times, and scores at which the keyword is spoken.
  • The processing of the indexing module and of the search module 2002 corresponds to parts of the processing of the voice data search apparatus 1 already described, so it is not repeated here.
  • Alternatively, the search module 2002 can be placed in the search engine 1905. In this case, when a request is made from the application program 1908 to the search engine 1905, the search module 2002 transmits the file names, times, and scores at which the keyword is spoken in the voice data to the search engine 1905.
  • The voice data search device 2 according to the second embodiment can also be introduced into the system according to the third embodiment or incorporated into the content cloud system according to the fourth embodiment.
  • dialogue order data indicating the utterance order of the voice segment data of the search target data is generated based on the voice file channel information and the voice metadata information included in the index voice information data.
  • the score value (first score value) between the search keyword and the voice section data included in the index voice information data is calculated, and a plurality of search result candidates are acquired.
  • the voice segment data around each of the plurality of search result candidates is specified based on the dialogue order data.
  • Related information related to the search keyword is acquired (either input by the user or acquired from the related information data storage unit (DB)), and a score value (second score value) between the related information and the voice section data around each search result candidate is calculated.
  • the first score value is corrected using the second score value, and a plurality of search result candidates are sorted and output using the corrected score value.
  • the score value between the search keyword and the search target data is corrected with the score value based on the related information, so that the search accuracy can be improved.
  • As the related information, not only related words of the search keyword (words with high co-occurrence) but also information such as the silent section length, the speech section length of the search target, and the ratio of the lengths of the other speaker's voice sections before and after the speech section can be used.
  • When the score value correction is performed using the silent section length information, the second score value is calculated from the relative relationship of the silent section lengths around the speech sections containing each search candidate (specified by the dialogue order data). In this way, a highly accurate search can be realized.
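  • Pulling these steps together, a compact sketch of the overall correction flow could look as follows (every object, method, and weight here is an assumption standing in for the corresponding unit of the apparatus):

```python
def search_with_correction(keyword, index, dialogue_order, related_info, weight=0.5):
    """index.search(keyword) -> [(candidate, first_score)];
    dialogue_order.surrounding(candidate) -> other-speaker sections around the hit;
    related_info.score(sections) -> second score from related words / silence lengths."""
    results = []
    for candidate, first_score in index.search(keyword):
        sections = dialogue_order.surrounding(candidate)
        second_score = related_info.score(sections)
        results.append((candidate, first_score + weight * second_score))
    return sorted(results, key=lambda r: r[1], reverse=True)
```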
  • A related information database storing related information may be provided. In that case, related information related to the search keyword is acquired from the related information database and the search candidate score values can be corrected with it, so that the search accuracy can be improved.
  • the present invention can also be realized by software program codes that implement the functions of the embodiments.
  • a storage medium in which the program code is recorded is provided to the system or apparatus, and the computer (or CPU or MPU) of the system or apparatus reads the program code stored in the storage medium.
  • the program code itself read from the storage medium realizes the functions of the above-described embodiments, and the program code itself and the storage medium storing the program code constitute the present invention.
  • As a storage medium for supplying such program code, for example, a flexible disk, CD-ROM, DVD-ROM, hard disk, optical disk, magneto-optical disk, CD-R, magnetic tape, nonvolatile memory card, ROM, or the like is used.
  • An OS (operating system) or the like running on the computer may perform part or all of the actual processing based on the instructions of the program code, and the functions of the above-described embodiments may be realized by that processing.
  • Further, the program code may be stored in a storage means such as a hard disk or a memory of the system or apparatus, or on a storage medium such as a CD-RW or CD-R, and the computer (or CPU or MPU) of the system or apparatus may read and execute the program code stored in that storage means or storage medium at the time of use.
  • The control lines and information lines shown are those considered necessary for the explanation; not all control lines and information lines of the product are necessarily shown. In practice, all the components may be connected to each other.
  • 101 Learning-labeled speech data
  • 102 Acoustic model/language model learning unit
  • 103 Acoustic model
  • 104 Language model
  • 105 Search target data
  • 106 Indexing/speech information extraction unit
  • 107 Index/speech information data
  • 108 Dialogue order analysis unit
  • 109 Dialogue order data
  • 110 Keyword input unit
  • 111 Related information input unit
  • 112 Candidate position evaluation unit
  • 113 Search result integration unit
  • 114 Search result display unit

Abstract

A technique for realizing accurate voice data search is provided. The present invention receives a search keyword and calculates first score values, which are score values between the search keyword and pieces of voice section data included in indexed voice information data, thereby obtaining a plurality of search result candidates. It then identifies the pieces of voice section data adjacent to each of the plurality of search result candidates on the basis of dialogue order data. Furthermore, it obtains information related to the search keyword and calculates second score values, which are score values between the related information and the adjacent pieces of voice section data. It then corrects the first score values using the second score values, sorts the plurality of search result candidates using the corrected score values, and outputs the sorted candidates.
PCT/JP2013/066690 2013-06-18 2013-06-18 Voice data search system, voice data search method, and computer-readable storage medium WO2014203328A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/JP2013/066690 WO2014203328A1 (fr) 2013-06-18 2013-06-18 Voice data search system, voice data search method, and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/JP2013/066690 WO2014203328A1 (fr) 2013-06-18 2013-06-18 Voice data search system, voice data search method, and computer-readable storage medium

Publications (1)

Publication Number Publication Date
WO2014203328A1 true WO2014203328A1 (fr) 2014-12-24

Family

ID=52104095

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/JP2013/066690 WO2014203328A1 (fr) 2013-06-18 2013-06-18 Voice data search system, voice data search method, and computer-readable storage medium

Country Status (1)

Country Link
WO (1) WO2014203328A1 (fr)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2007218933A (ja) * 2006-02-14 2007-08-30 Hitachi Ltd Conversation speech analysis method and conversation speech analysis apparatus
JP2010267012A (ja) * 2009-05-13 2010-11-25 Hitachi Ltd Voice data search system and voice data search method
JP2011070192A (ja) * 2009-09-22 2011-04-07 Ricoh Co Ltd Voice search device and voice search method

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108027823A (zh) * 2015-07-13 2018-05-11 帝人株式会社 Information processing device, information processing method, and computer program
WO2020121115A1 (fr) * 2018-12-13 2020-06-18 株式会社半導体エネルギー研究所 Content classification method and classification model generation method
CN110211592A (zh) * 2019-05-17 2019-09-06 北京华控创为南京信息技术有限公司 Intelligent voice data processing device and method
CN112069796A (zh) * 2020-09-03 2020-12-11 阳光保险集团股份有限公司 Voice quality inspection method and apparatus, electronic device, and storage medium
CN112069796B (zh) * 2020-09-03 2023-08-04 阳光保险集团股份有限公司 Voice quality inspection method and apparatus, electronic device, and storage medium
CN113204685A (zh) * 2021-04-25 2021-08-03 Oppo广东移动通信有限公司 Resource information acquisition method and apparatus, readable storage medium, and electronic device
CN115132198A (zh) * 2022-05-27 2022-09-30 腾讯科技(深圳)有限公司 Data processing method and apparatus, electronic device, program product, and medium
CN115132198B (zh) * 2022-05-27 2024-03-15 腾讯科技(深圳)有限公司 Data processing method and apparatus, electronic device, program product, and medium


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13887474

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 13887474

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: JP