JP2003255981A

JP2003255981A - Method, device and program for providing summary information

Info

Publication number: JP2003255981A
Application number: JP2002058447A
Authority: JP
Inventors: Kota Hidaka; 浩太日▲高▼; Shinya Nakajima; 信弥中嶌; Osamu Mizuno; 理水野; Hidekatsu Kuwano; 秀豪桑野; Haruhiko Kojima; 治彦児島
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2002-03-05
Filing date: 2002-03-05
Publication date: 2003-09-10
Anticipated expiration: 2022-03-05
Also published as: JP3803301B2

Abstract

PROBLEM TO BE SOLVED: To provide a summary distribution method that lets an employer assess the talent of many job seekers without spending a long time.

SOLUTION: Speech waveform information obtained from a speaker's speech, video information obtained by filming the speaker while speaking, and personal information entered by the speaker are stored in a database. The stored personal information is searched according to retrieval conditions requested by a user, and the speakers who meet those conditions are extracted. The stressed speech sections of each such speaker are extracted from the speaker's speech information as summary speech information, the video information corresponding to the summary speech information is extracted, and the video information and summary speech information are distributed to the user.

COPYRIGHT: (C)2003,JPO

Description

Detailed Description of the Invention

[0001]

[Technical Field of the Invention] The present invention relates to a summary information providing method, a summary information providing apparatus, and a summary information providing program that determine the main parts of the content of audio accompanied by video. It is applied, for example, to a talent-finding system that automatically generates self-promotion videos of job seekers and reduces the burden of recruiting.

[0002]

[Prior Art] Conventionally, there have been job placement systems installed at job placement offices such as public employment security offices, with which employers and job seekers independently transmit and exchange various kinds of information, and which support interview reservations and simple interviews by videophone. Such a system is disclosed, for example, in Japanese Patent Laid-Open No. 11-143957. There has also been a system that connects applicants and companies via a network and interactively manages information about the applicants and the companies. Such a system is disclosed, for example, in Japanese Patent Laid-Open No. 2001-202407.

[0003] There has also been a registration system for temporary workers that diagnoses a worker's aptitude and capability from qualifications held, years of practical experience, desired job type, history information, and the like. Such a system is disclosed, for example, in Japanese Patent Laid-Open No. 2001-229278. For marriage introduction and similar services, there has also been a computer network that uses text data such as age, height, weight, siblings, educational background, occupation, income, hobbies, and special skills, together with self-introduction audio and video where needed, and that requires no intermediary. Such a network is disclosed, for example, in Japanese Patent Laid-Open No. 6-19926.

[0004] There has also been a job-seeking and job-offer information system that automatically extracts skills, duties handled, and the like according to the input items, automatically extracts self-promotion points, and, when a job seeker enters a personal work history, creates the summary that the company wants. Such a system is disclosed, for example, in Japanese Patent Laid-Open No. 2001-142939. There has also been a system for auditions and the like, in which applicants can easily submit appeal information covering their self-summary, range of hobbies, thinking, expressiveness, singing ability, and other talents, and in which the talent-scouting side searches that information. Such a system is disclosed, for example, in Japanese Patent Laid-Open No. 2000-305980.

[0005]

[Problems to Be Solved by the Invention] When an employer recruits, for example, the employer screens each job seeker's documents and interviews the job seeker one or more times before deciding whether to hire. After the interview, only the documents are retained, and the impression the job seeker made survives only in the employer's memory. Even when the interview has been recorded on video, watching all of the recorded video is impractical because it takes too much time. Methods such as a first-round selection based on documents exist, but they serve only to save time. Without meeting the job seeker in person or watching video of the job seeker, it is impossible to judge whether the job seeker fails to meet the employer's wishes.

[0006] In Japanese Patent Laid-Open No. 2001-142939, skills, duties handled, and the like are automatically extracted according to the input items, self-promotion points are automatically extracted, and, when a job seeker enters a personal work history, processing such as creating the summary that the company wants is performed. These results, however, are all derived from text information, and it is impossible to judge, for example, a self-promotion from text information alone. Japanese Patent Laid-Open No. 2001-229278 judges the aptitude and ability of temporary workers, but hiring is not decided on aptitude and ability alone; if it could be, interviews would be unnecessary. A recruiting system that depends on text information merely simplifies the first-round selection described above and cannot be called a useful method.

[0007] Japanese Patent Laid-Open No. 6-19926 registers photographs, video, and the like in addition to text-based personal data, and Japanese Patent Laid-Open No. 2000-305980 develops a system in which applicants to auditions and the like can present their appearance and so on using images and video. Ultimately, however, these systems require time to play back the recordings, and there is a limit to how quickly the gist can be grasped even with functions such as fast-forward. Japanese Patent Laid-Open No. 2001-202407 connects job seekers and employers via a network and manages information interactively, but its configuration does not reduce the burden of recruiting. Japanese Patent Laid-Open No. 11-143957 realizes simple interviews by videophone or the like, but these merely take place over a network; apart from eliminating travel to the interview location, they do not reduce the employer's recruiting burden.

[0008] The present invention has been made in view of the above drawbacks of the prior art. Its object is to provide a summary information providing method, a summary information providing apparatus, and a summary information providing program that efficiently summarize video supplied by an information provider who provides personal information, such as a job seeker, so that a large amount of information can be browsed in a short time and the work of searching a large body of information for items that match a purpose is drastically reduced.

[0009]

[Means for Solving the Problems] To solve the above problems, the most important feature of the present invention is that it provides a summary information providing method that summarizes video with audio supplied by an information provider. The invention proposes a summary information providing method that uses (i) data storage means that stores, item by item, an audio signal recorded simultaneously with a video signal in association with attribute information of that audio signal, and (ii) a codebook that stores feature quantities, including at least a fundamental frequency or pitch period, power, a temporal variation characteristic of a dynamic feature quantity, or inter-frame differences of these, in association with their appearance probabilities in the emphasized state. In this method, desired attribute information is input; the attribute information that satisfies the conditions indicated by the desired attribute information, and the item-by-item video and audio signals corresponding to that attribute information, are read from the data storage means; the appearance probability in the emphasized state corresponding to the feature quantities obtained by analyzing the audio signal frame by frame is obtained; the probability of being in the emphasized state is calculated from that appearance probability; an audio signal section whose probability of being in the emphasized state is greater than a predetermined probability is determined to be a summary section; and the video signal of the summary section and at least part of the read attribute information are output.

[0010] The invention further proposes a summary information providing method in which the summary sections are determined as follows. The codebook also stores appearance probabilities in the calm state, in correspondence with the feature quantities (including at least a fundamental frequency or pitch period, power, a temporal variation characteristic of a dynamic feature quantity, or inter-frame differences of these) and the appearance probabilities in the emphasized state. The appearance probability in the calm state corresponding to the feature quantities obtained by analyzing the audio signal frame by frame is obtained, and the probability of being in the calm state is calculated from it. The ratio of the probability of the emphasized state to the probability of the calm state is calculated for each audio signal section. The durations of the audio signal sections are accumulated in descending order of this probability ratio to obtain the total duration of the summary sections, and the audio signal sections for which this total reaches a predetermined summary time are determined to be the summary sections.
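The selection rule in paragraph [0010] can be illustrated with a minimal Python sketch. It assumes that each audio signal section already carries a duration, a probability of the emphasized state, and a non-zero probability of the calm state. The function name `select_summary_sections`, the tuple layout, and the choice to stop as soon as the accumulated time reaches the target are illustrative assumptions, not details given in the text.

```python
from typing import List, Tuple


def select_summary_sections(
    sections: List[Tuple[float, float, float]],
    target_seconds: float,
) -> List[int]:
    """Pick sections for the summary.

    sections: one (duration_seconds, p_emphasized, p_calm) tuple per audio
    signal section (hypothetical layout; p_calm is assumed to be > 0).
    Returns the indices of the chosen sections in their original time order.
    """
    # Rank sections by the ratio P(emphasized) / P(calm), largest first.
    ranked = sorted(
        range(len(sections)),
        key=lambda i: sections[i][1] / sections[i][2],
        reverse=True,
    )
    chosen: List[int] = []
    total = 0.0
    for i in ranked:
        # Assumed stopping rule: stop once the accumulated time reaches the target.
        if total >= target_seconds:
            break
        chosen.append(i)
        total += sections[i][0]
    # Play the chosen sections back in their original order.
    return sorted(chosen)


if __name__ == "__main__":
    demo = [(5.0, 0.2, 0.8), (4.0, 0.9, 0.1), (6.0, 0.6, 0.4), (3.0, 0.7, 0.3)]
    print(select_summary_sections(demo, target_seconds=7.0))
```

Because whole sections are added, the accumulated time can overshoot the target; a real system would need a policy for that case, which the text does not specify here.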

[0011] The invention further proposes a summary information providing method in which each frame of the audio signal is judged as to whether it is a silent section and whether it is a voiced section; a portion that contains a voiced section and is bounded by silent sections of at least a predetermined number of frames is determined to be a speech sub-paragraph; a group of speech sub-paragraphs that ends with a speech sub-paragraph in which the average power of the voiced sections is smaller than a predetermined constant times the average power within that sub-paragraph is determined to be a speech paragraph; the audio signal sections are defined per speech paragraph; the summary time is obtained by accumulating over speech paragraphs; and the video and audio signals of the summary sections are output per speech paragraph in descending order of the probability of the emphasized state or of the probability ratio.

[0012] The invention further proposes a summary information providing apparatus that uses (i) data storage means that stores, item by item, an audio signal recorded simultaneously with a video signal in association with attribute information of that audio signal, and (ii) a codebook that stores feature quantities, including at least a fundamental frequency or pitch period, power, a temporal variation characteristic of a dynamic feature quantity, or inter-frame differences of these, in association with their appearance probabilities in the emphasized state. The apparatus receives desired attribute information and reads from the data storage means the attribute information that satisfies the conditions indicated by the desired attribute information, together with the item-by-item video and audio signals corresponding to that attribute information. It comprises an emphasized state probability calculation unit that obtains the appearance probability in the emphasized state corresponding to the feature quantities obtained by analyzing the audio signal frame by frame and calculates the probability of being in the emphasized state from that appearance probability; a summary section determination unit that determines an audio signal section whose probability of being in the emphasized state is greater than a predetermined probability to be a summary section; and an output unit that outputs the video signal of the summary section and at least part of the read attribute information.

[0013] The invention further proposes a summary information providing program that is written in computer-readable code and executes any of the above summary information providing methods on a computer. [Operation] According to the invention, the speech summarizing means analyzes the audio of the video supplied by the information provider and extracts the important parts (emphasized sections) of the audio. Joining and playing back the video of these important parts therefore conveys the gist of the video and a strong impression of the information provider, which achieves the object of the invention, namely summarizing the provided information.

[0014] The data center summarizes the video of the speaker, who is the information provider, to an arbitrary duration or number of scenes. The information user (for example, a user who uses this summary information providing method for recruiting) can therefore view the summary video at the desired duration or number of scenes, which achieves the object of reducing the employer's recruiting workload. The information user views the video of a desired information provider narrowed down to its main parts, in less time than the original video takes. The information provider's video can therefore make a strong impression on the information user, which achieves the object of making the search activity more efficient.

[0015] The information provider's self-promotion video is viewed by the information user. The information provider can therefore appeal to the information user by means other than text information, which achieves the object of enabling search activity that does not depend on text information.

[0016]

[Embodiments of the Invention] First, the methods used in this invention for extracting speech sub-paragraphs, for extracting speech paragraphs, and for obtaining the probability of the emphasized state and the probability of the calm state for each speech sub-paragraph are described. FIG. 17 shows the basic procedure of an embodiment of the previously proposed speech summarization method. In step S1, the input audio signal is analyzed to obtain speech feature quantities. In step S2, the speech sub-paragraphs of the input audio signal, and the speech paragraphs each composed of several speech sub-paragraphs, are extracted. In step S3, the utterance state of the frames that make up each speech sub-paragraph is judged to be either calm or emphasized. Based on this judgment, the summary speech is created in step S4.

[0017] An example in which natural spoken language and conversational speech are summarized is described below. The speech feature quantities used are ones that, compared with spectral information and the like, can be obtained stably even in noisy environments and depend little on the speaker. From the input audio signal, the fundamental frequency (f0), the power (p), the temporal variation characteristic of the dynamic feature quantity of the speech (d), and the pause duration (silent section) (ps) are extracted as speech feature quantities. Methods for extracting these feature quantities are described, for example, in "Acoustics and Acoustic Engineering" (Sadaoki Furui, Kindai Kagaku-sha, 1998), "Speech Coding" (Takehiro Moriya, Institute of Electronics, Information and Communication Engineers, 1998), "Digital Speech Processing" (Sadaoki Furui, Tokai University Press, 1985), and "A Study on Speech Analysis Algorithms Based on the Composite Sinusoidal Model" (Shigeki Sagayama, doctoral thesis, 1998). The temporal variation of the dynamic feature quantity of speech is a parameter that serves as a measure of speech rate, and the one described in Japanese Patent No. 2976998 may be used. That is, the temporal variation characteristic of the LPC spectrum coefficients, which reflect the spectral envelope, is obtained as the dynamic variation, and a speech rate coefficient is obtained from that temporal variation. More specifically, the LPC spectrum coefficients C1(t), ..., Ck(t) are extracted for each frame, and the dynamic feature quantity d (dynamic measure) is obtained by the following equation:

$$d(t) = \sum_{i=1}^{k} \left[ \frac{\sum_{f=t-f_0}^{t+f_0} \left[ f \times C_i(t) \right]}{\sum_{f=t-f_0}^{t+f_0} f^2} \right]^2$$

Here, f0 is the number of speech-section frames before and after the current frame (it need not be an integer number of frames; a fixed time interval may be used instead), k is the order of the LPC spectrum, and i = 1, 2, ..., k. As the speech rate coefficient, the number of local maxima of the change in the dynamic feature quantity per unit time, or its rate of change per unit time, is used.
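As a rough illustration of the speech rate coefficient mentioned at the end of paragraph [0017], the following Python sketch counts the local maxima of an already computed dynamic measure sequence and divides by the elapsed time. The function names, the definition of a local maximum on a plateau, and the use of the frame shift to convert frames to seconds are assumptions made for this sketch; the computation of d(t) itself is not attempted here.

```python
from typing import Sequence


def count_local_maxima(d: Sequence[float]) -> int:
    """Count interior points that rise from the left and do not rise to the right.

    On a flat plateau only the first point is counted (an assumed convention).
    """
    return sum(
        1
        for t in range(1, len(d) - 1)
        if d[t - 1] < d[t] >= d[t + 1]
    )


def speech_rate_coefficient(d: Sequence[float], frame_shift_seconds: float) -> float:
    """Local maxima of the dynamic measure per second of analyzed signal."""
    duration = len(d) * frame_shift_seconds
    return count_local_maxima(d) / duration


if __name__ == "__main__":
    measure = [0.0, 1.0, 0.0, 2.0, 2.0, 1.0, 3.0, 0.0]
    print(count_local_maxima(measure))
    print(speech_rate_coefficient(measure, frame_shift_seconds=0.05))
```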

[0018] In this embodiment, for example, one frame is 100 ms long and the frame shift is 50 ms. The average fundamental frequency of each frame (f0′) is obtained, and likewise the average power of each frame (p′). The differences between f0′ of the current frame and f0′ of the frames i frames before and after it are taken as ±Δf0′i (Δ components). Similarly for power, the differences ±Δp′i (Δ components) between p′ of the current frame and p′ of the frames i frames before and after it are obtained. f0′, ±Δf0′i, p′, and ±Δp′i are then normalized. For example, f0′ and ±Δf0′i are each normalized by dividing by the average fundamental frequency of the entire speech waveform; the normalized values are written f0″ and ±Δf0″i. p′ and ±Δp′i are likewise normalized by dividing by the average power of the entire speech waveform that is subject to utterance state determination; alternatively, they may be divided by the average power of each speech sub-paragraph or speech paragraph described later. The normalized values are written p″ and ±Δp″i. The value of i is, for example, i = 4. The number of peaks of the dynamic measure within ±T1 ms of the current frame, that is, the number dp of local maxima of the change in the dynamic feature quantity, is counted. The Δ component (−Δdp) between this dp and the dp of the frame whose interval contains the time T2 ms before the start time of the current frame is obtained, as is the Δ component (+Δdp) between the dp for ±T1 ms and the dp of the frame whose interval contains the time T3 ms after the end time of the current frame. T1, T2, and T3 are, for example, T1 = T2 = T3 = 450 ms. The durations of the silent sections before and after the frame are denoted ±ps. In step S1, the values of these speech feature parameters are extracted for each frame.
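The framing, Δ components, and normalization described in paragraph [0018] might be sketched in Python as follows, using power as the example feature. The function names, the sign convention of the Δ values, the clamping of indices at the ends of the signal, and the use of the mean of the per-frame values as a stand-in for the whole-waveform average are all assumptions of this sketch.

```python
from typing import List, Sequence, Tuple


def frame_powers(
    samples: Sequence[float],
    rate_hz: int,
    frame_ms: int = 100,
    shift_ms: int = 50,
) -> List[float]:
    """Average power of each frame (100 ms frames, 50 ms shift by default)."""
    frame_len = rate_hz * frame_ms // 1000
    shift = rate_hz * shift_ms // 1000
    powers = []
    for start in range(0, len(samples) - frame_len + 1, shift):
        frame = samples[start:start + frame_len]
        powers.append(sum(x * x for x in frame) / frame_len)
    return powers


def normalized_with_deltas(
    values: Sequence[float], i: int = 4
) -> List[Tuple[float, float, float]]:
    """Per frame: (normalized value, backward delta, forward delta).

    Deltas compare the current frame with the frames i positions before and
    after it; indices are clamped at the signal ends (assumed behavior).
    Everything is divided by the mean of the per-frame values.
    """
    mean = sum(values) / len(values)
    last = len(values) - 1
    out = []
    for t, v in enumerate(values):
        prev = values[max(t - i, 0)]
        nxt = values[min(t + i, last)]
        out.append((v / mean, (v - prev) / mean, (nxt - v) / mean))
    return out


if __name__ == "__main__":
    powers = frame_powers([2.0] * 300, rate_hz=1000)
    print(powers)
    print(normalized_with_deltas([1.0, 2.0, 3.0], i=1))
```

The same `normalized_with_deltas` helper could be applied to a per-frame fundamental frequency track, provided one is available from a separate pitch estimator.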

[0019] FIG. 18 shows an example of the method used in step S2 to extract the speech sub-paragraphs and speech paragraphs of the input speech. Here, the speech sub-paragraph is the unit on which utterance state determination is performed. In step S201, the silent sections and voiced sections of the input audio signal are extracted. A frame is judged to be a silent section if, for example, its power is at or below a predetermined power value, and it is judged to be a voiced section if, for example, its correlation function is at or above a predetermined correlation function value. The voiced/unvoiced decision is often made from the peak value of the autocorrelation function or of the modified correlation function, by treating it as equivalent to the periodic/aperiodic distinction. The modified correlation function is the autocorrelation function of the prediction residual obtained by removing the spectral envelope from the short-time spectrum of the input signal. The voiced/unvoiced decision is made according to whether the peak of the modified correlation function exceeds a predetermined threshold, and the pitch period 1/f0 (fundamental frequency f0) is extracted from the delay time that gives that peak. Details of how these sections are extracted are described, for example, in "Digital Speech Processing" (Sadaoki Furui, Tokai University Press, 1985). Although the analysis of each speech feature quantity frame by frame from the audio signal has been described here, coefficients that have already been obtained by coding or the like, or feature quantities corresponding to codes, may instead be read from the codebook used for the coding.
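The silent and voiced decisions of step S201 can be sketched in Python as below. For simplicity the sketch uses the plain normalized autocorrelation of the frame rather than the modified correlation function of the prediction residual, which is a deliberate simplification. The function names, the lag search range, and the threshold values are illustrative assumptions.

```python
from typing import Sequence, Tuple


def is_silent(frame: Sequence[float], power_threshold: float) -> bool:
    """A frame is silent if its average power is at or below the threshold."""
    power = sum(x * x for x in frame) / len(frame)
    return power <= power_threshold


def autocorr_peak(
    frame: Sequence[float], min_lag: int, max_lag: int
) -> Tuple[float, int]:
    """Largest normalized autocorrelation value in [min_lag, max_lag] and its lag."""
    energy = sum(x * x for x in frame)
    if energy == 0.0:
        return 0.0, min_lag
    best_val, best_lag = -1.0, min_lag
    for lag in range(min_lag, max_lag + 1):
        r = sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag)) / energy
        if r > best_val:
            best_val, best_lag = r, lag
    return best_val, best_lag


def is_voiced(
    frame: Sequence[float], min_lag: int, max_lag: int, corr_threshold: float = 0.5
) -> bool:
    """A frame is voiced if the autocorrelation peak reaches the threshold."""
    peak, _ = autocorr_peak(frame, min_lag, max_lag)
    return peak >= corr_threshold


if __name__ == "__main__":
    # Square-like wave with a period of 20 samples.
    wave = [1.0 if (n % 20) < 10 else -1.0 for n in range(200)]
    peak, lag = autocorr_peak(wave, 10, 60)
    print(peak, lag)  # the lag of the peak is the pitch period in samples
    print(is_voiced(wave, 10, 60), is_silent([0.0] * 10, 1e-6))
```

Dividing the sampling rate by the lag returned from `autocorr_peak` gives an estimate of the fundamental frequency for a voiced frame.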

[0020] In step S202, when the silent sections on either side of a voiced section are each t seconds or longer, the portion that contains the voiced section and is bounded by those silent sections is taken as a speech sub-paragraph. The value of t is, for example, t = 400 ms. In step S203, the average power of the voiced sections within the speech sub-paragraph, preferably those in its latter half, is compared with a constant β times the average power value BA of the speech sub-paragraph. If the former is smaller, the speech sub-paragraph is taken as a final speech sub-paragraph, and the speech sub-paragraphs from the one following the previous final speech sub-paragraph up to the final speech sub-paragraph just detected are determined to be a speech paragraph.
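Step S202 can be illustrated with a short Python sketch that works on per-frame silent and voiced flags. With a 50 ms frame shift, t = 400 ms corresponds to 8 frames. The function name, the half-open (start, end) index pairs, and the decision to ignore speech that is not bounded by long silence on both sides are assumptions of this sketch.

```python
from typing import List, Sequence, Tuple


def speech_sub_paragraphs(
    silent: Sequence[bool],
    voiced: Sequence[bool],
    min_silent_frames: int,
) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, of speech sub-paragraphs."""
    n = len(silent)
    # Locate runs of silence that are long enough to act as boundaries.
    boundaries = []
    t = 0
    while t < n:
        if silent[t]:
            start = t
            while t < n and silent[t]:
                t += 1
            if t - start >= min_silent_frames:
                boundaries.append((start, t))
        else:
            t += 1
    # Keep the portions between consecutive boundaries that contain voiced frames.
    result = []
    for (_, left_end), (right_start, _) in zip(boundaries, boundaries[1:]):
        if any(voiced[left_end:right_start]):
            result.append((left_end, right_start))
    return result


if __name__ == "__main__":
    silent = [True] * 3 + [False] * 4 + [True] + [False] * 2 + [True] * 3 \
        + [False] * 2 + [True] * 3
    voiced = [False] * 3 + [True] * 4 + [False] + [True] * 2 + [False] * 8
    print(speech_sub_paragraphs(silent, voiced, min_silent_frames=3))
```

In the demo, the single silent frame at index 7 is too short to split the first sub-paragraph, and the non-silent stretch at frames 13 and 14 is dropped because it contains no voiced frame.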

[0021] FIG. 19 schematically shows voiced sections, speech sub-paragraphs, and speech paragraphs. Speech sub-paragraphs are extracted under the condition given above, namely that the silent sections surrounding a voiced section last t seconds. FIG. 19 shows speech sub-paragraphs j−1, j, and j+1. Speech sub-paragraph j consists of n voiced sections, and its average power is Pj. As a typical example of a voiced section, the voiced section v contained in speech sub-paragraph j has an average power pv. Speech paragraph k is extracted from speech sub-paragraph j and the power of the voiced sections that make up the latter part of the speech sub-paragraph. When the mean of the average powers pi of the voiced sections i = n−α through n is small relative to the average power Pj of speech sub-paragraph j, that is, when

$$\frac{\sum_{i=n-\alpha}^{n} p_i}{\alpha + 1} < \beta P_j \qquad (1)$$

is satisfied, speech sub-paragraph j is taken to be the final speech sub-paragraph of speech paragraph k. α and β in equation (1) are constants, and speech paragraphs are extracted by adjusting them. In this embodiment, α = 3 and β = 0.8. With the final speech sub-paragraphs as delimiters, the group of speech sub-paragraphs between adjacent final speech sub-paragraphs can thus be determined to be a speech paragraph.
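The test in equation (1) and the resulting grouping into speech paragraphs can be sketched in Python as follows, with α = 3 and β = 0.8 as defaults. The function names, the input layout, the handling of sub-paragraphs with fewer than α + 1 voiced sections (the mean is taken over however many exist), and the treatment of trailing sub-paragraphs as a last paragraph are assumptions of this sketch.

```python
from typing import List, Sequence, Tuple


def is_final_sub_paragraph(
    voiced_powers: Sequence[float],
    sub_paragraph_power: float,
    alpha: int = 3,
    beta: float = 0.8,
) -> bool:
    """Equation (1): mean power of the last alpha + 1 voiced sections < beta * Pj."""
    tail = voiced_powers[-(alpha + 1):]
    return sum(tail) / len(tail) < beta * sub_paragraph_power


def group_into_paragraphs(
    sub_paragraphs: Sequence[Tuple[Sequence[float], float]],
    alpha: int = 3,
    beta: float = 0.8,
) -> List[List[int]]:
    """Group sub-paragraph indices into paragraphs, closing one at each final sub-paragraph.

    sub_paragraphs: (average powers of the voiced sections, average power Pj)
    for each speech sub-paragraph, in time order.
    """
    paragraphs: List[List[int]] = []
    current: List[int] = []
    for index, (voiced_powers, avg_power) in enumerate(sub_paragraphs):
        current.append(index)
        if is_final_sub_paragraph(voiced_powers, avg_power, alpha, beta):
            paragraphs.append(current)
            current = []
    if current:
        # Assumed: leftover sub-paragraphs at the end form a last paragraph.
        paragraphs.append(current)
    return paragraphs


if __name__ == "__main__":
    subs = [
        ([1.0, 1.0, 1.0, 1.0], 1.0),
        ([1.0, 0.5, 0.5, 0.5, 0.5], 1.0),
        ([2.0, 2.0], 1.5),
    ]
    print(group_into_paragraphs(subs))
```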

[0022] FIG. 20 shows an example of the method used in step S3 of FIG. 17 to determine the utterance state of a speech sub-paragraph. In step S301, the speech feature quantities of the input speech sub-paragraph are vector-quantized. For this purpose, a codebook that stores at least two quantized speech feature quantities (codes) is created in advance. The speech feature quantities stored in the codebook are matched against the speech feature quantities of the input speech, or of speech that has already been analyzed, and the usual practice is to identify the quantized speech feature quantity in the codebook that minimizes the distortion (distance) between the feature quantities.
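The codebook lookup described in paragraph [0022] amounts to a nearest-neighbor search. A minimal Python sketch is given below; the function name `quantize` and the use of squared Euclidean distance as the distortion measure are assumptions, since the text only requires that the distortion (distance) be minimized.

```python
from typing import Sequence


def quantize(feature: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the code vector closest to the feature vector."""

    def distortion(code: Sequence[float]) -> float:
        # Squared Euclidean distance (assumed distortion measure).
        return sum((a - b) ** 2 for a, b in zip(feature, code))

    return min(range(len(codebook)), key=lambda k: distortion(codebook[k]))


if __name__ == "__main__":
    codes = [(0.0, 0.0), (1.0, 1.0), (2.0, 0.0)]
    print(quantize((0.9, 1.2), codes))
    print(quantize((1.8, 0.1), codes))
```

The returned index would then be used to look up the appearance probabilities stored with that code.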

[0023] FIG. 21 shows an example of how this codebook is created. Test subjects listen to a large amount of training speech and label the portions whose utterance state is calm and the portions whose utterance state is emphasized (S501). For example, the subjects gave the following reasons for judging part of an utterance to be in the emphasized state:
(a) The voice is loud, and nouns or conjunctions are drawn out.
(b) The start of the utterance is drawn out to assert a change of topic, or the voice is raised as if to bring opinions together.
(c) The voice is made louder and higher to emphasize important nouns and the like.
(d) The pitch is high, but the voice is not particularly loud.
(e) The speaker, with a wry smile, glosses over true feelings out of fluster.
(f) The end of a phrase rises in pitch, as if seeking agreement from or putting a question to those around.
(g) The voice becomes louder at the end of a phrase, slowly and forcefully, as if to drive the point home.
(h) The voice is loud and high, asserting an interruption, and louder than the other speaker.
(i) The speaker states true feelings or secrets that one would hesitate to say loudly, or a person who normally speaks loudly says something important (for example, in a quiet, mumbling or whispering tone).
In this example, the calm state is any utterance that falls under none of (a) to (i) above and that the subject felt to be calm.

【００２４】Although the object judged for emphasis has been described above as an utterance, the emphasized state can also be identified in music. When the emphasized state is to be identified from the voice in a piece of music with vocals, the reasons given for perceiving emphasis are:
(a) the voice is loud and high;
(b) the voice is powerful;
(c) the voice is high and strongly accented;
(d) the voice is high and the voice quality changes;
(e) the voice is drawn out and loud;
(f) the voice is loud, high, and strongly accented;
(g) the voice is loud, high, and shouting;
(h) the voice is high and the accent changes;
(i) the voice is drawn out and loud, and the end of the phrase is high;
(j) the voice is high and drawn out;
(k) the voice is drawn out, shouting, and high;
(l) the end of the phrase rises and is powerful;
(m) the delivery is slow and strong;
(n) the melody is irregular;
(o) the melody is irregular and the voice is high.
The emphasized state can also be identified in purely instrumental music without vocals. The reasons given for perceiving emphasis in that case are:
(a) the power of the entire emphasized portion increases;
(b) the difference between high and low pitches is large;
(c) the power increases;
(d) the number of instruments changes;
(e) the melody or tempo changes.

【００２５】By creating a codebook on this basis, music as well as utterances can be summarized. For each labeled section of the calm state and of the emphasized state, speech feature quantities are extracted (S502) in the same way as in step S1 of FIG. 17, and parameters are selected (S503). Using these parameters from the calm-state and emphasized-state labeled sections, a codebook is created with the LBG algorithm (S504). The LBG algorithm is described in, for example, Y. Linde, A. Buzo and R. M. Gray, "An algorithm for vector quantizer design," IEEE Trans. Commun., vol. COM-28, pp. 84-95, 1980. The codebook size is variable and is a power of two (2^n entries). The codebook is preferably created from speech feature quantities normalized by the feature quantities of each speech sub-paragraph, of each suitably chosen longer section, or of the entire training speech.
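A compact sketch of LBG-style codebook training is given below to show why the codebook size comes out as a power of two: the codebook starts as a single centroid and every codeword is split in two on each pass. This is a simplified illustration under stated assumptions, not the procedure of the cited paper in full: it uses squared Euclidean distortion, a fixed number of refinement iterations instead of a distortion-based stopping rule, and hypothetical names (`lbg`, `eps`, `iters`).

```python
from typing import List

Vector = List[float]


def _mean(vectors: List[Vector]) -> Vector:
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def _nearest(v: Vector, codebook: List[Vector]) -> int:
    return min(range(len(codebook)),
               key=lambda k: sum((a - b) ** 2 for a, b in zip(v, codebook[k])))


def lbg(train: List[Vector], n_bits: int, eps: float = 0.01, iters: int = 20) -> List[Vector]:
    """Grow a codebook to 2**n_bits entries by repeated splitting and refinement."""
    codebook = [_mean(train)]
    for _ in range(n_bits):
        # Split every codeword into a slightly perturbed pair.
        codebook = [[c * (1 + s * eps) + s * eps for c in cw]
                    for cw in codebook for s in (+1, -1)]
        for _ in range(iters):
            # Assign each training vector to its nearest codeword ...
            cells: List[List[Vector]] = [[] for _ in codebook]
            for v in train:
                cells[_nearest(v, codebook)].append(v)
            # ... then move each codeword to the centroid of its cell.
            codebook = [_mean(cell) if cell else cw
                        for cell, cw in zip(cells, codebook)]
    return codebook


# Two well-separated clusters of normalized feature vectors (toy data).
train = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10], [10, 11], [11, 10], [11, 11]]
two_codes = lbg(train, n_bits=1)
```

With `n_bits=1` the two codewords settle on the two cluster centroids; larger `n_bits` values double the codebook on each pass.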

【００２６】In step S301 of FIG. 20, this codebook is used: the speech feature quantities of the input speech sub-paragraph are normalized for each feature quantity, and the normalized feature quantities are matched against the codebook (vector-quantized) frame by frame to obtain a code (quantized speech feature quantity) for each frame. The speech feature quantities extracted from the input speech signal are the same parameters as those used to create the codebook. To identify speech sub-paragraphs that contain an emphasized state, the codes in each speech sub-paragraph are used to obtain the likelihood of the utterance state for both the calm state and the emphasized state. For this purpose, the appearance probability of each code (quantized speech feature quantity) is obtained in advance for the calm state and for the emphasized state, and each appearance probability is stored in the codebook paired with its code. An example of how these appearance probabilities are obtained follows. Suppose the codes (one per frame) of the speech feature quantities in one labeled section of the training speech used to create the codebook are, in time order, Ci, Cj, Ck, …, Cn. Let Pα(e) be the probability that labeled section α is in the emphasized state and Pα(n) the probability that it is in the calm state. Then

$$P_\alpha(e) = P_{emp}(C_i)\,P_{emp}(C_j \mid C_i) \cdots P_{emp}(C_n \mid C_i \cdots C_{n-1}) = P_{emp}(C_i) \prod_{x=i+1}^{n} P_{emp}(C_x \mid C_i \cdots C_{x-1})$$

$$P_\alpha(n) = P_{nrm}(C_i)\,P_{nrm}(C_j \mid C_i) \cdots P_{nrm}(C_n \mid C_i \cdots C_{n-1}) = P_{nrm}(C_i) \prod_{x=i+1}^{n} P_{nrm}(C_x \mid C_i \cdots C_{x-1})$$

Here Pemp(Cx | Ci…Cx−1) is the conditional probability that Cx appears in the emphasized state following the code sequence Ci…Cx−1, and Pnrm(Cx | Ci…Cx−1) is likewise the probability that Cx appears in the calm state following Ci…Cx−1. The product Π runs from x = i+1 to n. Pemp(Ci) is obtained by quantizing the training speech frame by frame, counting the occurrences of Ci in the portions labeled as emphasized, and dividing that count by the total number of codes (frames) in all the training speech. Pnrm(Ci) is the number of occurrences of Ci in the portions labeled as calm divided by the total number of codes.
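The counting rule for the independent appearance probabilities Pemp(Ci) and Pnrm(Ci) can be illustrated with a short sketch. The label strings `"emp"` and `"nrm"`, the function name, and the toy code sequences are assumptions made for the example; only the rule itself (count within one state's labeled sections, divide by the total number of codes in all training speech) comes from the text.

```python
from collections import Counter
from typing import Dict, List, Tuple


def unigram_probs(
    sections: List[Tuple[str, List[int]]]
) -> Tuple[Dict[int, float], Dict[int, float]]:
    """Return (P_emp, P_nrm) per code from labeled training sections."""
    total = sum(len(codes) for _, codes in sections)  # all codes (frames)
    emp: Counter = Counter()
    nrm: Counter = Counter()
    for label, codes in sections:
        (emp if label == "emp" else nrm).update(codes)
    p_emp = {c: k / total for c, k in emp.items()}
    p_nrm = {c: k / total for c, k in nrm.items()}
    return p_emp, p_nrm


# One emphasized section (3 frames) and one calm section (5 frames).
sections = [("emp", [3, 3, 7]), ("nrm", [1, 3, 1, 1, 7])]
p_emp, p_nrm = unigram_probs(sections)
```

Because both tables are divided by the same total, the emphasized and calm values for a code are directly comparable.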

【００２７】To simplify the state probabilities of labeled section α, this example uses an N-gram model (N < n) and sets

$$P_\alpha(e) = P_{emp}(C_n \mid C_{n-N+1} \cdots C_{n-1})$$

$$P_\alpha(n) = P_{nrm}(C_n \mid C_{n-N+1} \cdots C_{n-1})$$

That is, Pα(e) is the probability that Cn is obtained in the emphasized state following the N−1 preceding codes Cn−N+1…Cn−1, and Pα(n) is likewise the probability that Cn is obtained in the calm state following the same N−1 preceding codes. It is also preferable to apply linear interpolation, in which the N-gram probability is interpolated linearly with the probabilities of lower-order M-grams (N ≥ M). All such conditional probabilities are obtained from the quantized code sequences of the labeled training speech, but a code sequence obtained by quantizing the speech feature quantities of the input speech signal may have no counterpart in the training speech. For this reason, a higher-order conditional probability (one over a longer code sequence) is obtained by interpolating it with the independent appearance probability and the lower-order conditional appearance probabilities. For example, linear interpolation is applied using the trigram (N = 3), the bigram (N = 2), and the unigram (N = 1). N-grams, linear interpolation, and trigrams are described in, for example, "Spoken Language Processing" (Kenji Kita, Satoshi Nakamura, Masaaki Nagata, Morikita Publishing, 1996, p. 29). The three orders are:
N = 3 (trigram): Pemp(Cn | Cn−2 Cn−1), Pnrm(Cn | Cn−2 Cn−1)
N = 2 (bigram): Pemp(Cn | Cn−1), Pnrm(Cn | Cn−1)
N = 1 (unigram): Pemp(Cn), Pnrm(Cn)
Using the three appearance probabilities of Cn in the emphasized state and the three in the calm state, Pemp(Cn | Cn−2 Cn−1) and Pnrm(Cn | Cn−2 Cn−1) are calculated by the following equations, in which the left-hand side is the interpolated value and the right-hand terms are the probabilities estimated from the training speech:

$$P_{emp}(C_n \mid C_{n-2} C_{n-1}) = \lambda_{emp1} P_{emp}(C_n \mid C_{n-2} C_{n-1}) + \lambda_{emp2} P_{emp}(C_n \mid C_{n-1}) + \lambda_{emp3} P_{emp}(C_n) \tag{2}$$

$$P_{nrm}(C_n \mid C_{n-2} C_{n-1}) = \lambda_{nrm1} P_{nrm}(C_n \mid C_{n-2} C_{n-1}) + \lambda_{nrm2} P_{nrm}(C_n \mid C_{n-1}) + \lambda_{nrm3} P_{nrm}(C_n) \tag{3}$$

When the trigram training data has size N, that is, when the codes C1, C2, …, CN are obtained in time order, the re-estimation formulas for λemp1, λemp2, and λemp3 are, following the reference "Spoken Language Processing" cited above:

$$\lambda_{emp1} = \frac{1}{N} \sum_{n=1}^{N} \frac{\lambda_{emp1} P_{emp}(C_n \mid C_{n-2} C_{n-1})}{\lambda_{emp1} P_{emp}(C_n \mid C_{n-2} C_{n-1}) + \lambda_{emp2} P_{emp}(C_n \mid C_{n-1}) + \lambda_{emp3} P_{emp}(C_n)}$$

$$\lambda_{emp2} = \frac{1}{N} \sum_{n=1}^{N} \frac{\lambda_{emp2} P_{emp}(C_n \mid C_{n-1})}{\lambda_{emp1} P_{emp}(C_n \mid C_{n-2} C_{n-1}) + \lambda_{emp2} P_{emp}(C_n \mid C_{n-1}) + \lambda_{emp3} P_{emp}(C_n)}$$

$$\lambda_{emp3} = \frac{1}{N} \sum_{n=1}^{N} \frac{\lambda_{emp3} P_{emp}(C_n)}{\lambda_{emp1} P_{emp}(C_n \mid C_{n-2} C_{n-1}) + \lambda_{emp2} P_{emp}(C_n \mid C_{n-1}) + \lambda_{emp3} P_{emp}(C_n)}$$

where Σ is the sum from n = 1 to N. λnrm1, λnrm2, and λnrm3 are obtained in the same way.
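The interpolation of equations (2) and (3) and one pass of the weight re-estimation can be sketched as below. The function names and the toy probability triples are hypothetical; the sketch also assumes the denominator is positive at every training position (positions where it is zero are skipped, which the source does not discuss).

```python
from typing import List, Sequence, Tuple


def interpolate(lams: Sequence[float], p3: float, p2: float, p1: float) -> float:
    """Linear interpolation of trigram (p3), bigram (p2) and unigram (p1) values."""
    return lams[0] * p3 + lams[1] * p2 + lams[2] * p1


def reestimate(
    lams: Sequence[float], probs: List[Tuple[float, float, float]]
) -> Tuple[float, float, float]:
    """One re-estimation pass; probs holds (trigram, bigram, unigram) per position."""
    n = len(probs)
    new = [0.0, 0.0, 0.0]
    for p in probs:
        denom = sum(l * q for l, q in zip(lams, p))
        if denom == 0.0:
            continue  # assumption: skip positions where no order gives support
        for k in range(3):
            new[k] += lams[k] * p[k] / denom
    return (new[0] / n, new[1] / n, new[2] / n)


# Start from equal weights and re-estimate once on two training positions.
start = (1 / 3, 1 / 3, 1 / 3)
probs = [(0.5, 0.3, 0.2), (0.0, 0.4, 0.1)]
updated = reestimate(start, probs)
smoothed = interpolate((0.5, 0.3, 0.2), 0.4, 0.2, 0.1)
```

In the second training position the trigram was never observed, so the update shifts weight away from the trigram term toward the bigram term. Repeating `reestimate` until the weights stop changing is the usual way such formulas are applied.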

【００２８】In this example, when the codes obtained over the Nα frames of labeled section α are Ci1, Ci2, …, CiNα, the probability Pα(e) that the section is in the emphasized state and the probability Pα(n) that it is in the calm state are

$$P_\alpha(e) = P_{emp}(C_{i_3} \mid C_{i_1} C_{i_2}) \cdots P_{emp}(C_{i_{N_\alpha}} \mid C_{i_{N_\alpha - 2}} C_{i_{N_\alpha - 1}}) \tag{4}$$

$$P_\alpha(n) = P_{nrm}(C_{i_3} \mid C_{i_1} C_{i_2}) \cdots P_{nrm}(C_{i_{N_\alpha}} \mid C_{i_{N_\alpha - 2}} C_{i_{N_\alpha - 1}}) \tag{5}$$

So that this calculation can be performed, the trigram, bigram, and unigram probabilities described above are obtained for every code and stored in the codebook. In other words, the codebook stores, for each code, its speech feature quantity together with its appearance probability in the emphasized state and, in this example, its appearance probability in the calm state. The appearance probability in the emphasized state is either (i) only the probability that the speech feature quantity appears in the emphasized state regardless of the speech feature quantities of past frames (the unigram, referred to as the independent appearance probability), or (ii) a combination of that probability with the conditional probabilities that the speech feature quantity appears in the emphasized state, given for each frame-by-frame sequence of speech feature quantities running from past frames up to the current frame. The appearance probability in the calm state is defined in the same way: either only the independent appearance probability in the calm state, or a combination of it with the corresponding conditional probabilities in the calm state.

【００２９】For example, as shown in FIG. 10, the codebook stores as a set, for each code C1, C2, …, its speech feature quantity, its independent appearance probability in the emphasized state and in the calm state, and its conditional probabilities in the emphasized state and in the calm state. In step S302 of FIG. 20, the likelihood of the utterance state is obtained for the calm state and for the emphasized state from the probabilities stored in the codebook for the codes of all frames of the input speech sub-paragraph. FIG. 23 is a schematic diagram of the embodiment, showing the first through fourth frames of a speech sub-paragraph that starts at time t. As stated above, the frame length here is 100 ms and the frame shift is 50 ms, so the frame length is longer than the shift. Code Ci is obtained for frame number f (time t to t+100), code Cj for frame number f+1 (time t+50 to t+150), code Ck for frame number f+2 (time t+100 to t+200), and code Cl for frame number f+3 (time t+150 to t+250). With the codes Ci, Cj, Ck, Cl in frame order, the trigram can be calculated for frame number f+2 and later. Let Ps(e) be the probability that speech sub-paragraph s is in the emphasized state and Ps(n) the probability that it is in the calm state. The probabilities up to the fourth frame are then

$$P_s(e) = P_{emp}(C_k \mid C_i C_j)\, P_{emp}(C_l \mid C_j C_k) \tag{6}$$

$$P_s(n) = P_{nrm}(C_k \mid C_i C_j)\, P_{nrm}(C_l \mid C_j C_k) \tag{7}$$

In this example, the independent appearance probabilities of Ck and Cl in the emphasized and calm states, the conditional probabilities that Ck appears after Cj (and likewise Cl after Ck) in each state, and the conditional probabilities that Ck appears after Ci Cj and that Cl appears after Cj Ck in each state are read from the codebook, giving:

$$P_{emp}(C_k \mid C_i C_j) = \lambda_{emp1} P_{emp}(C_k \mid C_i C_j) + \lambda_{emp2} P_{emp}(C_k \mid C_j) + \lambda_{emp3} P_{emp}(C_k) \tag{8}$$

$$P_{emp}(C_l \mid C_j C_k) = \lambda_{emp1} P_{emp}(C_l \mid C_j C_k) + \lambda_{emp2} P_{emp}(C_l \mid C_k) + \lambda_{emp3} P_{emp}(C_l) \tag{9}$$

$$P_{nrm}(C_k \mid C_i C_j) = \lambda_{nrm1} P_{nrm}(C_k \mid C_i C_j) + \lambda_{nrm2} P_{nrm}(C_k \mid C_j) + \lambda_{nrm3} P_{nrm}(C_k) \tag{10}$$

$$P_{nrm}(C_l \mid C_j C_k) = \lambda_{nrm1} P_{nrm}(C_l \mid C_j C_k) + \lambda_{nrm2} P_{nrm}(C_l \mid C_k) + \lambda_{nrm3} P_{nrm}(C_l) \tag{11}$$

Using equations (8) to (11), the probability Ps(e) of the emphasized state and the probability Ps(n) of the calm state up to the fourth frame, given by equations (6) and (7), are obtained. Pemp(Ck | Ci Cj) and Pnrm(Ck | Ci Cj) can be calculated at frame number f+2.
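The four-frame calculation of equations (6) to (11) can be worked through numerically. The sketch below is illustrative only: the class name `StateModel`, the dictionary layout, and every probability and weight value are made up for the example, and unseen n-grams are assumed to contribute zero to the interpolation.

```python
from typing import Dict, Sequence, Tuple


class StateModel:
    """Trigram, bigram and unigram tables for one utterance state, plus weights."""

    def __init__(self,
                 tri: Dict[Tuple[str, str, str], float],
                 bi: Dict[Tuple[str, str], float],
                 uni: Dict[str, float],
                 lams: Tuple[float, float, float]) -> None:
        self.tri, self.bi, self.uni, self.lams = tri, bi, uni, lams

    def cond(self, c2: str, c1: str, c: str) -> float:
        """Interpolated P(c | c2 c1), in the form of equations (8) to (11)."""
        l1, l2, l3 = self.lams
        return (l1 * self.tri.get((c2, c1, c), 0.0)
                + l2 * self.bi.get((c1, c), 0.0)
                + l3 * self.uni.get(c, 0.0))

    def section_prob(self, codes: Sequence[str]) -> float:
        """Product over frames 3..N, in the form of equations (6) and (7)."""
        p = 1.0
        for k in range(2, len(codes)):
            p *= self.cond(codes[k - 2], codes[k - 1], codes[k])
        return p


codes = ["Ci", "Cj", "Ck", "Cl"]
emp = StateModel(tri={("Ci", "Cj", "Ck"): 0.5, ("Cj", "Ck", "Cl"): 0.4},
                 bi={("Cj", "Ck"): 0.3, ("Ck", "Cl"): 0.2},
                 uni={"Ck": 0.1, "Cl": 0.1},
                 lams=(0.5, 0.3, 0.2))
nrm = StateModel(tri={("Ci", "Cj", "Ck"): 0.2, ("Cj", "Ck", "Cl"): 0.1},
                 bi={("Cj", "Ck"): 0.2, ("Ck", "Cl"): 0.3},
                 uni={"Ck": 0.3, "Cl": 0.2},
                 lams=(0.5, 0.3, 0.2))
ps_e = emp.section_prob(codes)  # 0.36 * 0.28
ps_n = nrm.section_prob(codes)  # 0.22 * 0.18
```

With these made-up tables the emphasized-state value is the larger of the two, so the four-frame segment would be judged emphasized.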

【００３０】In this example, when the codes obtained over the Ns frames of speech sub-paragraph s are Ci1, Ci2, …, CiNs, the probability Ps(e) that the sub-paragraph is in the emphasized state and the probability Ps(n) that it is in the calm state are calculated as

$$P_s(e) = P_{emp}(C_{i_3} \mid C_{i_1} C_{i_2}) \cdots P_{emp}(C_{i_{N_s}} \mid C_{i_{N_s - 2}} C_{i_{N_s - 1}})$$

$$P_s(n) = P_{nrm}(C_{i_3} \mid C_{i_1} C_{i_2}) \cdots P_{nrm}(C_{i_{N_s}} \mid C_{i_{N_s - 2}} C_{i_{N_s - 1}})$$

In this example, the speech sub-paragraph s is judged to be in the emphasized state if Ps(e) > Ps(n), and in the calm state if Ps(n) > Ps(e).
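One practical point the text does not address: a sub-paragraph of many frames multiplies many values below one, and the product can underflow to zero in floating point, leaving Ps(e) and Ps(n) indistinguishable. A common workaround, shown here as an assumption rather than as the patent's method, is to compare sums of logarithms, which preserves the ordering of the products. The function names are hypothetical.

```python
import math
from typing import Sequence


def log_prob(cond_probs: Sequence[float]) -> float:
    """Sum of logs of per-frame conditional probabilities (log of their product)."""
    total = 0.0
    for p in cond_probs:
        if p <= 0.0:
            return float("-inf")  # a zero factor makes the whole product zero
        total += math.log(p)
    return total


def is_emphasized(emp_conds: Sequence[float], nrm_conds: Sequence[float]) -> bool:
    """Decision rule Ps(e) > Ps(n), evaluated in the log domain."""
    return log_prob(emp_conds) > log_prob(nrm_conds)


short_case = is_emphasized([0.36, 0.28], [0.22, 0.18])
long_case = is_emphasized([0.2] * 2000, [0.1] * 2000)
```

For the 2000-frame case a direct product of either list would underflow to 0.0, whereas the log-domain comparison still separates the two states.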

【００３１】FIG. 24 shows an embodiment of a speech emphasis state determination device and a speech summarization device that use the speech sub-paragraph extraction method, the speech paragraph extraction method, and the method of obtaining the emphasized-state and calm-state probabilities for each speech sub-paragraph, all described above. The input speech (input speech signal) whose emphasis state is to be determined, or from which a summary is to be detected, is supplied to input unit 11. Input unit 11 also includes, as needed, a function for converting the input speech signal into a digital signal. The digitized speech signal is stored in storage unit 12 as needed. Speech feature extraction unit 13 extracts the speech feature quantities described above for each frame. The extracted feature quantities are normalized by their mean values as needed. Quantization unit 14 quantizes the feature quantities of each frame with reference to codebook 15, and the quantized feature quantities are sent to emphasis probability calculation unit 16 and calm probability calculation unit 17. Codebook 15 is, for example, as shown in FIG. 22.

【００３２】Emphasis probability calculation unit 16 calculates the appearance probability of the quantized speech feature quantity in the emphasized state using the corresponding probabilities stored in codebook 15, for example by equation (8) or (9). Similarly, calm probability calculation unit 17 calculates the appearance probability of the quantized speech feature quantity in the calm state using the corresponding probabilities stored in codebook 15, for example by equation (10) or (11). The appearance probabilities in the emphasized state and in the calm state calculated for each frame by units 16 and 17, together with the speech feature quantities of each frame, are stored in storage unit 12 along with the frame number assigned to each frame.

【００３３】These units operate in sequence under the control of control unit 19. In the embodiment of the speech summarization device, the blocks drawn with broken lines in FIG. 24 are added to the blocks drawn with solid lines. That is, the speech feature quantities of each frame stored in storage unit 12 are sent to silent section determination unit 21 and voiced section determination unit 22. Silent section determination unit 21 determines for each frame whether it belongs to a silent section, and voiced section determination unit 22 determines for each frame whether it belongs to a voiced section. Both determination results are input to speech sub-paragraph determination unit 23. Based on them, and as explained for the method embodiment above, speech sub-paragraph determination unit 23 determines that a portion containing a voiced section and bounded by silent sections lasting a predetermined number of consecutive frames is a speech sub-paragraph. The determination result is written to storage unit 12 and appended to the speech data sequence stored there, so that each group of frames bounded by silent sections is given a speech sub-paragraph number. The determination result is also input to final speech sub-paragraph determination unit 24.
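One possible reading of the sub-paragraph rule above is sketched here: frames are cut wherever a run of silent frames reaches a minimum length, and only the pieces that contain at least one voiced frame are kept. The per-frame boolean inputs, the parameter `min_silence`, and the treatment of the start and end of the signal are assumptions; the text fixes none of these details.

```python
from typing import List, Tuple


def sub_paragraphs(silent: List[bool], voiced: List[bool],
                   min_silence: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges, end exclusive, of speech sub-paragraphs."""
    segments: List[Tuple[int, int]] = []
    start = None  # first frame of the segment being built
    run = 0       # length of the current run of silent frames
    for i, is_silent in enumerate(silent):
        if is_silent:
            run += 1
            if start is not None and run == min_silence:
                # Close the segment just before the silent run began.
                segments.append((start, i - run + 1))
                start = None
        else:
            run = 0
            if start is None:
                start = i
    if start is not None:
        segments.append((start, len(silent) - run))
    # Keep only segments that contain a voiced frame.
    return [(a, b) for a, b in segments if any(voiced[a:b])]


T, F = True, False
silent = [T, T, F, F, T, F, T, T, T, F, F, T, T, T]
voiced = [F, F, T, T, F, T, F, F, F, F, F, F, F, F]
found = sub_paragraphs(silent, voiced, min_silence=3)
```

In the toy input the single silent frame at index 4 is too short to split the first segment, and the second non-silent stretch is dropped because it contains no voiced frame.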

【００３４】Final speech sub-paragraph determination unit 24 detects final speech sub-paragraphs, for example by the technique described with reference to FIG. 19. Its determination result is input to speech paragraph determination unit 25, which determines that a portion containing the speech sub-paragraphs lying between two final speech sub-paragraphs is a speech paragraph. This determination result is also written to storage unit 12, and a speech paragraph number is assigned to the sequence of speech sub-paragraph numbers stored there. When the device operates as a speech summarization device, emphasis probability calculation unit 16 and calm probability calculation unit 17 read from storage unit 12 the emphasis probability and calm probability of each frame making up each speech sub-paragraph, and the probabilities for each speech sub-paragraph are calculated, for example by equations (8) and (10). Emphasis state determination unit 18 compares the calculated probabilities for each speech sub-paragraph to determine whether that sub-paragraph is in the emphasized state. Summary section extraction unit 26 extracts every speech paragraph that contains at least one speech sub-paragraph determined to be in the emphasized state. Each unit is controlled by control unit 19.

【００３５】It will be understood from the above that a speech waveform can be separated into speech sub-paragraphs and speech paragraphs, and that the probability of the emphasized state and the probability of the calm state can be calculated for each speech sub-paragraph. Embodiments of the speech processing method, speech processing device, and speech processing program according to the present invention, which use the methods described above, are explained below. FIG. 25 shows the basic procedure of an embodiment of the speech processing method of the invention. In this embodiment, step S11 executes the speech emphasis probability calculation process to obtain the emphasis probability and the calm probability of each speech sub-paragraph.

【００３６】Step S12 is the summary condition input step. In this step, for example, the user is presented with information prompting entry of a summary time, a summary rate, or a compression rate, and enters one of them. Alternatively, an input method may be adopted in which one value is selected from several preset summary times, summary rates, or compression rates. Step S13 repeatedly changes the extraction condition to determine an extraction condition that satisfies the summary time, summary rate, or compression rate entered in summary condition input step S12.

【００３７】Step S14 is the summary extraction step. Using the extraction condition determined in extraction condition changing step S13, this step determines the speech paragraphs to be adopted and calculates their total duration. Step S15 executes the summary playback process, playing back the sequence of speech paragraphs extracted in summary extraction step S14. FIG. 26 shows the details of the speech emphasis probability calculation step shown in FIG. 25. In step S101, the speech waveform sequence to be summarized is separated into speech sub-paragraphs.

【００３８】In step S102, speech paragraphs are extracted from the sequence of speech sub-paragraphs separated in step S101. As explained with reference to FIG. 19, a speech paragraph consists of one or more speech sub-paragraphs and is a unit whose meaning can be understood. In steps S103 and S104, for each speech sub-paragraph extracted in step S101, the codebook described with reference to FIG. 22 and equations (8), (10), and so on above are used to obtain the probability Ps(e) that the sub-paragraph is in the emphasized state (hereinafter, the emphasis probability) and the probability Ps(n) that it is in the calm state (hereinafter, the calm probability).

【００３９】In step S105, the emphasis probability Ps(e), the calm probability Ps(n), and related values obtained for each speech sub-paragraph in steps S103 and S104 are sorted by sub-paragraph and stored in storage means as a speech emphasis probability table. FIG. 27 shows an example of the speech emphasis probability table stored in the storage means. F1, F2, F3, … in FIG. 27 denote sub-paragraph probability storage sections that record the emphasis probability Ps(e) and the calm probability Ps(n) obtained for each speech sub-paragraph. Each of the storage sections F1, F2, F3, … stores the speech sub-paragraph number i assigned to each speech sub-paragraph S, the start time (measured from the beginning of the language sequence), the end time, the sub-paragraph emphasis probability, the sub-paragraph calm probability, the number of frames fn making up the sub-paragraph, and so on.
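The table entries listed above map naturally onto a small record type. The sketch below is one hypothetical layout: the class and field names, the use of seconds for the times, and the sample values are all assumptions, since the text names the stored items but not their representation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SubParagraphRecord:
    """One entry (F1, F2, ...) of the speech emphasis probability table."""
    number: int     # speech sub-paragraph number i
    start: float    # start time from the beginning of the sequence (assumed seconds)
    end: float      # end time (assumed seconds)
    p_emp: float    # sub-paragraph emphasis probability Ps(e)
    p_nrm: float    # sub-paragraph calm probability Ps(n)
    n_frames: int   # number of frames fn in the sub-paragraph

    @property
    def duration(self) -> float:
        return self.end - self.start


table: List[SubParagraphRecord] = [
    SubParagraphRecord(1, 0.0, 2.5, 1e-4, 3e-4, 49),
    SubParagraphRecord(2, 3.0, 4.0, 5e-3, 2e-3, 19),
]
emphasized_numbers = [r.number for r in table if r.p_emp > r.p_nrm]
```

Keeping the start and end times with the probabilities means the total duration of any selection can be computed from the table alone.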

【００４０】The condition entered in summary condition input step S12 is either a summary rate X (the summary rate recited in claim 1), which indicates that the full length of the content is to be condensed to 1/X of its duration (X being a positive integer), or a summary time t. In response to this setting, extraction condition changing step S13 sets the weighting coefficient W to the initial value W = 1 and passes it to summary extraction step S14. With W = 1, summary extraction step S14 compares the emphasis probability Ps(e) and the calm probability Ps(n) stored for each speech sub-paragraph in the speech emphasis probability table, extracts the speech sub-paragraphs that satisfy W·Ps(e) > Ps(n), further extracts every speech paragraph that contains at least one of the extracted sub-paragraphs, and obtains the total duration MT (minutes) of the extracted sequence of speech paragraphs.

【００４１】The total duration MT (minutes) of the extracted speech paragraph sequence is compared with the summary time YT (minutes) set by the summary condition. If MT ≈ YT (for example, the error of MT relative to YT is within about ± a few percent), the adopted speech paragraph sequence is played back as the summary speech as it is. If the error between the summary time YT and the total duration MT exceeds the specified range and MT > YT, the total duration of the extracted speech paragraph sequence is judged to be longer than the summary time, and extraction condition changing step S13 shown in FIG. 25 is executed again. On receiving the judgment that the total duration MT extracted with W = 1 is "longer" than the summary time YT, extraction condition changing step S13 weights the emphasis probability by multiplying Ps(e) by a weighting coefficient W smaller than its current value, giving W·Ps(e) (in the case of the predetermined coefficient recited in claim 1, the coefficient is made larger than its current value). The weighting coefficient is obtained, for example, as W = 1 − 0.001 × K, where K is the number of loop iterations.

【００４２】つまり、音声強調確率テーブルから読み出
した音声段落列の全ての音声小段落で求められている強
調確率Ｐｓ（ｅ）の配列に１回目のループではＷ＝１−
０．００１×１で決まる重み係数Ｗ＝０．９９９を乗算
し、重み付けを施す。この重み付けされた全ての各音声
小段落の強調確率Ｗ・Ｐｓ（ｅ）と各音声小段落の平静
確率Ｐｓ（ｎ）とを比較し、Ｗ・Ｐｓ（ｅ）＞Ｐｓ
（ｎ）の関係にある音声小段落を抽出する。この抽出結
果に従って要約抽出ステップＳ１４では抽出された音声
小段落を含む音声段落を抽出し、要約音声段落列を再び
求める。これと共に、この要約音声段落列の総延長時間
ＭＴ（分）を算出し、この総延長時間ＭＴ（分）と要約
条件で定められる要約時間ＹＴ（分）とを比較する。比
較の結果がＭＴ≒ＹＴであれば、その音声段落列を要約
音声と決定し、再生する。That is, in the first loop, the array of emphasis probabilities Ps(e) obtained for all speech sub-paragraphs of the paragraph sequence read from the speech emphasis probability table is multiplied by the weighting coefficient W = 0.999, given by W = 1 - 0.001 × 1. The weighted emphasis probability W·Ps(e) of every sub-paragraph is compared with that sub-paragraph's calm probability Ps(n), and the sub-paragraphs satisfying W·Ps(e) > Ps(n) are extracted. Based on this result, the summary extraction step S14 extracts the speech paragraphs containing the extracted sub-paragraphs and obtains the summary speech paragraph sequence again. The total duration MT (minutes) of this sequence is calculated and compared with the summary time YT (minutes) set by the summary condition. If MT ≈ YT, the sequence is adopted as the summary speech and played back.

【００４３】１回目の重み付け処理の結果が依然として
ＭＴ＞ＹＴであれば抽出条件変更ステップを、２回目の
ループとして実行させる。このとき重み係数ＷはＷ＝１
−０．００１×２で求める。全ての強調確率Ｐｓ（ｅ）
にＷ＝０．９９８の重み付けを施す。このように、ルー
プの実行を繰り返す毎にこの例では重み係数Ｗの値を徐
々に小さくするように抽出条件を変更していくことによ
りＷＰｓ（ｅ）＞Ｐｓ（ｎ）の条件を満たす音声小段落
の数を漸次減らすことができる。これにより要約条件を
満たすＭＴ≒ＹＴの状態を検出することができる。If the result of the first weighting pass is still MT > YT, the extraction condition changing step is executed as a second loop. This time the weighting coefficient is W = 1 - 0.001 × 2, so every emphasis probability Ps(e) is weighted by W = 0.998. In this example, the extraction condition is changed so that W becomes slightly smaller on each pass of the loop, which gradually reduces the number of speech sub-paragraphs satisfying W·Ps(e) > Ps(n). This makes it possible to find the state MT ≈ YT that satisfies the summary condition.
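
The loop over the weighting coefficient can be sketched as below. This is an illustrative sketch, assuming a 5 % tolerance for "MT ≈ YT" (the text only says a few percent) and a hypothetical function name `summarize`. It covers only the MT > YT case: it stops as soon as MT no longer exceeds the target by more than the tolerance.

```python
def summarize(subs, durations, target, tol=0.05, step=0.001, max_loops=1000):
    """Shrink W by `step` per loop until MT no longer exceeds target*(1 + tol).

    subs:      list of (paragraph_id, ps_e, ps_n), one per speech sub-paragraph
    durations: paragraph_id -> duration in minutes
    target:    summary time YT in minutes
    Returns (paragraph_ids, MT, W) of the first pass that is short enough.
    """
    for k in range(max_loops):
        w = 1.0 - step * k                                    # W = 1 - 0.001 * K
        chosen = sorted({p for p, e, n in subs if w * e > n})  # W*Ps(e) > Ps(n)
        mt = sum(durations[p] for p in chosen)
        if mt <= target * (1.0 + tol):                        # no longer too long
            return chosen, mt, w
    raise RuntimeError("no suitable W found within max_loops")

# Made-up example: paragraph 1 is only marginally emphasized and drops out at K = 2.
chosen, mt, w = summarize(
    [(0, 0.5, 0.4), (1, 0.5, 0.4992), (2, 0.5, 0.2)],
    {0: 2, 1: 3, 2: 1},
    target=3,
)
```

Because whole paragraphs drop out at once, MT changes in jumps; a pass can therefore land below the target rather than exactly on it.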

【００４４】尚、上述では要約時間ＭＴの収束条件とし
てＭＴ≒ＹＴとしたが、厳密にＭＴ＝ＹＴに収束させる
こともできる。この場合には要約条件に例えば５秒不足
している場合、あと１つの音声段落を加えると１０秒超
過してしまうが、音声段落から５秒のみ再生することで
利用者の要約条件に一致させることができる。また、こ
の５秒は強調と判定された音声小段落の付近の５秒でも
よいし、音声段落の先頭から５秒でもよい。また、上述
した初期状態でＭＴ＜ＹＴと判定された場合は重み係数
Ｗを現在値よりも小さく例えばＷ＝１−０．００１×Ｋ
として求め、この重み係数Ｗを平静確率Ｐｓ（ｎ）の配
列に乗算し、平静確率Ｐｓ（ｎ）に重み付けを施せばよ
い。また、他の方法としては初期状態でＭＴ＞ＹＴと判
定された場合に重み係数を現在値より大きくＷ＝１＋
０．００１×Ｋとし、この重み係数Ｗを平静確率Ｐｓ
（ｎ）の配列に乗算してもよい。Although the convergence condition for the total duration MT was taken above to be MT ≈ YT, it is also possible to converge strictly to MT = YT. For example, if the summary is 5 seconds short of the summary condition and adding one more speech paragraph would overshoot by 10 seconds, playing back only 5 seconds of that paragraph matches the user's summary condition exactly. Those 5 seconds may be taken from around the speech sub-paragraph judged to be emphasized, or from the beginning of the speech paragraph. If MT < YT is determined in the initial state described above, the weighting coefficient W is made smaller than its current value, for example W = 1 - 0.001 × K, and the array of calm probabilities Ps(n) is multiplied by W to weight the calm probability. Alternatively, if MT > YT is determined in the initial state, the weighting coefficient may be made larger than its current value, W = 1 + 0.001 × K, and the array of calm probabilities Ps(n) multiplied by it.
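
The strict MT = YT case can be illustrated with a small sketch. It assumes sections are given as (start, end) pairs in seconds and implements only one of the two options mentioned, namely keeping the leading part of the last paragraph; `trim_to_target` is a hypothetical name.

```python
def trim_to_target(paragraphs, target):
    """Cut (start, end) sections, in seconds, so their total length is at most `target`.

    Sections are kept in order; the first one that does not fit entirely is
    shortened so that only its leading part is played back.
    """
    kept, remaining = [], target
    for start, end in paragraphs:
        if remaining <= 0:
            break
        length = min(end - start, remaining)
        kept.append((start, start + length))
        remaining -= length
    return kept

# 20 s are already selected and the target is 25 s: only 5 s of the next paragraph are used.
sections = trim_to_target([(0, 20), (50, 65)], 25)
```

Taking the 5 seconds from around the emphasized sub-paragraph instead would need the position of that sub-paragraph inside the paragraph, which this sketch does not model.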

【００４５】また、要約再生ステップＳ１５では要約抽
出ステップＳ１４で抽出した音声段落列を再生するもの
として説明したが、音声付の画像情報の場合、要約音声
として抽出した音声段落に対応した画像情報を切り出し
てつなぎ合わせ、音声と共に再生することによりテレビ
放送の要約、あるいは映画の要約等を行うことができ
る。また、上述では音声強調確率テーブルに格納した各
音声小段落毎に求めた強調確率又は平静確率のいずれか
一方に直接重み係数Ｗを乗算して重み付けを施すことを
説明したが、強調状態を精度良く検出するためには重み
係数Ｗに各音声小段落を構成するフレームの数Ｆ乗して
ＷFとして重み付けを行うことが望ましい。The summary playback step S15 was described as playing back the speech paragraph sequence extracted in the summary extraction step S14. For image information accompanied by speech, the image information corresponding to the speech paragraphs extracted as summary speech can be cut out, joined, and played back together with the speech, producing a summary of a television broadcast, a movie, or the like. The description above also applied the weighting by multiplying W directly into either the emphasis probability or the calm probability obtained for each speech sub-paragraph and stored in the speech emphasis probability table. To detect the emphasized state accurately, however, it is preferable to raise W to the power F, the number of frames making up each speech sub-paragraph, and weight by W^F.

【００４６】つまり、式（８）及び式（１０）で算出す
る条件付の強調確率Ｐｓ（ｅ）は各フレーム毎に求めた
強調状態となる確率の積を求めている。また平静状態と
なる確率Ｐｓ（ｎ）も各フレーム毎に算出した平静状態
となる確率の積を求めている。従って、例えば強調確率
Ｐｓ（ｅ）に重み付けを施すには各フレーム毎に求めた
強調状態となる確率毎に重み付け係数Ｗを乗算すれば正
しい重み付けを施したことになる。この場合には音声小
段落を構成するフレーム数をＦとすれば重み係数ＷはＷ
Fとなる。That is, the conditional emphasis probability Ps(e) calculated by equations (8) and (10) is the product of the per-frame probabilities of being in the emphasized state. Likewise, the calm probability Ps(n) is the product of the per-frame probabilities of being in the calm state. To weight Ps(e) correctly, therefore, each per-frame emphasized-state probability should be multiplied by the weighting coefficient W. If the speech sub-paragraph consists of F frames, the overall weighting coefficient is then W^F.
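
The relation between per-frame weighting and the factor W^F can be checked numerically. The sketch below assumes only what the paragraph states, that the sub-paragraph probability is a product of per-frame probabilities; the function names and sample values are made up for illustration.

```python
import math

def weight_per_frame(frame_probs, w):
    """Multiply each per-frame probability by w, then take the product over frames."""
    return math.prod(w * p for p in frame_probs)

def weight_closed_form(frame_probs, w):
    """Equivalent form: W**F times the unweighted product, F = number of frames."""
    return w ** len(frame_probs) * math.prod(frame_probs)

frame_probs = [0.9, 0.8, 0.7, 0.6]      # made-up per-frame emphasized-state probabilities
a = weight_per_frame(frame_probs, 0.999)
b = weight_closed_form(frame_probs, 0.999)
```

Both forms agree up to floating-point rounding, and since W < 1 the factor W^F shrinks as F grows, so longer sub-paragraphs are affected more strongly.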

【００４７】この結果、フレームの数Ｆに応じて重み付
けの影響が増減され、フレーム数の多い音声小段落ほ
ど、つまり延長時間が長い音声小段落程大きい重みが付
されることになる。但し、単に強調状態を判定するため
の抽出条件を変更すればよいのであれば各フレーム毎に
求めた強調状態となる確率の積又は平静状態となる積に
重み係数Ｗを乗算するだけでも抽出条件の変更を行うこ
とができる。従って、必ずしも重み付け係数ＷをＷFと
する必要はない。As a result, the effect of the weighting scales with the number of frames F: a speech sub-paragraph with more frames, that is, a longer one, receives a larger weight. However, if the aim is merely to change the extraction condition used to judge the emphasized state, simply multiplying W into the product of the per-frame emphasized-state probabilities, or into the product of the per-frame calm-state probabilities, is enough to change the extraction condition. The weighting coefficient therefore does not necessarily have to be W^F.

【００４８】また、上述では抽出条件の変更手段として
音声小段落毎に求めた強調確率Ｐｓ（ｅ）又は平静確率
Ｐｓ（ｎ）に重み付けを施してＰｓ（ｅ）＞Ｐｓ（ｎ）
を満たす音声小段落の数を変化させる方法を採ったが、
他の方法として全ての音声小段落の強調確率Ｐｓ（ｅ）
と平静確率Ｐｓ（ｎ）に関してその確率比Ｐｓ（ｅ）／
Ｐｓ（ｎ）を演算し、この確率比の降順に対応する音声
信号区間（音声小段落）を累積して要約区間の和を算出
し、要約区間の時間の総和が、略所定の要約時間に合致
する場合、そのときの音声信号区間を要約区間と決定し
て要約音声を編成する方法も考えられる。The extraction condition was changed above by weighting the emphasis probability Ps(e) or the calm probability Ps(n) obtained for each speech sub-paragraph, which changes the number of sub-paragraphs satisfying Ps(e) > Ps(n). As another method, the probability ratio Ps(e)/Ps(n) can be computed for every speech sub-paragraph and the corresponding speech signal sections (speech sub-paragraphs) accumulated in descending order of this ratio to obtain the sum of the summary sections. When the total time of the summary sections approximately matches the specified summary time, the speech signal sections accumulated so far are fixed as the summary sections and the summary speech is assembled from them.
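
The ratio-based alternative can be sketched as a ranking followed by accumulation. The stopping rule used here (stop before the total would exceed the target) is an assumption, since the text only requires the total to match the summary time approximately; `select_by_ratio` and the tuple layout are hypothetical.

```python
def select_by_ratio(subs, target):
    """Accumulate sub-paragraphs in descending Ps(e)/Ps(n) until the next would exceed `target`.

    subs: list of (sub_id, duration, ps_e, ps_n), with ps_n > 0.
    Returns (selected ids in ranked order, total duration).
    """
    ranked = sorted(subs, key=lambda s: s[2] / s[3], reverse=True)
    selected, total = [], 0
    for sub_id, duration, _ps_e, _ps_n in ranked:
        if total + duration > target:
            break
        selected.append(sub_id)
        total += duration
    return selected, total

# Made-up example: ratios are c = 9, a = 4, b = 2, d = 1; durations are in seconds.
selected, total = select_by_ratio(
    [("a", 10, 0.8, 0.2), ("b", 5, 0.6, 0.3), ("c", 8, 0.9, 0.1), ("d", 4, 0.5, 0.5)],
    target=20,
)
```

For playback the selected sections would normally be re-sorted into time order; that step is omitted here.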

【００４９】この場合、編成した要約音声の総延長時間
が要約条件で設定した要約時間に対して過不足が生じた
場合には、強調状態にあると判定するための確率比Ｐｓ
（ｅ）／Ｐｓ（ｎ）の値を選択する閾値を変更すれば抽
出条件を変更することができる。この抽出条件変更方法
を採る場合には要約条件を満たす要約音声を編成するま
での処理を簡素化することができる利点が得られる。上
述では各音声小段落毎に求める強調確率Ｐｓ（ｅ）と平
静確率Ｐｓ（ｎ）を各フレーム毎に算出した強調状態と
なる確率の積及び平静状態となる確率の積で算出するも
のとして説明したが、他の方法として各フレーム毎に求
めた強調状態となる確率の平均値を求め、この平均値を
その音声小段落の強調確率Ｐｓ（ｅ）及び平静確率Ｐｓ
（ｎ）として用いることもできる。In this case, if the total duration of the assembled summary speech is longer or shorter than the summary time set by the summary condition, the extraction condition can be changed by changing the threshold applied to the probability ratio Ps(e)/Ps(n) used to judge the emphasized state. This method of changing the extraction condition has the advantage of simplifying the processing needed to assemble summary speech that satisfies the summary condition. The emphasis probability Ps(e) and calm probability Ps(n) of each speech sub-paragraph were described above as the product of the per-frame emphasized-state probabilities and the product of the per-frame calm-state probabilities, respectively. As another method, the averages of the per-frame emphasized-state and calm-state probabilities can be taken and used as Ps(e) and Ps(n) of the speech sub-paragraph.

【００５０】従って、この強調確率Ｐｓ（ｅ）及び平静
確率Ｐｓ（ｎ）の算出方法を採る場合には重み付けに用
いる重み付け係数Ｗはそのまま強調確率Ｐｓ（ｅ）又は
平静確率Ｐｓ（ｎ）に乗算すればよい。図２８を用いて
要約率を自由に設定することができる音声処理装置の実
施例を示す。この実施例では図２４に示した音声強調状
態要約装置の構成に要約条件入力部３１と、音声強調確
率テーブル３２と、強調小段落抽出部３３と、抽出条件
変更部３４と、要約区間仮判定部３５と、この要約区間
仮判定部３５の内部に要約音声の総延長時間を求める総
延長時間算出部３５Ａと、この総延長時間算出部３５Ａ
が算出した要約音声の総延長時間が要約条件入力部３１
で入力した要約時間の設定の範囲に入っているか否かを
判定する要約区間決定部３５Ｂと、要約条件に合致した
要約音声を保存し、再生する要約音声保存・再生部３５
Ｃを設けた構成とした点を特徴とするものである。When Ps(e) and Ps(n) are calculated in this way, the weighting coefficient W can therefore be multiplied directly into Ps(e) or Ps(n). FIG. 28 shows an embodiment of a speech processing device in which the summarization rate can be set freely. This embodiment is characterized by adding the following to the configuration of the speech emphasized-state summarizing device shown in FIG. 24: a summary condition input unit 31; a speech emphasis probability table 32; an emphasized sub-paragraph extraction unit 33; an extraction condition changing unit 34; a provisional summary section determination unit 35; and, inside unit 35, a total duration calculation unit 35A that obtains the total duration of the summary speech, a summary section determination unit 35B that judges whether the total duration calculated by unit 35A falls within the range of the summary time entered at the summary condition input unit 31, and a summary speech storage/playback unit 35C that stores and plays back summary speech matching the summary condition.

【００５１】入力音声は図２３で説明したように、フレ
ーム毎に音声特徴量が求められ、この音声特徴量に従っ
て強調確率計算部１６と平静確率計算部１７でフレーム
毎に強調確率と、平静確率とを算出し、これら強調確率
と平静確率を各フレームに付与したフレーム番号と共に
記憶部１２に格納する。更に、このフレーム列番号に音
声小段落判定部で判定した音声小段落列に付与した音声
小段落列番号が付記され、各フレーム及び音声小段落に
アドレスが付与される。この発明による音声処理装置で
は強調確率算出部１６と平静確率算出部１７は記憶部１
２に格納している各フレームの強調確率と平静確率を読
み出し、この強調確率及び平静確率から各音声小段落毎
に強調確率Ｐｓ（ｅ）と平静確率Ｐｓ（ｎ）とを求め、
これら強調確率Ｐｓ（ｅ）と平静確率Ｐｓ（ｎ）を音声
強調テーブル３２に格納する。As described with reference to FIG. 23, a speech feature is obtained for each frame of the input speech. From these features, the emphasis probability calculation unit 16 and the calm probability calculation unit 17 calculate an emphasis probability and a calm probability for each frame, and these are stored in the storage unit 12 together with the frame number assigned to each frame. In addition, the speech sub-paragraph sequence number assigned to each sub-paragraph determined by the speech sub-paragraph determination unit is appended to the frame sequence number, so that every frame and speech sub-paragraph is given an address. In the speech processing device of this invention, the emphasis probability calculation unit 16 and the calm probability calculation unit 17 read the per-frame emphasis and calm probabilities from the storage unit 12, derive from them the emphasis probability Ps(e) and calm probability Ps(n) of each speech sub-paragraph, and store Ps(e) and Ps(n) in the speech emphasis table 32.

【００５２】音声強調テーブル３２には各種のコンテン
ツの音声波形の音声小段落毎に求めた強調確率と平静確
率とが格納され、いつでも利用者の要求に応じて要約が
実行できる体制が整えられている。利用者は要約条件入
力部３１に要約条件を入力する。ここで言う要約条件と
は要約したいコンテンツの名称と、そのコンテンツの全
長時間に対する要約率を指す。要約率としてはコンテン
ツの全長を１／１０に要約するか、或は時間で１０分に
要約するなどの入力方法が考えられる。ここで例えば１
／１０と入力した場合は要約時間算出部３１Ａはコンテ
ンツの全長時間を１／１０した時間を算出し、その算出
した要約時間を要約区間仮判定部３５の要約区間決定部
３５Ｂに送り込む。The speech emphasis table 32 stores the emphasis and calm probabilities obtained for each speech sub-paragraph of the speech waveforms of various contents, so that a summary can be produced at any time on user request. The user enters a summary condition at the summary condition input unit 31. The summary condition here consists of the name of the content to be summarized and the summarization rate relative to the total length of that content. The summarization rate may be entered, for example, as a fraction (condense the content to 1/10 of its length) or as a time (condense it to 10 minutes). If 1/10 is entered, for example, the summary time calculation unit 31A calculates one tenth of the total length of the content and sends the calculated summary time to the summary section determination unit 35B of the provisional summary section determination unit 35.
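
The conversion performed by the summary time calculation unit 31A can be sketched as below. The function name `summary_time` and its keyword arguments are assumptions made for illustration; the text only states that either a ratio such as 1/10 or a time such as 10 minutes may be entered.

```python
def summary_time(total_minutes, rate=None, minutes=None):
    """Return the target summary time in minutes.

    rate:    X in "condense to 1/X of the total length" (e.g. 10 for 1/10)
    minutes: an explicit summary time
    Exactly one of the two must be given.
    """
    if (rate is None) == (minutes is None):
        raise ValueError("give exactly one of rate or minutes")
    if rate is not None:
        if rate <= 0:
            raise ValueError("rate must be positive")
        return total_minutes / rate
    return float(minutes)

by_rate = summary_time(120, rate=10)      # a 120-minute content condensed to 1/10
by_time = summary_time(120, minutes=10)   # the same content condensed to 10 minutes
```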

【００５３】要約条件入力部３１に要約条件が入力され
たことを受けて制御部１９は要約音声の生成動作を開始
する。その開始の作業としては音声強調テーブル３２か
ら利用者が希望したコンテンツの強調確率と平静確率を
読み出す。読み出された強調確率と平静確率を強調小段
落抽出部３３に送り込み、強調状態にあると判定される
音声小段落番号を抽出する。強調状態にある音声区間を
抽出するための条件を変更する方法としては上述した強
調確率Ｐｓ（ｅ）又は平静確率Ｐｓ（ｎ）に確率比の逆
数となる重み付け係数Ｗを乗算しＷ・Ｐｓ（ｅ）＞Ｐｓ
（ｎ）の関係にある音声小段落を抽出し、音声小段落を
含む音声段落により要約音声を得る方法と、確率比Ｐｓ
（ｅ）／Ｐｓ（ｎ）を算出し、この確率比を降順に累算
して要約時間を得る方法とを用いることができる。When a summary condition is entered at the summary condition input unit 31, the control unit 19 starts generating the summary speech. It first reads the emphasis and calm probabilities of the content requested by the user from the speech emphasis table 32. These are sent to the emphasized sub-paragraph extraction unit 33, which extracts the numbers of the speech sub-paragraphs judged to be in the emphasized state. Two methods are available for changing the condition under which emphasized speech sections are extracted. In the first, described above, Ps(e) or Ps(n) is multiplied by a weighting coefficient W that is the reciprocal of the probability ratio, the speech sub-paragraphs satisfying W·Ps(e) > Ps(n) are extracted, and the summary speech is formed from the speech paragraphs containing those sub-paragraphs. In the second, the probability ratio Ps(e)/Ps(n) is calculated and the sub-paragraphs are accumulated in descending order of the ratio until the summary time is reached.

【００５４】抽出条件の初期値としては重み付けにより
抽出条件を変更する場合には重み付け係数ＷをＷ＝１と
して初期値とすることが考えられる。また、各音声小段
落毎に求めた強調確率Ｐｓ（ｅ）と平静確率Ｐｓ（ｎ）
の確率比Ｐｓ（ｅ）／Ｐｓ（ｎ）の値に応じて強調状態
と判定する場合は初期値としてその比の値が例えばＰｓ
（ｅ）／Ｐｓ（ｎ）≧１である場合を強調状態と判定す
ることが考えられる。この初期設定状態で強調状態と判
定された音声小段落番号と開始時刻、終了時刻を表わす
データを強調小段落抽出部３３から要約区間仮判定部３
５に送り込む。要約区間仮判定部３５では強調状態と判
定した強調小段落番号を含む音声段落を記憶部１２に格
納している音声段落列から検索し、抽出する。抽出した
音声段落列の総延長時間を総延長時間算出部３５Ａで算
出し、その総延長時間と要約条件で入力された要約時間
とを要約区間決定部３５Ｂで比較する。比較の結果が要
約条件を満たしていれば、その音声段落列を要約音声保
存・再生部３５Ｃで保存し、再生する。この再生動作は
強調小段落抽出部３３で強調状態と判定された音声小段
落の番号から音声段落を抽出し、その音声段落の開始時
刻と終了時刻の指定により各コンテンツの音声データ或
は映像データを読み出して要約音声及び要約映像データ
として送出する。As the initial extraction condition, when the condition is changed by weighting, the weighting coefficient may be initialized to W = 1. When the emphasized state is instead judged from the probability ratio Ps(e)/Ps(n) of each speech sub-paragraph, the initial condition may be, for example, to judge a sub-paragraph emphasized when Ps(e)/Ps(n) ≥ 1. The numbers, start times, and end times of the speech sub-paragraphs judged emphasized under this initial setting are sent from the emphasized sub-paragraph extraction unit 33 to the provisional summary section determination unit 35. Unit 35 searches the speech paragraph sequence stored in the storage unit 12 for the speech paragraphs containing the emphasized sub-paragraph numbers and extracts them. The total duration calculation unit 35A calculates the total duration of the extracted paragraph sequence, and the summary section determination unit 35B compares it with the summary time entered as the summary condition. If the result satisfies the summary condition, the paragraph sequence is stored and played back by the summary speech storage/playback unit 35C. For playback, the speech paragraphs are identified from the numbers of the sub-paragraphs judged emphasized by unit 33, and the speech data or video data of the content is read out by specifying each paragraph's start and end times and sent out as summary speech and summary video data.

【００５５】要約区間決定部３５Ｂで要約条件を満たし
ていないと判定した場合は、要約区間決定部３５Ｂから
抽出条件変更部３４に抽出条件の変更指令を出力し、抽
出条件変更部３４に抽出条件の変更を行わせる。抽出条
件変更部３４は抽出条件の変更を行い、その抽出条件を
強調小段落抽出部３３に入力する。強調小段落抽出部３
３は抽出条件変更部３４から入力された抽出条件に従っ
て再び音声強調確率テーブル３２に格納されている各音
声小段落の強調確率と平静確率との比較判定を行う。If the summary section determination unit 35B judges that the summary condition is not satisfied, it issues an extraction condition change command to the extraction condition changing unit 34. Unit 34 changes the extraction condition and supplies the new condition to the emphasized sub-paragraph extraction unit 33. Following that condition, unit 33 again compares the emphasis probability and the calm probability of each speech sub-paragraph stored in the speech emphasis probability table 32.

【００５６】強調小段落抽出部３３の抽出結果は再び要
約区間仮判定部３５に送り込まれ、強調状態と判定され
た音声小段落を含む音声段落の抽出を行わせる。この抽
出された音声段落の総延長時間を算出し、その算出結果
が要約条件を満たすか否かを要約区間決定部３５Ｂで行
う。この動作が要約条件を満たすまで繰り返され、要約
条件が満たされた音声段落列が要約音声及び要約映像デ
ータとして記憶部１２から読み出されユーザ端末に配信
される。以上により音声波形を音声小段落及び音声段落
に分離する方法及び各音声小段落毎に強調状態となる確
率及び平静状態となる確率を算出できること及び音声の
要約率を自由に変更して任意の長さの要約音声を得るこ
とができることが理解できよう。The extraction result of the emphasized sub-paragraph extraction unit 33 is again sent to the provisional summary section determination unit 35, which extracts the speech paragraphs containing the speech sub-paragraphs judged emphasized. The total duration of the extracted speech paragraphs is calculated, and the summary section determination unit 35B judges whether the result satisfies the summary condition. This operation is repeated until the summary condition is satisfied, and the speech paragraph sequence that satisfies it is read from the storage unit 12 as summary speech and summary video data and delivered to the user terminal. The foregoing shows how a speech waveform is divided into speech sub-paragraphs and speech paragraphs, how the emphasized-state and calm-state probabilities can be calculated for each speech sub-paragraph, and how the summarization rate can be changed freely to obtain summary speech of any length.

【００５７】尚、上述では要約区間の開始時刻及び終了
時刻を要約区間と判定した音声段落列の開始時刻及び終
了時刻として取り出すことを説明したが、映像付のコン
テンツの場合は要約区間と判定した音声段落列の開始時
刻と終了時刻に接近した映像信号のカット点を例えば特
開平８−３２９２４号公報記載の手段で検出し、このカ
ット点（画面の切替わりに発生する信号を利用する）の
時刻で要約区間の開始時刻及び終了時刻を規定する方法
も考えられる。このように映像信号のカット点を要約区
間の開始時刻及び終了時刻に利用した場合は、要約区間
の切替わりが画像の切替わりに同期するため、視覚上で
視認性が高まり要約の理解度を向上できる利点が得られ
る。In the description above, the start and end times of a summary section were taken to be the start and end times of the speech paragraph sequence judged to be the summary section. For content with video, another possibility is to detect the cut points of the video signal that lie close to those start and end times, for example by the means described in Japanese Patent Laid-Open No. 8-32924, and to define the start and end of the summary section by the times of these cut points (using the signal generated when the picture switches). When video cut points are used in this way, the transitions between summary sections are synchronized with the picture transitions, which improves visual clarity and makes the summary easier to understand.
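
Aligning a summary section to video cut points can be sketched as a nearest-neighbour lookup. The sketch assumes the cut-point times have already been detected by some external means and are supplied as a list in seconds; choosing the nearest cut point is one possible reading of "close to", and `snap_to_cuts` is a hypothetical name.

```python
def snap_to_cuts(start, end, cut_points):
    """Move the start and end of a summary section to the nearest video cut points.

    cut_points: non-empty list of cut times in seconds (detection is out of scope here).
    """
    def nearest(t):
        return min(cut_points, key=lambda c: abs(c - t))
    return nearest(start), nearest(end)

# Made-up example: a section from 12.3 s to 47.8 s snaps to the cuts at 10 s and 50 s.
section = snap_to_cuts(12.3, 47.8, [0.0, 10.0, 15.0, 45.0, 50.0])
```

For a very short section both ends can snap to the same cut point; a practical version would need a rule for that case.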

【００５８】以下では上述した各方法を利用したこの発
明による要約情報提供方法、要約情報提供装置及びその
プログラムに関わる実施の形態を説明する。［実施例１］これより、実施例１として、たとえば、卒
業予定の学生の採用や、派遣社員の採用、アルバイト・
パート勤務採用などの人材発掘システムにこの発明によ
る要約情報提供方法を適用した実施例を述べる。図１
に、この発明の実施例１を示す。この発明による要約情
報提供装置は属性情報として応募者等の個人情報とその
映像付音声信号を入力する求職者登録部１００と、ネッ
トワーク２００と、データセンタ３００と、このデータ
センタ３００の出力側に設けられた出力部３０９と、採
用者発掘部４００と、課金部５００とによって構成され
る。求職者登録部１００で求職者は求職者であることを
登録する。登録データはネットワーク２００を経由し
て、データセンタ３００に送り込まれ、他の属性情報と
共にデータベースに蓄積される。Embodiments of the summary information providing method, the summary information providing apparatus, and the program according to this invention, which use the methods described above, are described below. [Embodiment 1] As Embodiment 1, the summary information providing method of this invention is applied to a human resource finding system used, for example, to recruit students about to graduate, temporary staff, and part-time workers. FIG. 1 shows Embodiment 1. The summary information providing apparatus comprises a job seeker registration unit 100, through which an applicant's personal information (the attribute information) and video-accompanied speech signal are entered; a network 200; a data center 300; an output unit 309 provided on the output side of the data center 300; an employer discovery unit 400; and a billing unit 500. A job seeker registers as such at the job seeker registration unit 100. The registration data is sent over the network 200 to the data center 300 and stored in a database together with the other attribute information.

【００５９】採用者発掘部４００では希望属性情報とし
て採用条件情報を入力し、希望属性情報を満足する属性
情報をもつ応募者の映像付音声信号の要約部分をデータ
センタ３００から受信し、これを再生した映像乃至音声
を採用者が視聴することによって求職者を選択する手が
かりとする。課金部５００はデータセンタ３００におい
て採用者へのデータ提供、求職者からのデータ入力等の
処理に伴い課金処理を行う。例えば、データセンタから
各処理に応じた課金要求信号を受けて各利用者金融口座
における金融残高から各処理に対する対価相当分を控除
したり、データ管理者の金融口座における金融残高に利
用手数料相当分を加算する。At the employer discovery unit 400, the employer enters recruitment conditions as the desired attribute information and receives from the data center 300 the summarized portions of the video-accompanied speech signals of applicants whose attribute information satisfies the desired attribute information. By viewing or listening to the played-back video and speech, the employer obtains clues for selecting job seekers. The billing unit 500 performs billing for processing at the data center 300, such as providing data to employers and accepting data from job seekers. For example, on receiving a billing request signal for a given process from the data center, it deducts the corresponding charge from the balance of the relevant user's financial account, or adds the corresponding usage fee to the balance of the data administrator's financial account.

【００６０】図２は求職者登録部１００の構成の一例を
示す。求職者登録部１００は、個人情報登録部１０１、
映像撮影部１０２、保存記録部１０３、データセンタ送
信部１０４とから構成される。個人情報登録部１０１で
求職者個人の属性情報を入力する。入力に用いる端末は
パーソナルコンピュータ、情報を入出力可能な家電製
品、携帯電話のいずれでもよい。図３は、個人情報登録
画面の典型的な例である。たとえば、ステップＳＩ１０
１−１で求職者の名前を入力し、ステップＳＩ１０１−
２で年齢を入力し、ステップＳＩ１０１−３で住所を入
力し、ステップＳＩ１０１−４で電話番号を入力し、ス
テップＳＩ１０１−５で希望する職種を選択し、ステッ
プＳＩ１０１−６で希望就業日数／週を選択し、ステッ
プＳＩ１０１−７で就業形態を選択し、ステップＳＩ１
０１−８で希望年収を選択し、ステップＳＩ１０１−９
で学歴を入力し、ステップＳＩ１０１−１０で免許など
を入力する。前記ステップＳＩ１０１−１からステップ
ＳＩ１０１−１０は全て選択式でもよく、記述入力式で
もよい。FIG. 2 shows an example configuration of the job seeker registration unit 100. It comprises a personal information registration unit 101, a video capture unit 102, a storage and recording unit 103, and a data center transmission unit 104. The job seeker's personal attribute information is entered at the personal information registration unit 101. The terminal used for entry may be a personal computer, a home appliance capable of information input and output, or a mobile phone. FIG. 3 shows a typical personal information registration screen. For example, the job seeker enters a name in step SI101-1, age in step SI101-2, address in step SI101-3, and telephone number in step SI101-4; selects the desired occupation in step SI101-5, the desired number of working days per week in step SI101-6, the form of employment in step SI101-7, and the desired annual income in step SI101-8; and enters educational background in step SI101-9 and licenses and similar qualifications in step SI101-10. Steps SI101-1 through SI101-10 may all be selection fields or free-text fields.
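
The attribute information collected in steps SI101-1 to SI101-10 could be held in a record such as the following. The class name, field names, and field types are assumptions chosen for illustration; the text leaves the actual format and the set of items open.

```python
from dataclasses import dataclass, field

@dataclass
class JobSeekerProfile:              # hypothetical name; one field per registration step
    name: str                        # SI101-1
    age: int                         # SI101-2
    address: str                     # SI101-3
    phone: str                       # SI101-4
    desired_occupation: str          # SI101-5
    desired_days_per_week: int       # SI101-6
    employment_type: str             # SI101-7
    desired_annual_income: int       # SI101-8
    education: str                   # SI101-9
    licenses: list = field(default_factory=list)   # SI101-10

# Made-up example entry.
profile = JobSeekerProfile("Taro", 22, "Tokyo", "000-0000-0000",
                           "engineer", 5, "full-time", 4_000_000, "BSc")
```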

【００６１】図３で示した個人情報登録画面の登録内容
は、任意であり、その内容に関しては後記するデータセ
ンタ３００の運営者が設定してもよく、また採用者発掘
部４００が設定してもよい。また、全て求職者の自由な
表記にしてもよい。図４は、図２に示した映像撮影部１
０２では映像信号と音声信号を同時に撮影して求職者本
人のＰＲ画像として取得する。図４Ａは撮影機１０２−
１で、自己ＰＲを録画する様子を示す。撮影機１０２−
１は、市販のビデオカメラでも、パーソナルコンピュー
タや、携帯電話に付属した動画撮影可能なカメラでもよ
い。また、ディジタルで録画していても、アナログで録
画していてもよく、ディジタル化されている場合、圧縮
されているか否かはいずれでもよく、圧縮されていた場
合、その圧縮形式はいずれのものでもよい。The items on the personal information registration screen shown in FIG. 3 are arbitrary; they may be set by the operator of the data center 300, described later, or by the employer discovery unit 400, or left entirely to the job seeker's free description. The video capture unit 102 shown in FIG. 2 captures a video signal and a speech signal simultaneously to obtain the job seeker's own PR video, as illustrated in FIG. 4. FIG. 4A shows a self-PR being recorded with the camera 102-1. The camera 102-1 may be a commercially available video camera or a video-capable camera attached to a personal computer or mobile phone. The recording may be digital or analog; if digital, it may or may not be compressed, and if compressed, any compression format may be used.

【００６２】図４Ｂは求職者が撮影機１０２−１に向っ
て自己ＰＲを行なっている様子を示す。自己ＰＲで使用
する項目は、たとえば、学歴、職歴などは、後記するデ
ータセンタ３００の運用者が設定してもよく、採用者が
設定してもよい。また、全て求職者の自由な表記にして
もよい。図５にデータセンタ３００の運用者、もしくは
採用者が設定した場合の自己ＰＲ用の項目を挙げる。た
とえば、求職者は各項目をＰＲする際、「私の名前は
…」のように項目名を発言するなどのルールを決めても
よく、また各項目毎に撮影するなどのルールを設定し
て、各項目のＰＲ開始時刻を、たとえば映像の切り替わ
りで示してもよく、あるいは、前記ルールを一切決めな
くてもよい。FIG. 4B shows a job seeker giving a self-PR toward the camera 102-1. The items covered in the self-PR, such as educational background and work history, may be set by the operator of the data center 300, described later, or by the employer, or left entirely to the job seeker. FIG. 5 lists self-PR items as set by the operator of the data center 300 or the employer. For example, a rule may be set that the job seeker states the item name when presenting each item, as in "My name is ..."; or a rule may be set that each item is recorded separately, so that the start time of each item's PR is indicated by, for example, a change of picture; or no such rule need be set at all.

【００６３】保存記録部１０３は、たとえば、パーソナ
ルコンピュータなどに撮像データをディジタル化してフ
ァイルとして保存する。この時、ディジタル化したファ
イルは圧縮されているか否かはいずれでもよく、圧縮す
る場合においても、いずれの圧縮形式でもよい。データ
センタ送信部１０４（図２）は、前記個人情報登録部１
０１で登録した個人情報と保存記録部１０３で保存した
自己ＰＲ映像を後記するデータセンタ３００へ送信す
る。送信方法としてはたとえば、ディジタル化された自
己ＰＲ映像ファイルをネットワーク２００を経由してデ
ータセンタ３００へ送信してもよい。ただし、データセ
ンタ３００で自己ＰＲビデオをディジタル化する場合、
前記保存記録部１０３におけるディジタル化して保存す
る手続きは不必要である。The storage and recording unit 103 digitizes the captured data and saves it as a file on, for example, a personal computer. The digitized file may or may not be compressed, and if compressed, any compression format may be used. The data center transmission unit 104 (FIG. 2) transmits the personal information registered at the personal information registration unit 101 and the self-PR video saved by the storage and recording unit 103 to the data center 300, described later. For example, the digitized self-PR video file may be sent to the data center 300 over the network 200. If the self-PR video is digitized at the data center 300, however, the digitizing and saving procedure in the storage and recording unit 103 is unnecessary.

【００６４】ネットワーク２００はインターネット、Ｌ
ＡＮ、電話回線、ＢＳ、ＣＳ、ＣＡＴＶのいずれでもよ
い。たとえば、インターネットプロバイダーなどのネッ
トワーク仲介者が運用したものでよい。図６はデータセ
ンタ３００の構成の一例を示す。データセンタ３００は
求職者データ入力部３０１、求職者個人情報データベー
ス３０２と、自己ＰＲ音声映像データベース３０３と、
採用条件入力部３０４と、検索部３０５と、自己ＰＲ音
声映像要約部３０６と、自己ＰＲ音声映像配信部３０７
と、採用者評価部３０８と、連絡部３０９とから構成さ
れる。The network 200 may be the Internet, a LAN, a telephone line, BS, CS, or CATV. It may, for example, be operated by a network intermediary such as an Internet service provider. FIG. 6 shows an example configuration of the data center 300. The data center 300 comprises a job seeker data input unit 301, a job seeker personal information database 302, a self-PR audio/video database 303, a recruitment condition input unit 304, a search unit 305, a self-PR audio/video summarization unit 306, a self-PR audio/video delivery unit 307, an employer evaluation unit 308, and a contact unit 309.

[0065] The job seeker data input unit 301 receives the job seeker's attribute information and self-PR audio/video file transmitted from the data center transmission unit 104 (FIG. 2) and stores them in the job seeker personal information database 302 and the self-PR audio/video database 303. The employment condition input unit 304 receives the employment condition items that serve as the desired attribute information from the employer. FIG. 7 shows an example of such employment condition items. For example, the employer specifies educational background, work experience, and the like as conditions for hiring. The employment condition items may be prepared according to some format, for example on a personal computer, or may be selected using a device such as a mouse. The input may be made, for example, from a web page on the Internet. The search unit 305 searches the job seeker personal information database 302 for attribute information that matches the desired attribute information entered by the employer in the employment condition input unit 304. If no attribute information matches the desired attribute information, the search unit retrieves the attribute information closest to it. The self-PR audio/video summarization unit 306 summarizes the self-PR video corresponding to the attribute information retrieved by the search unit 305.
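The search behavior of the search unit 305 (exact match first, otherwise the closest record) can be sketched as follows. This is a minimal illustration, not the method of the specification: the `JobSeeker` record, the attribute keys, and in particular the closeness measure (the number of desired items that match exactly) are assumptions, since the text does not define how "closest" is determined.

```python
from dataclasses import dataclass


@dataclass
class JobSeeker:
    name: str
    attributes: dict  # hypothetical keys, e.g. {"education": "university"}


def match_count(desired: dict, attributes: dict) -> int:
    """Count the desired items that the job seeker's attributes match exactly."""
    return sum(1 for key, value in desired.items() if attributes.get(key) == value)


def search(database: list, desired: dict) -> list:
    """Return all exact matches; if there are none, return the closest records.

    "Closest" here means the highest match count (an assumed measure).
    """
    if not database:
        return []
    exact = [js for js in database
             if match_count(desired, js.attributes) == len(desired)]
    if exact:
        return exact
    best = max(match_count(desired, js.attributes) for js in database)
    return [js for js in database
            if match_count(desired, js.attributes) == best]
```

A weighted measure (for example, giving work experience more weight than educational background) would fit the same structure by replacing `match_count`.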

[0066] FIG. 8 shows an example of the self-PR audio/video summarization unit 306. The self-PR audio/video summarization unit 306 comprises a summary condition input unit 306-1, a self-PR audio/video input unit 306-2, a video/audio separation unit 306-3, a self-PR item detection unit 306-4, a cut point extraction unit 306-5, a speech recognition unit 306-6, an audio/video summarization unit 306-8, and a video editing unit 306-9. The summary condition input unit 306-1 receives, for example, the desired attribute information entered by the employer, via the network 200 from a terminal corresponding to the employer excavation unit 400; alternatively, the operator of the data center 300 can set it. The input can be made, for example, on the screen of a personal computer.

[0067] FIG. 9 shows a typical example of a screen for entering the summary conditions. As the summary condition, the employer selects whether the viewing of each person's self-PR video is specified by viewing time or by the number of scenes. In the example shown in FIG. 9, step SI306-1-1 is selected and the circle on the left is checked. When viewing time is selected, the viewing time desired by the employer is entered as the summary time in step SI306-1-2, and the self-PR video of each job seeker is summarized to approximately this viewing time. In FIG. 9 it is set to 30 seconds. When the number of viewing scenes is selected, the number of scenes desired by the employer is entered in step SI306-1-4.

[0068] The self-PR video input unit 306-2 reads, from the self-PR audio/video database 303, the self-PR video of the person retrieved by the search unit 305, and the video/audio separation unit 306-3 separates the audio from the video. The original self-PR video, however, is kept with its audio, and the separated audio is stored, for example, on a hard disk or in computer memory. The audio processing described later uses the audio separated by the video/audio separation unit 306-3, while the video processing uses the self-PR video with audio.

[0069] The self-PR item detection unit 306-4 detects the self-PR items. When the PR items of the self-PR video have been set in advance by the operator of the data center 300 or by the employer, the job seeker, as described above, shoots each PR item separately and stops the camera recording between items. The video information therefore differs markedly between the frames before and after each stop, and a cut point appears in the video. The cut point extraction unit 306-5 can use these cut points to obtain the start time and end time of each PR item. Alternatively, the job seeker may create a separate self-PR video file for each PR item when registering with the data center 300. As a further alternative, the job seeker may announce each PR item at its start, for example by saying "What I most want to do in my future work is ..." for the seventh PR item in FIG. 5, so that the speech recognition unit 306-7 can obtain the start time and end time of each PR item.
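The idea of locating cut points from a marked difference between consecutive frames can be sketched as follows. The specification does not say how the frame difference is measured, so the mean absolute difference of grayscale pixel values, the threshold, and all function names below are assumptions made for illustration. Frames are assumed to be non-empty lists of equal length.

```python
def frame_difference(a, b):
    """Mean absolute difference between two equally sized grayscale frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def find_cut_points(frames, threshold):
    """Return every index i where frame i differs sharply from frame i - 1."""
    return [i for i in range(1, len(frames))
            if frame_difference(frames[i - 1], frames[i]) > threshold]


def segments_from_cuts(cuts, num_frames, fps):
    """Convert cut indices into (start_time, end_time) pairs in seconds."""
    bounds = [0] + list(cuts) + [num_frames]
    return [(start / fps, end / fps) for start, end in zip(bounds, bounds[1:])]
```

Each resulting segment would correspond to one PR item if the job seeker stopped recording exactly once between items.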

[0070] A speech recognition method is described, for example, in Japanese Patent Laid-Open No. 8-6588. The audio/video summarization unit 306-8 detects the emphasized state of the utterances with the emphasized state determination means described above, extracts audio paragraphs as units whose meaning can be understood when heard, and joins the audio paragraphs containing emphasis to generate summary audio by the method described earlier. At the same time, it cuts out the video corresponding to the summary audio sections to obtain summary video information. FIG. 10 is a schematic diagram of the self-PR video summarization means. In step SI306-8-1 the emphasis probability described above is obtained as a time series, and in step SI306-8-2 the audio paragraphs are extracted. When step SI306-8-3 finds that an extracted audio paragraph contains emphasis, the corresponding self-PR video of step SI306-8-4 becomes a candidate for the self-PR summary video. FIG. 11 shows the self-PR video summarization method for the summary condition set in the summary condition input unit 306-1 (FIG. 8), for example the condition shown in FIG. 9 in which the self-PR viewing time is 30 seconds per person. An example is described below.

[0071] In step SI306-8-5, the audio paragraph sections containing emphasized-state sections are extracted. In step SI306-8-6, the audio paragraphs are ranked by emphasis, in descending order of either the emphasis probability obtained for each audio paragraph or the ratio of the emphasis probability to the probability of the calm state. In step SI306-8-7, to produce a self-PR viewing time of, for example, 30 seconds per person, the playback times of the audio paragraphs are accumulated in order of emphasis rank, and the number of audio paragraphs is chosen so that the total comes closest to the given viewing time (30 seconds in this case). In the example shown in FIG. 11, joining three audio paragraphs (those judged acceptable) in the rank order assigned in step SI306-8-6 yields a 30-second self-PR summary video. In step SI306-8-8, the playback order of the three audio paragraphs judged acceptable in step SI306-8-7 is decided. For example, they may be played in chronological order as shown in step 306-8-8, or in the order of emphasis rank assigned in step SI306-8-6. In step SI306-8-9, the self-PR summary video is created by joining the audio paragraphs in the playback order decided in step SI306-8-8.
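Steps SI306-8-6 through SI306-8-8 can be sketched as follows: rank the audio paragraphs by emphasis, accumulate playback time in rank order, keep the number of paragraphs whose total is closest to the requested viewing time, and then fix the playback order. The `AudioParagraph` record and the function names are hypothetical, and the emphasis and calm probabilities are assumed to have been computed already by the emphasized state determination means. The ratio option assumes a non-zero calm probability.

```python
from dataclasses import dataclass


@dataclass
class AudioParagraph:
    start: float          # seconds from the start of the self-PR video
    end: float
    emphasis_prob: float  # probability of the emphasized state
    calm_prob: float      # probability of the calm state

    @property
    def duration(self) -> float:
        return self.end - self.start


def rank_by_emphasis(paragraphs, use_ratio=False):
    """Sort in descending order of emphasis probability, or of its ratio to the calm probability."""
    if use_ratio:
        return sorted(paragraphs, key=lambda p: p.emphasis_prob / p.calm_prob, reverse=True)
    return sorted(paragraphs, key=lambda p: p.emphasis_prob, reverse=True)


def select_for_duration(ranked, target_seconds):
    """Keep the top-k paragraphs whose accumulated duration is closest to the target."""
    best_k, best_gap, total = 0, target_seconds, 0.0
    for k, paragraph in enumerate(ranked, start=1):
        total += paragraph.duration
        gap = abs(total - target_seconds)
        if gap < best_gap:
            best_k, best_gap = k, gap
    return ranked[:best_k]


def summarize(paragraphs, target_seconds, chronological=True):
    """Return the selected paragraphs in playback order."""
    chosen = select_for_duration(rank_by_emphasis(paragraphs), target_seconds)
    return sorted(chosen, key=lambda p: p.start) if chronological else chosen
```

With `chronological=False` the paragraphs are played in order of emphasis rank, the second option mentioned for step SI306-8-8.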

[0072] The self-PR audio/video delivery unit 307 (see FIG. 6) transmits the self-PR summary audio/video created by the self-PR video summarization unit 306, together with attribute information including personal information, to the employer excavation unit 400. After the job seeker's self-PR summary audio/video received from the self-PR audio/video delivery unit 307 has been viewed at the employer excavation unit 400, the employer's evaluation information (based on viewing the job seeker's video) is entered there and transmitted to the employer evaluation receiving unit 308 of the data center 300. The employer's evaluation obtained here is transmitted, as necessary, to the job seeker terminal, that is, the job seeker registration unit 100.

[0073] FIG. 13 is a flowchart for explaining the operation of the employer excavation unit 400. In step SI402 the employer enters the employment conditions, for example for items such as those shown in FIG. 7. In step SI403 the self-PR video summary conditions are entered, for example as shown in FIG. 9. In step SI404 the self-PR summary video is viewed, and in step SI405 information indicating whether to view the self-PR summary video again is entered. When it is to be viewed again, information indicating whether the employment conditions are the same as in step SI402 is entered in step SI406; if they are not the same, the employment conditions of step SI402 are entered again. If they are the same, information indicating whether the self-PR summary conditions are the same is entered in step SI407; if they are not the same, the self-PR video summary conditions of step SI403 are entered again.

[0074] For example, the employer might view a 30-second self-PR summary video and, when inclined to consider hiring the job seeker, view a 60-second self-PR summary video from the second viewing onward. If the employer does not wish to view the self-PR summary video again in step SI405, information indicating whether to view the self-PR video is entered in step SI408. When the original self-PR video rather than the summary is to be viewed, it is viewed in step SI409, and information indicating whether to view it again is entered in step SI410. If it is to be viewed again, the process returns to step SI409; if not, information indicating whether to interview the job seeker is entered in step SI411.

[0075] Likewise, if viewing of the self-PR video is not desired in step SI408, information indicating whether to interview the job seeker is entered in step SI411. If an interview is desired, information indicating the interview request is transmitted to the job seeker's terminal in step SI412. The interview may be arranged, for example, by having the job seeker contact unit 309 (see FIG. 6) contact the job seeker and set a place where the employer and the job seeker meet, or the employer may contact the job seeker directly to decide the interview location. Instead of meeting in person, the interview may be conducted using a networked appliance such as an Internet telephone.

[0076] If an interview with the job seeker is not desired in step SI411, information indicating whether to hire the job seeker is entered in step SI413. When hiring is decided, the decision is likewise communicated by transmitting decision information to the job seeker's terminal in step SI412. When hiring is not decided, information indicating whether to put the hiring decision on hold, that is, whether to consider the job seeker later, is entered in step SI414. If a signal indicating that the decision will be made later is entered, the hiring information for the job seeker is retained and the job seeker is placed on hold in step SI415; if information indicating that the decision is not to be put on hold is entered, information indicating rejection is transmitted to the job seeker terminal. In step SI416, information is entered indicating whether to view the self-PR summary video of a job seeker, other than those on hold, whose self-PR summary video has not yet been viewed. Step SI416 is likewise reached after the information indicating the interview request has been transmitted to the job seeker's terminal in step SI412, and after a job seeker has been placed on hold in step SI415.

[0077] If viewing of another job seeker is desired, step SI406 is executed and the subsequent steps are repeated. If information indicating that viewing of another job seeker is not desired is entered in step SI416, information indicating whether to view the self-PR summary videos of the job seekers placed on hold is entered in step SI417, and the selection of step SI406 is made. By repeating the above, the job seekers the employer wants are decided, and the process ends at step SI418. FIG. 14 shows the charging procedure executed by the computer constituting the charging unit 500. The charging procedure consists of a job seeker registration fee charging step SI501, a user registration fee charging step SI502, a self-PR summary video viewing fee charging step SI503, a self-PR video viewing fee charging step SI504, and an intermediary fee charging step SI505, and it is established if any one of these applies.

[0078] In the job seeker registration fee charging step SI501, a registration fee is charged when a job seeker registers with the data center to seek work. This charging is carried out, for example, by deducting the amount of the registration fee from the balance of the job seeker's financial account and adding the amount of the registration processing fee to the balance of the data manager's financial account. The registration fee may be set, for example, at 3,000 yen per year, and during that period the job seeker can register personal information, self-PR videos, and the like with the data center 300. In the employer registration fee charging step SI502, the fee for an employer to register with the data center for recruiting is charged. This charging is carried out, for example, by deducting the amount of the registration fee from the balance of the employer's financial account and adding the amount of the registration processing fee to the balance of the data manager's financial account.
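The charging procedure described for steps SI501 and SI502 amounts to moving an amount from one account balance to another. A minimal sketch follows; the account names, the dictionary representation of balances, and the refusal to charge when the balance is insufficient are all assumptions, since the specification only describes the deduction and the addition.

```python
def charge(balances, payer, payee, amount_yen):
    """Deduct amount_yen from the payer's balance and add it to the payee's.

    Returns a new balances dict and leaves the input unchanged.
    Rejecting an overdraft is an assumed policy, not part of the source text.
    """
    if amount_yen < 0:
        raise ValueError("amount must not be negative")
    if balances[payer] < amount_yen:
        raise ValueError("insufficient balance")
    updated = dict(balances)
    updated[payer] -= amount_yen
    updated[payee] += amount_yen
    return updated
```

For example, the annual registration fees mentioned in the text would be charged as `charge(balances, "job_seeker", "data_manager", 3000)` for a job seeker.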

[0079] The registration fee in this case may be set, for example, at 10,000 yen per year, and during that period the employer can view the job seeker information registered in the data center 300 and the job seekers' self-PR videos. In the self-PR summary video viewing fee charging step SI503, a charge is made according to the number of job seekers the employer has viewed or the viewing time of the self-PR summary videos. This charging is also carried out, for example, by deducting from the balance of the employer's financial account the usage fee corresponding to the number of job seekers viewed or the viewing time, and adding the amount of the usage fee to the balance of the data manager's financial account. The usage fee may be set, for example, at 100 yen per job seeker, or at 1,000 yen per hour. Alternatively, a viewing fee of, for example, 100 yen per employer may be set each time an employer views a job seeker's self-PR summary video, and the amount of that viewing fee may be deducted from the job seeker's financial account.

[0080] In the self-PR video viewing fee charging step SI504, when an employer views a self-PR video, the employer is charged a usage fee according to the number of job seekers viewed or the viewing time of the self-PR videos. This charging is carried out by deducting the usage fee from the balance data of the employer's financial account and adding it to the balance of the data manager's financial account. The usage fee for the self-PR video may be set, for example, at 1,000 yen per job seeker, or at 1,000 yen per hour. Alternatively, a fee of 1,000 yen per employer may be set for each viewing of the self-PR summary video by an employer, in which case the job seeker whose self-PR summary video was viewed may be charged a usage fee for the number of employers who viewed it.
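The two ways of computing a viewing fee in steps SI503 and SI504 (per job seeker viewed, or per unit of viewing time) can be sketched as follows. The function names are hypothetical, and rounding a time-based fee down to whole yen is an assumption; the specification gives only the example rates.

```python
def fee_per_job_seeker(num_job_seekers_viewed, rate_yen):
    """Fee proportional to the number of job seekers whose video was viewed."""
    return num_job_seekers_viewed * rate_yen


def fee_per_time(viewing_seconds, rate_yen_per_hour):
    """Fee proportional to viewing time, rounded down to whole yen (assumed)."""
    return viewing_seconds * rate_yen_per_hour // 3600
```

With the example rates in the text, viewing the summary videos of 12 job seekers at 100 yen each gives `fee_per_job_seeker(12, 100)`, and 30 minutes of viewing at 1,000 yen per hour gives `fee_per_time(1800, 1000)`.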

[0081] In the intermediary fee charging step SI505, a charge is made when the job seeker is contacted in step SI412 shown in FIG. 13; for example, the employer is charged 1,000 yen per interview. Alternatively, the employer may be charged 10,000 yen each time a hiring decision is made, or 10,000 yen may be charged to the job seeker. As is clear from the above, compared with the conventional technique, the use of the audio summarization technique brings the improvement that self-PR videos can be summarized. In addition, the employer can view the self-PR summary video for the desired time or number of scenes, which reduces the employer's recruiting workload. The employer can also freely view the videos of the desired job seekers, and the job seekers' videos can make a strong impression on the employer, which makes recruiting more efficient. Furthermore, job seekers have their self-PR videos viewed by employers and can present themselves by means other than text information, which enables job seeking that does not depend on text information.

[Embodiment 2] As an application of Embodiment 1, Embodiment 2 will now be described with reference to FIGS. 15 and 16.

[0082] The summary condition input unit 306-1, the self-PR video input unit 306-2, the video/audio separation unit 306-3, the self-PR item detection unit 306-4, the cut point extraction unit 306-5, the speech recognition unit 306-6, and the audio summarization unit 306-8 shown in FIG. 15 are the same as those shown in FIG. 8. Embodiment 2 is characterized in that the processing of an audio keyword extraction unit 306-9, a facial expression extraction unit 306-10, and a video editing unit 308-11 is applied after the processing of the audio summarization unit 306-8, and in that the processing of a text summarization unit 306-12 is applied after the speech recognition processing in the speech recognition unit 306-6.

[0083] In the audio keyword extraction step executed by the audio keyword extraction unit 306-9, a probability (keyword likelihood) is obtained that indicates how likely a word is to be one uttered repeatedly and with emphasis among the spoken words. Keyword extraction is described, for example, in "Reference Interval-free Continuous DP (RIFCDP) for spotting by arbitrary sections of reference patterns" (Yoshiaki Ito, Jiro Kinoshita, Hiroshi Kojima, Susumu Seki, Ryuichi Oka, IEICE Technical Report, SP95-34, 1995-06). For the facial expression extraction method executed by the facial expression extraction unit 306-10, the method described in, for example, Japanese Patent Laid-Open No. 11-232456 can be used. A time series is obtained of the probability (basic expression likelihood) indicating the likelihood of each basic expression (anger, disgust, fear, sadness, happiness, surprise) relative to a neutral expression.

[0084] In the video editing unit 306-11, in addition to the audio paragraph section information judged to be in the emphasized state obtained by the audio summarization unit 306-8, video sections in which the keyword likelihood obtained by the keyword extraction unit 306-9 is at or above a predetermined first threshold, or in which the basic expression likelihood is at or above a predetermined second threshold, may be extracted as summary sections and used as the self-PR summary video. FIG. 16 shows a schematic diagram of the self-PR summary video creation method of Embodiment 2. Step SII306-11-1 obtains the audio paragraph sections containing emphasis, step SII306-11-2 the emphasis probability, step SII306-1-3 the keyword likelihood, and step SII306-11-4 the basic expression likelihood.
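The threshold rule described for the video editing unit can be sketched as follows: a section enters the summary if it is already an emphasized audio paragraph section, or if its keyword likelihood reaches the first threshold, or if its basic expression likelihood reaches the second threshold. Representing the video as a list of equal sections with one value per section is an assumption made for illustration, as are the function and parameter names.

```python
def summary_sections(emphasized, keyword_likelihood, expression_likelihood,
                     keyword_threshold, expression_threshold):
    """Return one flag per section; True means the section goes into the summary."""
    return [
        is_emphasized or kw >= keyword_threshold or ex >= expression_threshold
        for is_emphasized, kw, ex in zip(emphasized, keyword_likelihood,
                                         expression_likelihood)
    ]
```

The flagged sections would then be cut from the self-PR video and joined in sequence.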

[0085] The probabilities obtained in steps SII306-11-2, SII306-11-3, and SII306-11-4 are multiplied to obtain a probability value, and based on this value summary sections are further extracted from the audio paragraph sections judged to be in the emphasized state in step SII306-11-1. For example, sections whose probability value exceeds a predetermined threshold may be determined to be summary sections and joined in sequence to generate the self-PR summary video. In multiplying the probabilities, the three probabilities may be weighted with different contribution rates. For example, the emphasis probability may be given a large effect and the basic expression likelihood a small effect.
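The combination of the three probabilities can be sketched as follows. The specification says only that the probabilities are multiplied and may be weighted with different contribution rates; applying each weight as an exponent is one possible reading and is an assumption here, as are the function names and the per-section tuple layout.

```python
def combined_score(emphasis, keyword, expression, weights=(1.0, 1.0, 1.0)):
    """Weighted product of the three values; each weight is applied as an exponent.

    A weight of 1.0 leaves a factor unchanged and a smaller weight reduces
    its influence on the product (assumed weighting scheme).
    """
    w_emphasis, w_keyword, w_expression = weights
    return (emphasis ** w_emphasis) * (keyword ** w_keyword) * (expression ** w_expression)


def select_sections(sections, threshold, weights=(1.0, 1.0, 1.0)):
    """Return indices of sections whose combined score exceeds the threshold.

    Each section is an (emphasis, keyword, expression) tuple.
    """
    return [i for i, (e, k, x) in enumerate(sections)
            if combined_score(e, k, x, weights) > threshold]
```

Lowering the third weight, as in `weights=(1.0, 1.0, 0.5)`, reduces the effect of the basic expression likelihood relative to the emphasis probability, in line with the example in the text.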

[0086] As described above, compared with the conventional technique, Embodiment 2 brings the improvement that, by using the keyword likelihood, a self-PR summary video can be created that includes the job seeker's characteristic turns of phrase and the key points of the job seeker's application. In addition, by using the basic expression likelihood, a self-PR summary video can be created that includes the job seeker's facial expressions, which are contained only in the video information, so that the employer can take the job seeker's characteristic expressions into account in the hiring evaluation. The summary information providing method according to the present invention described above can be realized by executing the summary information providing program of the present invention on the computer constituting the data center 300. The program is downloaded via a communication line, or installed from a storage medium such as a CD-ROM or a magnetic disk, into processing means such as a CPU and executed.

[0087]

[Effects of the Invention] As is clear from the above, compared with the conventional technique, the use of the audio summarization technique has the effect that the summarization of self-PR videos can be automated. In addition, the employer can view the self-PR summary video for the desired time or number of scenes, which reduces the employer's recruiting workload. The employer can also freely view the videos of the desired job seekers, and the job seekers' videos can make a strong impression on the employer, which makes recruiting more efficient. Furthermore, job seekers have their self-PR videos viewed by employers and can present themselves by means other than text information, which enables job seeking that does not depend on text information.

[0088] Further, by using the keyword likelihood described in Embodiment 2, a self-PR summary video can be created that includes the job seeker's turns of phrase and habits of speech and the key points of the job seeker's application. In addition, by using the basic expression likelihood, a self-PR summary video can be created that includes the job seeker's facial expressions, which are contained only in the video information, so that the employer can take the job seeker's characteristic expressions into account in the hiring evaluation.

[Brief description of drawings]

【図１】この発明による要約情報配信システムの基本構
成を説明するためのブロック図。FIG. 1 is a block diagram for explaining the basic configuration of a summary information distribution system according to the present invention.

【図２】図１に示した求職者登録部の構成を説明するた
めのブロック図。FIG. 2 is a block diagram for explaining the configuration of a job seeker registration unit shown in FIG.

【図３】図２に示した求職者登録部で行なわれる個人情
報を登録する手順を説明するための流れ図。3 is a flowchart for explaining a procedure for registering personal information performed by a job seeker registration unit shown in FIG.

【図４】図２に示した映像撮影部のデータ取得状況を説
明するための図。FIG. 4 is a diagram for explaining a data acquisition situation of the image capturing unit shown in FIG.

【図５】図２に示した求職者が自己ＰＲ映像を撮影する
際の、ＰＲする項目の例を示す図。FIG. 5 is a diagram showing an example of items to be PR when the job seeker shown in FIG. 2 shoots a self-PR video.

【図６】図１に示したデータセンタの内部の構成を説明
するためのブロック図。FIG. 6 is a block diagram for explaining an internal configuration of the data center shown in FIG.

【図７】図６に示した採用条件入力部に採用条件を入力
する例を示す図。FIG. 7 is a diagram showing an example of inputting an adoption condition into an adoption condition input unit shown in FIG.

【図８】図６に示した自己ＰＲ音声映像要約部の構成の
一例を説明するためのブロック図。8 is a block diagram for explaining an example of a configuration of a self-PR audio / video summarizing unit shown in FIG.

【図９】図８に示した要約条件入力部に要約条件を入力
する例を示す図。9 is a diagram showing an example of inputting a summary condition to a summary condition input section shown in FIG.

FIG. 10 is a flowchart for explaining the operation of the audio-video summarizing unit shown in FIG. 8.

FIG. 11 is a flowchart for explaining the operation of the video editing unit shown in FIG. 8.

FIG. 12 is a flowchart showing an example of adding a caption (telop) of a self-PR item to the self-PR summary audio-video information shown in FIG. 11.

FIG. 13 is a flowchart showing an example of the procedure up to the point where the employer decides to hire, in the employer discovery unit shown in FIG. 1.

FIG. 14 is a flowchart showing an example of the billing procedure in the billing unit shown in FIG. 1.

FIG. 15 is a flowchart for explaining the second embodiment of the present invention.

FIG. 16 is a flowchart for explaining an example of generating a self-PR summary audio-video in the video editing unit shown in FIG. 15.

FIG. 17 is a flowchart for explaining the previously proposed speech summarization method.

FIG. 18 is a flowchart for explaining the previously proposed method of extracting voice paragraphs.

FIG. 19 is a diagram for explaining the relationship between voice paragraphs and voice sub-paragraphs.

FIG. 20 is a flowchart showing an example of a method of determining the utterance state of an input voice sub-paragraph in step S2 shown in FIG. 17.

FIG. 21 is a flowchart showing an example of the procedure for creating the codebook used in the previously proposed speech summarization method.

FIG. 22 is a diagram showing an example of the contents stored in the codebook used in the present invention.

FIG. 23 is a waveform chart for explaining the utterance state likelihood calculation.

FIG. 24 is a block diagram for explaining an embodiment of the previously proposed speech emphasis state determination device and speech summarization device.

FIG. 25 is a flowchart for explaining a summarization method in which the summarization rate can be changed freely.

FIG. 26 is a flowchart for explaining the operation of extracting the voice sub-paragraphs used for speech summarization, the operation of calculating the emphasis probability of each voice sub-paragraph, and the operation of calculating the calm probability of each voice sub-paragraph.

FIG. 27 is a diagram for explaining the structure of the speech emphasis probability table used in the speech summarization device.

FIG. 28 is a block diagram for explaining an example of a speech summarization device in which the summarization rate can be changed freely.

[Explanation of symbols]

100 Job seeker registration unit
200 Network
300 Data center
301 Job seeker data input unit
302 Job seeker personal information database
303 Self-PR audio-video database
304 Employment condition input unit
305 Search unit
306 Self-PR audio-video summarizing unit
307 Self-PR audio-video distribution unit
308 Employer evaluation receiving unit
309 Contact unit
400 Employer discovery unit
500 Billing unit

Continuation of front page

(51) Int.Cl.⁷: G10L 11/02, 15/00, 15/04, 15/06, 15/10; H04N 5/91
FI: G10L 3/00 521U; H04N 5/91 N; G10L 3/00 551G, 513B, 531N, 551A, 513A
(72) Inventor: Osamu Mizuno, c/o Nippon Telegraph and Telephone Corporation, 2-3-1 Otemachi, Chiyoda-ku, Tokyo
(72) Inventor: Hidekatsu Kuwano, c/o Nippon Telegraph and Telephone Corporation, 2-3-1 Otemachi, Chiyoda-ku, Tokyo
(72) Inventor: Haruhiko Kojima, c/o Nippon Telegraph and Telephone Corporation, 2-3-1 Otemachi, Chiyoda-ku, Tokyo
F-term (reference): 5B075 ND12 ND14 NK06 NS01 UU08; 5C053 FA14 JA01 JA30 LA01; 5D015 CC18 FF00

Claims

[Claims]

1. A method for providing summary information, characterized by: using a data accumulating means that accumulates, item by item, an audio signal recorded simultaneously with a video signal together with attribute information of the audio signal, and a codebook that stores, in association with each other, a feature amount including at least a fundamental frequency or pitch period, power, a temporal change characteristic of a dynamic feature amount, or an inter-frame difference of these, and an appearance probability in an emphasized state; receiving desired attribute information as input; reading, from the data accumulating means, attribute information that satisfies the condition indicated by the desired attribute information and the item-by-item video signal and audio signal corresponding to that attribute information; obtaining the appearance probability in the emphasized state corresponding to the feature amount obtained by analyzing the audio signal frame by frame; calculating the probability of being in the emphasized state based on the appearance probability in the emphasized state; determining, as a summary section, an audio signal section in which the probability of being in the emphasized state is larger than a predetermined probability; and outputting the video signal of the summary section and at least a part of the read attribute information.

2. The method for providing summary information according to claim 1, wherein: the codebook further stores an appearance probability in a quiet state in association with the feature amount including at least a fundamental frequency or pitch period, power, a temporal change characteristic of a dynamic feature amount, or an inter-frame difference of these, and with the appearance probability in the emphasized state; the appearance probability in the quiet state corresponding to the feature amount obtained by analyzing the audio signal frame by frame is obtained; the probability of being in the quiet state is calculated based on the appearance probability in the quiet state; the probability ratio of the probability of being in the emphasized state to the probability of being in the quiet state is calculated for each audio signal section; the durations of the audio signal sections are accumulated in descending order of the probability ratio to calculate the total time of the summary sections; and the audio signal sections for which the total time of the summary sections reaches a predetermined summary time are determined as the summary sections.

3. The method for providing summary information according to claim 1, wherein: for each frame of the audio signal, it is determined whether or not the frame is a silent section and whether or not it is a voiced section; a portion that is bounded by silent sections of a predetermined number of frames or more and that includes a voiced section is determined to be a voice sub-paragraph; a group of voice sub-paragraphs ending with a voice sub-paragraph in which the average power of the voiced sections it contains is smaller than a predetermined constant multiple of the average power within that voice sub-paragraph is determined to be a voice paragraph; the audio signal section is defined for each voice paragraph; the summary time is obtained by accumulation for each voice paragraph; and the video signal and the audio signal of the summary section are output for each voice paragraph in descending order of the probability of being in the emphasized state or of the probability ratio.

4. A summary information providing device, comprising: a data accumulating means that accumulates, item by item, an audio signal recorded simultaneously with a video signal together with attribute information of the audio signal; a codebook that stores, in association with each other, a feature amount including at least a fundamental frequency or pitch period, power, a temporal change characteristic of a dynamic feature amount, or an inter-frame difference of these, and an appearance probability in an emphasized state; an emphasized state probability calculation unit that, upon input of desired attribute information, reads from the data accumulating means attribute information that satisfies the condition indicated by the desired attribute information and the item-by-item video signal and audio signal corresponding to that attribute information, obtains the appearance probability in the emphasized state corresponding to the feature amount obtained by analyzing the audio signal frame by frame, and calculates the probability of being in the emphasized state based on the appearance probability in the emphasized state; a summary section determination unit that determines, as a summary section, an audio signal section in which the probability of being in the emphasized state is larger than a predetermined probability; and an output unit that outputs the video signal of the summary section and at least a part of the read attribute information.

5. A summary information providing program, characterized by being written in computer-readable code and causing a computer to execute the summary information providing method according to any one of claims 1 to 4.
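
As an informal illustration only, and not part of the claims, the summary section selection recited in claims 1 and 2 might be sketched as follows. Several points are assumptions made for the sketch rather than details taken from this publication: each frame is assumed to be already quantized to a codebook index, the probability of a section is assumed to be the product of its per-frame appearance probabilities (computed as a sum of logarithms), and the names `Section`, `log_prob`, `above_threshold`, and `select_by_ratio` are hypothetical.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass
class Section:
    """One audio signal section (hypothetical container for this sketch)."""
    start: float      # start time in seconds
    end: float        # end time in seconds
    codes: list[int]  # assumed: one codebook index per analyzed frame


def log_prob(codes: list[int], table: dict[int, float]) -> float:
    """Log of the section probability, assumed to be the product of the
    per-frame appearance probabilities stored in the codebook table."""
    return sum(math.log(table[c]) for c in codes)


def above_threshold(sections, p_emph, min_log_prob):
    """Claim 1 style: keep sections whose emphasized-state probability
    exceeds a predetermined value (given here as a log probability)."""
    return [s for s in sections if log_prob(s.codes, p_emph) > min_log_prob]


def select_by_ratio(sections, p_emph, p_quiet, target_seconds):
    """Claim 2 style: rank sections by the emphasized/quiet probability
    ratio and accumulate durations until the summary time is reached."""
    ranked = sorted(
        sections,
        key=lambda s: log_prob(s.codes, p_emph) - log_prob(s.codes, p_quiet),
        reverse=True,
    )
    chosen, total = [], 0.0
    for s in ranked:
        if total >= target_seconds:
            break
        chosen.append(s)
        total += s.end - s.start
    # Return the chosen sections in playback order with the total time.
    return sorted(chosen, key=lambda s: s.start), total
```

In this sketch the accumulated time may overshoot the target by up to one section, because whole sections are added; how the boundary case is handled in the actual method is not specified here.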