WO2022230111A1 - Voice communication device and call voice processing method

Info

Publication number: WO2022230111A1
Application number: PCT/JP2021/016990
Authority: WO
Inventors: 耕治鹿庭; 和彦吉澤; 展明甲; 万寿男奥
Original assignee: マクセル株式会社
Priority date: 2021-04-28
Filing date: 2021-04-28
Publication date: 2022-11-03
Also published as: JPWO2022230111A1

Abstract

The present invention provides a voice communication device that uses telecommunications. The voice communication device includes a microphone, a communicator that transmits and receives a call voice, a booster that boosts a high frequency level in a call voice range, a caller state collection device that collects caller state information to detect whether a caller is wearing a mask, and a processor. The processor: analyzes the caller state information to detect whether the caller is wearing a mask; outputs, to the booster, a boost control signal for controlling the high frequency boost of the call voice collected by the microphone according to the result of detecting whether the caller is wearing a mask; and causes the communicator to transmit the call voice on which the boost control process has been performed.

Description

Communication device and communication voice processing method

The present invention relates to a communication device and a call voice processing method, and more particularly to a communication device and a call voice processing method that reduce the difficulty of hearing even when a person wearing a mask makes a voice call.

In conversations using telecommunication means (hereinafter referred to as calls), there have long been proposals for improving call quality. For example, Patent Document 1 states: "Two sound quality change modes are prepared: a noise mode that corrects voice masking when the caller is in a noisy environment, and a sound quality mode that is used when it is difficult to hear the other party's voice during a normal call, and both are stored in a memory. By reading out the data stored in the memory and having the arithmetic processing unit control the digital equalizer, noise suppressor, volume, echo canceller, and sidetone addition circuit with this data, the noise mode or the sound quality mode can be set (excerpt from the abstract)."

Also, in recent years, there are more and more opportunities to wear masks on a daily basis in order to reduce the effects of hay fever and yellow sand, as well as prevent viral infections such as COVID-19. It has been pointed out that wearing a mask in daily life makes it difficult to hear conversations and calls because the mask blocks vocalizations.

Non-Patent Document 1 introduces a hearing aid that corrects difficulty in hearing conversations caused by masks. In the mask mode, the hearing aid boosts the high frequency range of 2 kHz to 5 kHz that is attenuated by the mask to compensate for difficulty in hearing.

Patent Document 1: Japanese Patent Application Laid-Open No. 2001-136239

In Non-Patent Document 1, the mask mode has to be activated when the user judges that hearing is difficult, or when the user confirms that the other party is wearing a mask. In a call, however, the other party is generally far away, and it is difficult to tell whether the difficulty in hearing is caused by a mask or by a poor communication environment. It is also undesirable, from the viewpoint of smooth communication, for the listener to confirm each time whether the speaker is wearing a mask. In practice, therefore, it is difficult for the person listening to the call to take action to alleviate the difficulty of hearing caused by a mask.

Patent Document 1, moreover, does not mention correcting the deterioration of speech quality caused by masks in the first place. Therefore, neither Patent Document 1 nor Non-Patent Document 1 solves the problem of setting a call voice quality mode that accounts for mask wearing during a call.

The present invention was made to solve the above-mentioned problems, and aims to correct the deterioration of call quality caused by talking while wearing a mask without making the person listening to the call aware of it.

In order to solve the above problems, the present invention has the configuration described in the claims. One example is a communication device using telecommunications, comprising: a microphone; a communicator that transmits and receives call voice; a booster that boosts the high-frequency level in the call voice range; a caller state collection device that collects caller state information for detecting whether a caller is wearing a mask; and a processor connected to each of the microphone, the communicator, the booster, and the caller state collection device. The processor executes: a mask detection step of analyzing the caller state information to detect whether or not the caller is wearing a mask; a boost control step of outputting to the booster, according to the mask detection result, a boost control signal for controlling the high-frequency boost of the call voice collected by the microphone; and a transmission step of transmitting, from the communicator, the call voice after the boost control processing by the booster.

According to the present invention, it is possible to correct deterioration of call quality caused by talking while wearing a mask without making the person listening to the call aware of it. Objects, configurations, and effects of the invention other than those described above will be clarified in the following embodiments.

FIG. 1 is a hardware configuration diagram of a communication device (smartphone).
FIG. 2 is a functional block diagram of a program stored in the smartphone.
FIG. 3 is a configuration diagram of the call system according to the first embodiment.
FIG. 4 is a diagram showing the home screen of the smartphone.
FIG. 5 is a diagram showing the dial screen displayed by the call program.
FIG. 6 is a diagram showing the display screen when there is an incoming call.
FIG. 7 is a flowchart showing the call voice processing method according to the first embodiment.
FIG. 8 is a configuration diagram of the call system according to the second embodiment.
FIG. 9 is a flowchart showing the call voice processing method according to the second embodiment.
FIG. 10 is a flowchart showing the call voice processing method according to the third embodiment.
FIG. 11 is an explanatory diagram of the classifying process of the voice analysis process.
FIG. 12 is a flowchart showing the flow of processing in which the image analysis process and the voice analysis process are used together for mask detection.
FIG. 13 is an external view of an HMD, which is one form of a communication device.
FIG. 14 is a hardware block diagram of the HMD.
FIG. 15 is a functional block diagram of a program stored in the HMD.
FIG. 16 is a diagram explaining a call scene using the HMD.

Hereinafter, embodiments of the present invention will be described with reference to the drawings. In the following description, the same reference numerals are given to the same configurations and steps throughout the drawings, and repeated descriptions will be omitted.

[First embodiment]
A first embodiment will be described with reference to FIGS. 1 to 7. The first embodiment takes the smartphone 1 as an example of a call device. In this embodiment, a camera is used as the caller state collection device that collects caller state information for detecting whether or not a caller is wearing a mask, and the caller state information is the captured data obtained by imaging with the camera. All the embodiments use a call device that uses telecommunication. Here, "using telecommunication" means a communication form that converts call voice into an electric signal and transmits and receives it, and the form of transmission and reception may be of any type, such as wired communication, wireless communication, packet communication, or mobile communication via an IP network. In addition, mobile communication via an IP network allows a video call in which video and call voice are transmitted simultaneously as call data, and such a video call is also included in the calls of the present embodiment.

Figure 1 is a hardware configuration diagram of a smartphone.

The smartphone 1 includes a processor 30, a storage 40, a GPS receiver 51, a geomagnetic sensor 52, an acceleration sensor 53, a gyro sensor 54, a LAN communication device 61, a mobile communication device 62, a short-range wireless communication device 63, a display 71, an in-camera 72, an out-camera 73, a microphone 81, a speaker 82, a touch sensor 91, an operation key 92, and a booster 95.

The processor 30 is a microprocessor unit that controls the entire smartphone 1 according to a predetermined operating program, and includes, for example, a CPU and an MPU (Micro Processor Unit).

The system bus 31 is a data communication path for transmitting and receiving various commands and data between the processor 30 and each component block within the smartphone 1 .

The storage 40 includes a ROM 41 that stores programs for controlling the operation of the smartphone 1; a non-volatile memory 42 that stores various data such as operation setting values, detection values from sensors, objects including content, and library information downloaded from a library; and a rewritable RAM 43 used as a work area and the like in various program operations. A flash ROM may be used as the non-volatile memory 42, or another memory medium may be used.

The storage 40 needs to retain stored information even when power is not supplied to the smartphone 1 from the outside. Therefore, instead of the nonvolatile memory 42, devices such as semiconductor element memories such as SSDs (Solid State Drives), magnetic disk drives such as HDDs (Hard Disc Drives), and the like may be used.

The storage 40 can store operation programs downloaded from the network and various data created by the operation programs. In addition, it is possible to store captured data such as moving images and still images captured using the imaging function of the in-camera 72 and the out-camera 73 .

The smartphone 1 includes a GPS (Global Positioning System) receiver 51 , a geomagnetic sensor 52 , an acceleration sensor 53 and a gyro sensor 54 . These sensors make it possible to detect the position, tilt, direction, movement, etc. of the smartphone 1 . Also, the smartphone 1 may further include other sensors such as an illuminance sensor, an altitude sensor, a proximity sensor, and the like.

The LAN communication device 61 is connected to a wide area network via an access point or the like, and transmits and receives data to and from an external server on the wide area network. The connection with an access point or the like may be made through a wireless communication connection such as Wi-Fi (registered trademark).

The mobile communication device 62 performs telephone communication (calls) and data transmission/reception through wireless communication with a mobile phone base station or the like of a mobile phone communication network. Communication with mobile phone base stations and the like may use 4G (4th Generation) or 5G mobile communication, the W-CDMA (Wideband Code Division Multiple Access) (registered trademark) method, the GSM (Global System for Mobile communications) method, the LTE (Long Term Evolution) method, or any other communication method. The mobile communication device 62 handles voice calls over mobile communication, and the LAN communication device 61 connects to the wide area network over wireless LAN.

The short-range wireless communication device 63 exchanges information with external Bluetooth (registered trademark) devices and external NFC-compatible devices through Bluetooth (registered trademark) communication, NFC standard communication, and the like.

The LAN communication device 61, the mobile communication device 62, and the short-range wireless communication device 63 each have an encoding circuit, a decoding circuit, an antenna, and the like. In addition to the communication device described above, another communication device such as an infrared communication device may be provided.

The display 71 is, for example, a display device such as a backlit liquid crystal display or a self-luminous organic EL display, and displays, among other things, captured image data.

The in-camera 72 is provided on the same surface of the smartphone 1 as the display 71. The in-camera 72 captures an image of a caller (who is both a speaker and a listener) using the smartphone 1 and generates facial image data of the caller. This face image data is also used for so-called Face ID, in which the smartphone 1 executes face recognition processing and permits use of the smartphone 1.

The out-camera 73 is provided on the back surface of the smartphone 1 and is used, for example, when capturing an image of a landscape or the like.

Each of the in-camera 72 and the out-camera 73 is a camera that uses an electronic device such as a CCD (Charge Coupled Device) or a CMOS (Complementary Metal Oxide Semiconductor) sensor to convert the light input from the lens into an electrical signal, thereby inputting image information of the surroundings and of objects.

The microphone 81 converts sound in the real space, user's voice, etc. into voice information and inputs it.

The speaker 82 outputs voice information and the like necessary for the user. Earphones and headphones can also be connected and used selectively depending on the application.

The touch sensor 91 is stacked on the display screen of the display 71 .

The operation key 92 is configured by arranging button switches and the like.

The touch sensor 91 and the operation key 92 are examples of operation input devices for inputting operation instructions to the smartphone 1, and other operation input devices may be used. Alternatively, the smartphone 1 may be operated using a separate portable terminal device connected by wired communication or by wireless communication using the LAN communication device 61 or the short-range wireless communication device 63.

Alternatively, the image captured by the in-camera 72 may be analyzed, and the smartphone 1 may be operated by actions such as gestures.

Fig. 2 is a functional block diagram of the program stored in the smartphone.

The non-volatile memory 42 stores a basic operation program 421 and a call program 422 as processing programs. The nonvolatile memory 42 further includes a data storage area 430 for various data used when executing each program. The processor 30 reads each program stored in the nonvolatile memory 42 and refers to data stored in the data storage area 430 as necessary.

In FIG. 2, when the call processing unit 4222 detects a press signal of the call application icon 22 (see FIG. 4) or an incoming call signal from another call device, the processor 30 executes the function of the mask detection unit 4220. In the first embodiment, the mask detection unit 4220 detects the presence or absence of a mask by image analysis processing. To do so, it activates the in-camera 72, acquires image data, performs image analysis processing on the image data, and carries out the mask detection processing.

The mask detection unit 4220 outputs the mask presence/absence detection result to the boost control unit 4221. The boost control unit 4221 outputs a boost control signal to the booster 95 according to the mask presence/absence detection result. The booster 95 performs high-frequency boost processing on the call voice acquired from the microphone 81, and the result is passed through the call processing unit 4222 to the communication device and transmitted.

In the third embodiment, voice analysis is performed on the call voice collected by the microphone 81 to detect the presence or absence of a mask. Details of the movement of each signal and data item shown in FIG. 2 will be clarified as needed in the following description.

The processor 30 loads the basic operation program 421 and the call program 422 into the RAM 43 and executes them.

The various data stored in the data storage area 430 are mainly data necessary for executing the call program 422. For example, a caller face image in which the caller's face is captured (referred to as "registered face data 431"), a mask database (D/B) 432, registered voice data 433, and class data 434 are stored as needed, depending on the call voice processing method. The nonvolatile memory 42 may be a single memory medium as shown, or may be composed of a plurality of memory media.

FIG. 3 is a configuration diagram of the call system according to the first embodiment. Caller 2 and caller 2A are both speakers and listeners. A call will be made using mobile communication, but Internet protocol (also referred to as IP protocol) may be used in a wireless LAN. Further, in FIG. 3, either the smartphone 1 or the smartphone 1A may be the calling device according to the present invention, or both the smartphone 1 and the smartphone 1A may be the calling devices according to the present invention.

Further, in FIG. 3, reference numerals 3 and 3A are masks, reference numerals 4 and 4A are mobile communication base stations, reference numeral 5 is a mobile communication network, reference numerals 6 and 6A are wireless LAN access points, reference numerals 60 and 60A are wireless LAN signals, and reference numeral 7 is a wide area network. The mobile communication base stations 4 and 4A are main components of the mobile communication network 5 and are also components of the wide area network 7.

Caller 2 uses smartphone 1 to, for example, dial up and start a call. A signal for a call is transmitted to the mobile communication base station 4, and transmitted to the smartphone 1A held by the caller 2A via the mobile communication network 5 and the mobile communication base station 4A. When the caller 2A accepts the incoming call using the smartphone 1A, a call is made between the caller 2 and the caller 2A.

On the other hand, smartphone 1 and smartphone 1A are also connected to the wide area network 7. There are two types of routes: a route via the mobile communication base stations 4 and 4A, and a route via the access points 6 and 6A using the wireless LAN signals 60 and 60A.

FIGS. 4 to 6 show the display screens of the smartphone 1. For the sake of explanation, the smartphone 1 operated by the caller 2 in FIG. 3 will be described below as an example, but the explanation regarding the smartphone 1 also applies to the smartphone 1A.

FIG. 4 is a diagram showing the home screen 20 of the smartphone 1. The call port 83 is connected to the microphone 81 and the speaker 82. Pressing the operation key 92 displays the home screen 20.

On the home screen 20, time, location information, weather information, and a search window are arranged at the top, but these are not essential. What is important on the home screen 20 is an icon button group 21 for starting an application program. By pressing the icon button corresponding to the application program, the home screen 20 is shifted to the screen of the application program.

The icon button group 21 includes a call application icon 22 that activates the call program. The caller 2 or 2A presses this icon to start the call program.

FIG. 5 is a diagram showing the dial screen 20a displayed by the call program. A numeric keypad 23 and a dial start button 24 are displayed. The caller uses the numeric keypad 23 to input the number of the other party and presses the dial start button 24 to call the other party. As another way of calling the other party, a telephone directory may be displayed and the other party may be selected from it.

Furthermore, there are cases other than pressing the call application icon 22 by the caller to activate the call program. FIG. 6 shows such a case, showing the display screen 20b when there is an incoming call.

On the display screen 20b shown when there is an incoming call, information 27 specifying the calling party (for example, the calling party's name, telephone number, or information indicating that the number is withheld), a call acceptance button 25, and a call rejection button 26 are displayed. When called by the other party, the caller confirms the name of the other party and presses the call acceptance button 25 to start the call.

The dial screen 20a in FIG. 5 and the incoming-call display screen 20b in FIG. 6 correspond to the initial screens of the call program. On these initial screens, the caller looks at the display screen. At this time, the in-camera 72 is activated to image the face of the caller.

This processing is realized as follows: when the processor 30 detects the press signal of the call application icon 22 and the call program 422 is activated, the call program 422 outputs an imaging instruction signal to the in-camera 72.

FIG. 7 is a flowchart showing the call voice processing method according to the first embodiment. As described for FIG. 1, each step of the call program 422 is executed by the processor 30 of the smartphone 1.

At the start of the process, the home screen 20 is displayed on the smartphone 1 (S10). If the caller presses the call application icon 22 (see FIG. 4) (S11: YES), or if a call is received (see FIG. 6) (S12: YES), the processor 30 activates the call program 422. If NO in both steps S11 and S12, the process waits in step S10.

If S11 or S12 is YES, the call program 422 executes the call process (S13-S17) and the image analysis process (S20, consisting of S21-S24) in parallel as multiple processes.

In the image analysis process (S20), the mask detection unit 4220 activates the in-camera 72 (S21) and acquires the camera image (S22). The mask detection unit 4220 performs subject detection processing on the camera image and performs face recognition processing (S23). The mask detection unit 4220 determines whether the face image includes a mask (S24: corresponds to the mask detection step).

When it is determined that there is a mask (S24: YES), a boost control signal indicating execution of boost is output to the boost control unit 4221. If it is determined that there is no mask (S24: NO), a boost control signal indicating no boost (non-execution of boost) is output to the boost control unit 4221.

In the call process, the call processing unit 4222 sends a call request to the dialed number (S13), and when the other party gives permission, the call starts (S15). Alternatively, the call processing unit 4222 responds to the incoming call request from the communication partner (S14), and starts the call (S15).

The booster 95 performs boost control processing according to the boost control signal (S16: corresponds to the boost control step). Using the boost control signal output in step S24, the booster 95 turns the boost in the high-frequency band, specifically 2 kHz to 5 kHz, ON or OFF. The smartphone 1 transmits the call voice that has undergone boost control in step S16 to the mobile communication network 5 or the wide area network 7 (corresponds to the transmission step).
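The ON/OFF boost of step S16 can be sketched in software as follows. The source does not specify a filter design, so this sketch assumes a single peaking-equalizer biquad (the common Audio EQ Cookbook form) centered at the geometric mean of 2 kHz and 5 kHz, with an assumed gain of 6 dB. The function names `peaking_coeffs` and `boost_control` are hypothetical.

```python
import math

def peaking_coeffs(fs, f_lo=2000.0, f_hi=5000.0, gain_db=6.0):
    """Biquad peaking-EQ coefficients covering roughly f_lo..f_hi (assumed design)."""
    f0 = math.sqrt(f_lo * f_hi)          # center frequency (geometric mean of the band)
    q = f0 / (f_hi - f_lo)               # Q derived from the band width
    a = 10.0 ** (gain_db / 40.0)
    w0 = 2.0 * math.pi * f0 / fs
    alpha = math.sin(w0) / (2.0 * q)
    num = [1.0 + alpha * a, -2.0 * math.cos(w0), 1.0 - alpha * a]
    den = [1.0 + alpha / a, -2.0 * math.cos(w0), 1.0 - alpha / a]
    # Normalize so that the leading denominator coefficient is 1.
    return [x / den[0] for x in num], [x / den[0] for x in den]

def boost_control(samples, fs, boost_on):
    """Apply the high-frequency boost when the control signal is ON, else pass through."""
    if not boost_on:
        return list(samples)
    b, a = peaking_coeffs(fs)
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in samples:
        # Direct-form I difference equation of the biquad.
        y = b[0] * x + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, x
        y2, y1 = y1, y
        out.append(y)
    return out
```

With these assumed parameters, a tone at the band center is amplified by about 6 dB while low-frequency content passes almost unchanged. A hardware booster 95 would realize the same behavior with a circuit rather than a loop over samples.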

The call processing unit 4222 determines whether the call has ended (S17). If YES, it ends the call and returns to the home screen (S10); if NO, it continues the call.

Note that the image analysis process S20 performs mask detection at the start of the call program, but it may also be continued during the call, as indicated by the dashed line returning from S23 to S22 in FIG. 7. Running it continuously allows a change in state during a call, such as putting on or taking off a mask, to be detected. Furthermore, in the face recognition processing of S23, the first recognized face image may be stored as the registered face data 431 in the data storage area 430 and referred to afterwards. Performing mask detection only on a face image that matches the registered face image prevents an erroneous mask detection result when the face of a person other than the caller is captured.
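The registered-face check can be sketched as a small gate in front of the mask detection result. The sketch assumes that some face recognizer, not specified in the source, has already converted each face image into a feature vector. The cosine-similarity comparison, the threshold of 0.8, and the class name `RegisteredFaceGate` are all hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

class RegisteredFaceGate:
    """Accept a mask detection result only for the face that was registered first."""

    def __init__(self, threshold=0.8):       # threshold is an assumed value
        self.threshold = threshold
        self.registered = None                # plays the role of registered face data 431
        self.last_result = False              # last accepted mask detection result

    def update(self, face_vector, mask_detected):
        if self.registered is None:
            self.registered = list(face_vector)   # first recognized face is registered
        if cosine_similarity(self.registered, face_vector) >= self.threshold:
            self.last_result = mask_detected
        # For a non-matching face the previous result is kept unchanged.
        return self.last_result
```

A face that does not match the registered data leaves the previous detection result in place, so a bystander entering the frame does not toggle the boost.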

Although the booster 95 described above is realized by a hardware circuit, it is also possible for the processor 30 to perform software processing according to a program stored in the storage 40 .

As described above, according to the call device and call voice processing method of the first embodiment, it is possible, in a call by telecommunication, to automatically detect whether the speaker is wearing a mask and to set the high-frequency boost accordingly. The difficulty of hearing during calls can thus be improved without anyone having to be concerned about whether the speaker is wearing a mask.

[Second embodiment]
A second embodiment will be described with reference to FIGS. 8 and 9. FIG. 8 is a configuration diagram of a call system according to the second embodiment.

As shown in FIG. 8, the call service server 8 is connected to the wide area network 7 in the second embodiment. The call service server 8 has a mask database 8a. A call is made between the caller 2 and the caller 2A, and the smartphone 1 (and likewise the smartphone 1A) uses the mask database 8a of the call service server 8.

The mask database 8a is created by the call service server 8 collecting information on masks from the wide area network 7. The mask information includes mask images, material data, and the like. Using the mask portion of the caller's face image captured by the camera of the smartphone 1 as a search key, mask data with a similar mask shape and texture is selected. Based on information such as the material of the selected mask, the amount of sound attenuation when passing through the mask is estimated.

The mask database 8a may be stored in the call service server 8 and referred to there. Alternatively, the mask database may be stored in the smartphone 1 (see mask database 432 in FIG. 2).

FIG. 9 is a flowchart showing a call voice processing method according to the second embodiment.

In the image analysis process S20 of the flowchart shown in FIG. 9, the mask detection unit 4220 determines whether the face image includes a mask and, if it does (S24: YES), refers to the mask database 432 or 8a to search for information such as the mask material (S25).

Boost amounts at multiple levels are associated with the respective mask materials. The search result, together with the no-mask case (S24: NO, no boost), becomes the boost control signal. If there is no matching material in the mask database 432 or 8a, or if the mask database 8a cannot be accessed, a default boost amount is used.
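The lookup with its default fallback can be sketched as follows. The material names and boost values are placeholders invented for illustration, not values from the source. The function `select_boost_db`, the `remote_lookup` callable standing in for a query to mask database 8a, and the default of 6 dB are likewise hypothetical.

```python
DEFAULT_BOOST_DB = 6.0   # assumed default boost amount

# Hypothetical local copy of the mask database (432): material -> boost amount in dB.
LOCAL_MASK_DB = {"nonwoven": 4.0, "cloth": 6.0, "urethane": 3.0}

def select_boost_db(mask_detected, material, remote_lookup=None, local_db=LOCAL_MASK_DB):
    """Return the boost amount for S25: 0 for no mask, a database value or the default otherwise."""
    if not mask_detected:                  # S24: NO -> no boost
        return 0.0
    if remote_lookup is not None:
        try:
            value = remote_lookup(material)    # stands in for a query to mask database 8a
            if value is not None:
                return value
        except OSError:                    # server unreachable: fall back to local data
            pass
    # No matching material, or the remote database was not accessible.
    return local_db.get(material, DEFAULT_BOOST_DB)
```

Trying the server first and falling back to the local copy is one possible ordering; the source only states that either database may be referenced and that a default applies when no match is found or the server cannot be reached.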

As described above, according to the call device and the call voice processing method of the second embodiment, the same effect as in the first embodiment is obtained, and in addition a boost amount corresponding to the mask material can be set, so that a call that is easier for the listener to hear can be made.

[Third Embodiment]
A third embodiment will be described with reference to FIGS. 10 to 12. In the third embodiment, a microphone is used as the caller state collection device that collects caller state information for detecting whether or not the caller is wearing a mask, and the caller state information is the call voice collected by the microphone. FIG. 10 is a flowchart showing a call voice processing method according to the third embodiment. Each step of the flowchart is executed by the processor 30 of the smartphone 1.

In the flowchart of FIG. 10, the same numbers (S10 to S17) are assigned to steps that perform the same processing as in the preceding flowcharts. S30 is the speech analysis process, consisting of S31-S37. The speech analysis process S30 may run as a separate process in parallel with the call steps (S15, S16).

The speech analysis process S30 starts immediately after the call starts (S15), that is, once the caller is connected to the other party by means of telecommunication. The mask detection unit 4220 acquires the voice during the call (S31) and extracts vowels such as "AIUEO" (S32). Next, it extracts consonants such as "katasafusu" (S33). The level ratio between vowels and consonants is obtained in a high-low level ratio (= low-frequency level / high-frequency level) calculation step (S34), and the class of the high-low level ratio is determined (S35).
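The ratio calculation of S34 can be sketched as below. The sketch assumes that the vowel and consonant segments have already been extracted (S32, S33) by a method the source does not describe. It measures the vowel level below 2 kHz and the consonant level in 2 kHz to 5 kHz with a naive DFT. The band limits and the function names are assumptions made for illustration.

```python
import cmath
import math

def band_level(frame, fs, f_min, f_max):
    """Square root of the summed DFT power over bins with f_min <= f < f_max."""
    n = len(frame)
    power = 0.0
    for k in range(1, n // 2):
        freq = k * fs / n
        if f_min <= freq < f_max:
            coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                        for i, x in enumerate(frame))
            power += abs(coeff) ** 2
    return math.sqrt(power)

def high_low_level_ratio(vowel_frame, consonant_frame, fs,
                         split_hz=2000.0, top_hz=5000.0):
    """Low-frequency (vowel) level divided by high-frequency (consonant) level."""
    low = band_level(vowel_frame, fs, 0.0, split_hz)
    high = band_level(consonant_frame, fs, split_hz, top_hz)
    return low / high if high > 0.0 else float("inf")
```

Because a mask attenuates the high band more than the low band, the ratio grows when a mask is worn, which is what the subsequent class determination relies on. Both frames should have the same length so that the two levels are comparable.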

FIG. 11 is an explanatory diagram showing classifying processing of speech analysis processing. In the classifying process, the high-low level ratio data are classified into several classes. The data in FIG. 11 in which the level ratio and the class are associated is called level ratio class data. In FIG. 11, the audio attenuation amount is further associated with the level ratio class data. Therefore, when the level ratio of the call voice is determined and the class to which it belongs is determined, the voice attenuation associated with that class is determined. Therefore, the step of determining the level ratio class corresponds to the speech attenuation amount estimation step.

In FIG. 11, the vertical axis is the high-low level ratio, and the horizontal axis is the frequency of occurrence. The figure plots how often each calculated high-low level ratio occurred. From the plotted data group, the boundaries and representative values of classes 0 to 3 are determined. Class 0 is the unmasked data with the smallest high-low level ratio, and class 3 is the class with the largest high-low level ratio. The ratios of the representative values of classes 1 to 3 to the representative value of class 0 are boost amounts 1 to 3, respectively. The classification may be based on an unsupervised learning process that clusters the data using machine learning. In that case, the data is classified into a plurality of classes corresponding to the material of the mask. The high-low level ratio varies with individual differences and with how tightly the mask is worn. Therefore, if the high-low level ratio itself were mapped directly to the amount of boost, the amount of boost could become unstable and the difficulty of hearing speech might not be improved. Classification reduces the effects of individual differences and of how tightly the mask is worn.
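The class determination and the resulting boost amount can be sketched as follows. The representative values are placeholders, and assigning a ratio to the nearest representative (which puts class boundaries at the midpoints) is an assumption. The source only states that boundaries and representative values are determined from the plotted data.

```python
# Hypothetical representative high-low level ratios for classes 0..3.
REPRESENTATIVES = [1.0, 1.5, 2.0, 3.0]

def classify(ratio, representatives=REPRESENTATIVES):
    """Return the class whose representative value is nearest to the ratio."""
    return min(range(len(representatives)),
               key=lambda c: abs(ratio - representatives[c]))

def boost_amount(cls, representatives=REPRESENTATIVES):
    """Boost amount as representative[cls] / representative[0]; 1.0 means no boost."""
    return representatives[cls] / representatives[0]
```

For example, a measured ratio of 1.7 falls in class 1 and yields a boost amount of 1.5 under these placeholder values. Small fluctuations of the measured ratio inside a class leave the boost amount unchanged, which is the stabilizing effect described above.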

The mask detection unit 4220 calculates the amount of boost according to the class and obtains the boost control signal. The boost control signal is sent to the boost control step of S16. A change in class may be reflected immediately in the boost control signal, or it may be reflected gradually over time so that frequent changes in the amount of boost are reduced.
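One way to reflect class changes over time is to accept a new class only after it has been observed several times in a row. This is an illustrative interpretation, not a mechanism stated in the source. The class name `ClassDebouncer` and the hold count of 3 are assumptions.

```python
class ClassDebouncer:
    """Switch to a new class only after it was observed hold_count times in a row."""

    def __init__(self, hold_count=3, initial_class=0):   # hold_count is an assumed value
        self.hold_count = hold_count
        self.current = initial_class
        self._candidate = initial_class
        self._streak = 0

    def update(self, observed_class):
        if observed_class == self.current:
            self._streak = 0                  # observation agrees; drop any pending change
        elif observed_class == self._candidate:
            self._streak += 1                 # same pending class seen again
        else:
            self._candidate = observed_class  # a different class starts a new streak
            self._streak = 1
        if self._streak >= self.hold_count:
            self.current = self._candidate
            self._streak = 0
        return self.current
```

Feeding the observed class of each analysis interval into `update` returns the class that should currently drive the boost control signal, so a single outlier interval does not change the boost.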

The mask detection unit 4220 updates the class data 434 in the data storage area 430 of FIG. 2. Voice is acquired not only during one call but across multiple calls. Across multiple calls, speech in several conditions, such as without a mask and with a mask, is acquired and used as class data. At this time, the registered voice data 433 may be referenced to determine whose voice it is, and the class data 434 may be managed for each speaker.

According to this embodiment, when the mask detection unit 4220 uses voice analysis processing to detect the presence or absence of a mask, it is not necessary to capture the caller's face with the in-camera 72. Therefore, automatic mask detection can be performed even when, for example, the caller is driving a car, the communication device is placed in a storage box, and the caller wears a headset with a microphone and earphones that is paired with the communication device by short-range wireless communication.

FIG. 12 is a flow chart showing the flow of processing that uses both image analysis processing and audio analysis processing for mask detection processing.

When the call voice processing method is activated (S11: YES or S12: YES), the image analysis process (S20) is executed, and the voice analysis process (S30) is executed during the call after the start of the call (S15). As a result, a mask detection process is achieved that satisfies both characteristics of immediacy of the image analysis process (S20) and responsiveness to changes due to the sound analysis process (S30).
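The combined use can be sketched as a generator that takes its initial value from the image analysis and then follows the voice analysis. The function name and the representation of results (a boolean from the image, level-ratio classes from the voice, `None` for intervals without an estimate) are assumptions made for illustration.

```python
def mask_state_stream(image_result, voice_classes):
    """Yield the mask state at call start and after each voice-analysis interval.

    image_result:  bool from the image analysis at call start (S20), or None
                   when no face image was available.
    voice_classes: iterable of level-ratio classes from the voice analysis (S30);
                   None means no estimate for that interval (for example, silence).
    """
    masked = bool(image_result)           # immediate initial value from the image
    yield masked
    for cls in voice_classes:
        if cls is not None:
            masked = cls > 0              # class 0 is the unmasked class
        yield masked
```

The first yielded value provides the immediacy of the image analysis, and later values follow changes such as removing the mask during the call.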

As described above, according to the third embodiment, using voice analysis processing makes mask detection possible even in situations where the in-camera 72 is not effective. It also becomes possible to respond to changes during a call, such as the caller putting on or removing a mask. Furthermore, combining it with image analysis processing provides both immediacy and responsiveness to changes.

[Fourth embodiment]
A fourth embodiment will be described with reference to FIGS. 13 to 15. In the fourth embodiment, a head-mounted display (also referred to as HMD) is used as the communication device. In the fourth embodiment, the microphone 107 provided in the HMD 100 is used as the caller state collection device, and the call voice collected by the microphone 107 is used as the caller state information.

FIG. 13 is an external view of the HMD 100, which is one form of a communication device. The HMD 100 comprises a frame housing including a left temple 130A, a right temple 130B, a front frame 130C, and nose pads 130D. The front frame 130C includes a camera 101, a distance measuring sensor 102, a left projector 104A, a right projector 104B, and a screen 104C. The left temple 130A is provided with a left speaker 106A and a microphone 107. A controller 103 is provided on the right temple 130B. A caller wears the HMD 100 on his or her face using the frame housing. It should be noted that the placement of each component may differ from that shown in FIG. 13.

The left projector 104A, right projector 104B, and screen 104C form an image display unit, but a transflective screen or a non-transmissive display may be used instead. With the transflective screen, the user sees the background in front directly through the screen; with the non-transmissive display, the user checks the background in front by means of a camera image of it shown on the display.

The camera 101 is mounted so as to capture an image of the background in front of the user's line of sight, and the distance measuring sensor 102 measures the distance to that background.

FIG. 14 is a hardware configuration diagram of the HMD 100. The HMD 100 includes a camera 101, a distance measuring sensor 102, a left projector 104A, a right projector 104B, a transflective screen 104C, operation keys 105, a left speaker 106A, a right speaker 106B, a microphone 107, and, built into the controller 103, a storage 140, a GPS receiver 151, a geomagnetic sensor 152, an acceleration sensor 153, a gyro sensor 154, a LAN communication device 161, a mobile communication device 162, and a short-range wireless communication device 163.

The storage 140 includes a ROM 141, a nonvolatile memory 142, and a RAM 143.

The controller 103 takes in the camera image captured by the camera 101 and the distance image measured by the distance measuring sensor 102 and supplies them to the storage 140 and the processor 113 inside the controller 103. The controller 103 also incorporates the GPS receiver 151, a sensor group such as the gyro sensor 154, and a communication unit. In addition, the controller 103 generates the images to be projected by the left projector 104A and the right projector 104B and the sounds to be output to the left speaker 106A and the right speaker 106B.

FIG. 15 is a block diagram showing programs and data stored in the nonvolatile memory 142 of the HMD 100. The nonvolatile memory 142 includes a basic operation program 521, an HMD program 522, a call program 523, and a data storage area 524. The processor 113 loads the basic operation program 521, the HMD program 522, and the call program 523 into the RAM 143 and executes them. The data storage area 524 stores data necessary for executing the basic operation program 521, the HMD program 522, and the call program 523. The call program 523 includes a mask detection unit 4220, a boost control unit 4221, and a call processing unit 4222, as in the first embodiment.

The processor 113 executes the function of the mask detection unit 4220 when the call processing unit 4222 detects a signal indicating that the call application icon 22 has been pressed or an incoming call signal. The mask detection unit 4220 analyzes the voice from the microphone 107 and outputs the mask detection result to the boost control unit 4221. The boost control unit 4221 outputs a boost control signal to the booster 95 according to the result of detecting the presence or absence of a mask. The booster 95 performs high-frequency boost processing on the call voice acquired from the microphone 107, and the boosted call voice is output from the call processing unit 4222 to the communicator and transmitted.

FIG. 16 is a diagram for explaining a call scene using the HMD 100. The caller 2 wears the HMD 100 and captures the background in front with the camera 101 of the HMD 100. The distance measuring sensor 102 also measures the distance to the background in front. These data are sent to the HMD management server 9 connected to the wide area network 7.

The manager 191 shares these data using the information communication device 192. If the background in front is, for example, a facility maintenance site, the manager 191 gives instructions to the caller 2 by voice and image.

The image is a three-dimensional AR (Augmented Reality) object and is displayed three-dimensionally, superimposed so that it appears to be in contact with the equipment in the background in front of the HMD 100.

The caller 2 responds to the instructions of the manager 191. One form of response is voice. Questions about the instructions are also asked by voice. The voice of the caller 2 is carried over a wireless LAN by voice communication using the IP protocol. This voice communication is kept connected at all times, without a calling procedure such as dialing. As long as the connection exists, the call program runs, and mask detection processing and boost control according to its result are executed. Voice analysis processing can of course be applied as the mask detection processing, but image analysis processing may also be applied by adding to the HMD 100 an in-camera that captures the face of the caller. Alternatively, the HMD 100 may be paired with a device such as a smartphone and use the camera of that device.

As described above, according to the fourth embodiment, an HMD can be applied as a communication device.

The present invention is not limited to the embodiments described with reference to FIGS. 1 to 16 above, and part of the configuration of one embodiment can be replaced with that of another embodiment. It is also possible to add the configuration of another embodiment to the configuration of one embodiment. All of these fall within the scope of the present invention. Numerical values, messages, and the like appearing in the text and drawings are only examples, and the effect of the present invention is not impaired even if different ones are used.

In addition, communication information terminals are used not only by the owner of the terminal but also by acquaintances. In that case, the caller is identified by face recognition or voice recognition during the call, class data to be applied to each individual is prepared for multiple people, and the optimal voice correction is performed for the identified caller, thereby realizing a further improved call.
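Keeping class data per caller can be illustrated with the following Python sketch. The `ClassData` structure, the caller identifiers, and all numeric values are hypothetical. The sketch assumes that the caller has already been identified by face or voice recognition, and that the boost amount is the representative value of the matched class divided by that of class 0.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class ClassData:
    boundaries: list        # boundaries between level-ratio classes
    representatives: list   # one representative level ratio per class

    def boost_for(self, level_ratio):
        cls = bisect_right(self.boundaries, level_ratio)
        # Assumed rule: boost = class representative / class 0 representative.
        return self.representatives[cls] / self.representatives[0]


# Hypothetical default data, used when the caller is not registered.
DEFAULT = ClassData([1.5, 2.5, 3.5], [1.0, 2.0, 3.0, 4.0])
# Hypothetical per-caller data learned over that caller's previous calls.
PER_CALLER = {"owner": ClassData([2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5])}


def boost_for_caller(caller_id, level_ratio):
    """Pick the class data for the identified caller, falling back to the default."""
    return PER_CALLER.get(caller_id, DEFAULT).boost_for(level_ratio)


if __name__ == "__main__":
    print(boost_for_caller("owner", 3.2))
    print(boost_for_caller("guest", 3.2))
```

The same measured level ratio can therefore yield different boost amounts for different callers, which is the point of preparing class data for multiple people.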

Also, in the above description, the smartphone 1 and the HMD 100 are used as communication devices, but devices to which the present invention can be applied are not limited to the smartphone 1 and the HMD 100.

For example, the call voice processing method may be applied to a web conference system in which a camera and microphone are connected to a personal computer. In this case, the call program 422 is installed in the personal computer in advance. Then, the caller positioned in front of the microphone is imaged by a camera, whether or not the caller is wearing a mask is detected based on the captured image data, and boost control similar to that described above is executed according to the detection result.

Also, when the caller participates in a web conference with the camera turned off, the processor of the personal computer may perform audio analysis processing on the audio data collected by the microphone to detect whether or not a mask is worn, and boost control may be performed according to the result. In a web conference, participants can take part with the camera turned on or off as they wish, so mask detection may be executed by switching between image analysis processing and audio analysis processing accordingly.

In this way, the present invention can be applied to any device in which at least one of a camera and a microphone is connected to a controller that includes a processor with a call function, a storage that stores the call program 422, and a booster.

In addition, in the smartphone 1, the in-camera 72 is used as the camera for imaging the caller, but in the web conference system, the camera for imaging the caller may be a web camera that is detachably connected to the personal computer or a built-in camera formed integrally with the personal computer. The form of the camera that captures the image of the caller is selected as appropriate for the device to which the present invention is applied.

In addition, some or all of the functions, etc. of the invention may be implemented in hardware, for example, by designing them in an integrated circuit. It may also be implemented in software by a microprocessor unit, CPU, etc. interpreting and executing an operating program. Moreover, the implementation range of software is not limited, and hardware and software may be used together.

1, 1A: Smartphone
2, 2A: Caller
3: Mask
4, 4A: Mobile communication base station
5: Mobile communication network
6, 6A: Access point
7: Wide area network
8: Call service server
8a: Mask database
9: HMD management server
20: Home screen
20a: Dial screen
20b: Display screen
21: Icon button group
22: Call application icon
23: Numeric keypad
24: Dial start button
25: Call acceptance button
26: Call rejection button
27: Information
30: Processor
31: System bus
40: Storage
41: ROM
42: Non-volatile memory
43: RAM
51: GPS receiver
52: Geomagnetic sensor
53: Acceleration sensor
54: Gyro sensor
60, 60A: Wireless LAN signal
61: LAN communication device
62: Mobile communication device
63: Short-range wireless communication device
71: Display
72: In-camera
73: Out-camera
81: Microphone
82: Speaker
83: Call port
91: Touch sensor
92: Operation key
95: Booster
100: HMD
101: Camera
102: Ranging sensor
103: Controller
104A: Left projector
104B: Right projector
104C: Screen
105: Operation key
106A: Left speaker
106B: Right speaker
107: Microphone
108: Bus
113: Processor
130A: Left temple
130B: Right temple
130C: Front frame
130D: Nose pad
140: Storage
141: ROM
142: Non-volatile memory
143: RAM
151: GPS receiver
152: Geomagnetic sensor
153: Acceleration sensor
154: Gyro sensor
161: LAN communication device
162: Mobile communication device
163: Near field communication device
191: Manager
192: Information communication device
421: Basic operation program
422: Call program
430: Data storage area
431: Registered face data
432: Mask database
433: Registered voice data
434: Class data
521: Basic operation program
522: HMD program
523: Call program
524: Data storage area
4220: Mask detector
4221: Boost control unit
4222: Call processing unit

Claims

A communication device using telecommunication,
The communication device includes a microphone, a communicator that transmits and receives call voice, a booster that boosts the high-frequency level in the call voice band, a caller state collection device that collects caller state information for detecting whether or not the caller is wearing a mask, and a processor coupled to each of the microphone, the communicator, the booster, and the caller state collection device;
The processor
a mask detection step of analyzing the caller state information and detecting whether or not the caller is wearing a mask;
a boost control step of outputting to the booster a boost control signal for controlling a high-frequency boost of the call voice collected by the microphone, according to the detection result of whether or not the caller is wearing a mask;
a transmitting step of transmitting, from the communicator, the call voice after the boost control processing by the booster;
A communication device characterized by executing
The communication device according to claim 1,
The caller state collection device is a camera, the caller state information is imaging data captured by the camera,
The processor
In the mask detection step, performing face recognition processing on the imaging data to recognize a face image of the caller in which the face of the caller is captured, and detecting whether or not the mask is worn based on whether or not a mask image, in which the mask is captured, can be detected in the face image of the caller,
A communication device characterized by:
The communication device according to claim 2,
The processor
Referencing a mask database that associates the type of mask with the amount of sound attenuation when wearing the mask, and further performing a sound attenuation amount estimation step of estimating the sound attenuation amount corresponding to the mask image,
In the boost control step, determining a boost amount based on the estimated audio attenuation amount and outputting the boost control signal;
A communication device characterized by:
The communication device according to claim 2,
The processor
Starting the mask detection processing in the mask detection step when a call request is transmitted from the communication device, or when a call request from another communication device is received and the communication device starts responding,
A communication device characterized by:
The communication device according to claim 4,
The processor
In the mask detection step, registering the face image of the caller at the start of the mask detection processing; generating new imaging data with the camera while the call continues; performing face recognition processing on the new imaging data; when a new face image is recognized, determining whether or not the new face image matches the registered face image of the caller; if they match, re-detecting whether or not the mask is worn based on the new face image; and if they do not match, continuing to use the existing detection result of whether or not the mask is worn based on the face image of the caller,
A communication device characterized by:
The communication device according to claim 1,
The caller state collection device is the microphone, the caller state information is call audio collected by the microphone,
The processor
In the mask detection step, detecting the low-frequency level and the high-frequency level in the call voice band of the call voice collected by the microphone, calculating a level ratio between the low-frequency level and the high-frequency level, and detecting whether or not the mask is worn based on the level ratio,
A communication device characterized by:
The communication device according to claim 6,
The processor
In the mask detection step, referring to level ratio class data obtained by classifying level ratios in advance, determining a class to which the level ratio of the call voice collected by the microphone belongs,
In the boost control step, determining a boost amount according to the determined class and outputting the boost control signal;
A communication device characterized by:
The communication device according to claim 2,
further using the microphone as the caller state collection device;
further using call voice collected by the microphone as the caller state information;
The processor
In the mask detection step, when the camera is activated, performing face recognition processing on the imaging data to recognize a face image in which the face of the caller is captured, and detecting whether or not the mask is worn based on whether or not a mask image, in which the mask is captured, can be detected in the face image; and when the camera is activated, detecting the low-frequency level and the high-frequency level of the call voice collected by the microphone, calculating the level ratio between the low-frequency level and the high-frequency level, and detecting whether or not the mask is worn based on the level ratio;
A communication device characterized by:
The communication device according to claim 1,
The communication device transmits and receives call data including the call voice via an IP network.
A communication device characterized by:
A call voice processing method executed by a processor provided in a communication device using telecommunication,
The communication device includes a microphone, a communicator that transmits and receives call voice, a booster that boosts the high-frequency level in the call voice band, a caller state collection device that collects caller state information for detecting whether or not the caller is wearing a mask, and a processor coupled to each of the microphone, the communicator, the booster, and the caller state collection device;
the processor
executes a step of analyzing the caller state information to detect whether or not the caller is wearing a mask, outputting to the booster a boost control signal for controlling a high-frequency boost of the call voice collected by the microphone according to the result of detecting whether or not the mask is worn, and causing the communicator to transmit the call voice after the boost control processing by the booster,
A call voice processing method characterized by: