J Am Acad Audiol 2020; 31(04): 271-276
DOI: 10.3766/jaaa.19018
Research Article
Thieme Medical Publishers 333 Seventh Avenue, New York, NY 10001, USA.

Spatial Release from Masking Using Clinical Corpora: Sentence Recognition in a Colocated or Spatially Separated Speech Masker

Grant King (1), Nicole E. Corbin (2), Lori J. Leibold (3), Emily Buss (1)

1  Department of Otolaryngology/Head and Neck Surgery, University of North Carolina at Chapel Hill, School of Medicine, Chapel Hill, NC
2  Division of Speech and Hearing Sciences, Department of Allied Health Sciences, University of North Carolina at Chapel Hill, School of Medicine, Chapel Hill, NC
3  Center for Hearing Research, Boys Town National Research Hospital, Omaha, NE
Funding This work was supported by the National Institute on Deafness and Other Communication Disorders (NIDCD, R01 DC000397). [Figure 1] was based in part on a graphic developed by Kellie Halloran.

Address for correspondence

Emily Buss
Department of Otolaryngology/Head and Neck Surgery, The University of North Carolina at Chapel Hill
Chapel Hill, NC 27599

Publication History

Publication Date:
15 April 2020 (online)

 

Abstract

Background Speech recognition in complex multisource environments is challenging, particularly for listeners with hearing loss. One source of difficulty is the reduced ability of listeners with hearing loss to benefit from spatial separation of the target and masker, an effect called spatial release from masking (SRM). Despite the prevalence of complex multisource environments in everyday life, SRM is not routinely evaluated in the audiology clinic.

Purpose The purpose of this study was to demonstrate the feasibility of assessing SRM in adults using widely available tests of speech-in-speech recognition that can be conducted using standard clinical equipment.

Research Design Participants were 22 young adults with normal hearing. The task was masked sentence recognition, using each of five clinically available corpora with speech maskers. The target always sounded like it originated from directly in front of the listener, and the masker either sounded like it originated from the front (colocated with the target) or from the side (separated from the target). In the real spatial manipulation conditions, source location was manipulated by routing the target and masker to either a single speaker or to two speakers: one directly in front of the participant, and one mounted in an adjacent corner, 90° to the right. In the perceived spatial separation conditions, the target and masker were presented from both speakers with delays that made them sound as if they were either colocated or separated.

Results With real spatial manipulations, the mean SRM ranged from 7.1 to 11.4 dB, depending on the speech corpus. With perceived spatial manipulations, the mean SRM ranged from 1.8 to 3.1 dB. Whereas real separation improves the signal-to-noise ratio in the ear contralateral to the masker, SRM in the perceived spatial separation conditions is based solely on interaural timing cues.

Conclusions The finding of robust SRM with widely available speech corpora supports the feasibility of measuring this important aspect of hearing in the audiology clinic. The finding of a small but significant SRM in the perceived spatial separation conditions suggests that modified materials could be used to evaluate the use of interaural timing cues specifically.



Introduction

One of the most common complaints among adults with hearing loss is difficulty understanding speech in the presence of competing sounds (Takahashi et al, 2007[22]). Speech recognition in a speech masker is more predictive of real-world listening abilities than speech recognition in noise or in quiet (Hillock-Dunn et al, 2015[13]), particularly when the target and masker are spatially separated (Vannson et al, 2015;[23] Phatak et al, 2019[17]). When the target and masker are spatially separated, binaural cues can improve speech recognition. Although the ability of listeners with hearing impairments to use binaural cues is highly variable (Gallun et al, 2013[9]), there is currently no standard clinical assessment of the effect of binaural hearing on speech perception in complex listening environments. Several groups have developed tests for evaluating these abilities over headphones (Cameron and Dillon, 2007;[7] Jakien et al, 2017[14]), but these tools are not designed for clinical testing in the free field, and as such are not appropriate for evaluating performance in hearing aid and cochlear implant users. The present study evaluated spatial release from masking (SRM) among five widely available speech perception tests with speech maskers to assess whether they could be used in the clinical assessment of binaural masked speech perception in the free field.

Masked speech recognition is often characterized in terms of a combination of informational masking and energetic masking (Brungart et al, 2001[5]). Energetic masking occurs when the masker interferes with peripheral encoding of the target sound. Steady-state maskers, such as speech-shaped noise, are often described as exerting primarily energetic masking. Informational masking, on the other hand, occurs even when target speech is well represented in the peripheral auditory system, but the listener is unable to recognize that target because of a failure to segregate it from the background (Bregman, 1990;[4] Brungart et al, 2001[5]). Maskers composed of a small number of perceptually similar (e.g., same gender) talkers are often described as exerting substantial informational masking (Freyman et al, 2001;[8] Rosen et al, 2013[18]).

Another factor influencing masked speech recognition is spatial separation between the target and masker. In listeners with normal hearing, spatially separating the target and the masker on a horizontal plane improves performance, particularly when the masker is composed of speech (Freyman et al, 2001;[8] Gallun et al, 2005[10]). This benefit is attributed to two factors. First, binaural difference cues, including interaural time and level differences, support auditory stream segregation by introducing perceptual differences between target and masker stimuli. Second, the presence of the listener's head in the sound field introduces differences in the signal-to-noise ratio (SNR) across ears, a phenomenon often described as the head shadow effect. Under these conditions, listeners are thought to rely solely on information presented to the ear with the better SNR. As a result, the head shadow effect is described as reflecting monaural rather than binaural hearing abilities. The benefit associated with spatially separating the target and masker is referred to as SRM, and it is quantified as the improvement in performance when the masker is moved from the target location to another point on the horizontal plane.

Although SRM is often measured by changing the physical location of the masker source relative to the target, it is possible to change the perceptual location of the masker source while playing both the target and the masker from a pair of speakers. This is accomplished by introducing small timing differences between the stimuli delivered to the two speakers. For delays on the order of 4 msec, the sound source is perceived to be the speaker associated with the earlier arriving stimulus. This phenomenon, known as the precedence effect, allows us to measure the SRM associated with purely interaural temporal cues, in the absence of the head shadow effect (Freyman et al, 2001[8]). Given that the head shadow effect relies on monaural cues, measurement of SRM using the precedence effect (perceived spatial separation) may provide a better indication of a listener's binaural hearing than real spatial separation.

Susceptibility to masking and the ability to benefit from spatial separation appear to be important indicators of performance in real-world listening. For example, adults report greater difficulty recognizing speech in the context of competing speech than speech in noise (Agus et al, 2009[1]). Unilateral hearing loss exacerbates this effect (Vannson et al, 2015[23]), presumably because of the elimination of binaural cues. Despite the ecological importance of SRM, tests to quantify it are not included in the standard audiological test battery. Thus, the present study was carried out to answer two questions. First, can currently available materials for assessing sentence recognition in a speech masker be used to assess SRM in the free field? Second, how do measures of SRM differ across commercially available tests of speech-in-speech recognition? The contributions of monaural and binaural cues were evaluated by manipulating the target and masker location in two ways: by routing the target and masker to one of two speakers and via the precedence effect.



Methods

Participants were 22 native American English-speaking adults (20–35 years) with no known history of ear disease. All had normal hearing thresholds (≤20 dB HL) measured at octave frequencies 250–8000 Hz (ANSI, 2010[2]), bilaterally. Participants provided informed consent, and this research was approved by the Institutional Review Board overseeing biomedical research at the University of North Carolina at Chapel Hill.

Stimuli were taken from five commercially available sentence recognition tests: the quick speech-in-noise test (QuickSIN) (Auditec, St. Louis, MO; Killion et al, 2004[15]), speech-in-noise test using sentences developed by Bamford (BKB-SIN) (Etymotic Research, Elk Grove Village, IL; Bench et al, 1979[3]), adult speech recognition test developed at Arizona State University (AzBio) (Spahr et al, 2012[21]), pediatric speech recognition test developed at Arizona State University (Pediatric AzBio) (Spahr et al, 2014[20]), and Perceptually Robust English Sentence Test Open-set (PRESTO) (Gilbert et al, 2013[12]). These tests were chosen because they use speech maskers and were deemed likely to cause some degree of informational masking. The stimuli and typical procedures for evaluating speech recognition using these stimuli are described in [Table 1].

Table 1

Names, Stimulus Features, and Procedures Typically Used for Administering the Sentence Corpora Used in the Present Study

| Corpus | Stimuli | Test Procedure |
|---|---|---|
| QuickSIN | Target: IEEE* sentences spoken by a female talker. Masker: four-talker babble | Recordings use a fixed target level, and masker level is varied in steps of 5 dB to produce a range of SNRs (25 to 0 dB SNR). Results are typically reported as the SNR associated with 50% keywords correct. Each list contains 6 sentences and a total of 30 keywords. |
| BKB-SIN | Target: BKB† sentences, spoken by a female talker. Masker: four-talker babble | Recordings use a fixed target level, and masker level is varied in steps of 3 dB to produce a range of SNRs (21 to 0 dB SNR). Results are typically reported as the SNR associated with 50% keywords correct. Each list contains 2 sets of 8 sentences, with a total of 62 keywords. |
| AzBio | Target: sentences of 3–12 words, spoken by four talkers (2F, 2M). Masker: ten-talker babble | Performance is typically assessed at a fixed SNR. All words are scored. Each list contains 20 sentences and a mean of 142 words. |
| Pediatric AzBio | Target: sentences of 3–12 words, spoken by a female talker. Masker: 20-talker babble | Performance is typically assessed at a fixed SNR. All words are scored. Each list contains 20 sentences and a mean of 138 words. |
| PRESTO | Target: TIMIT‡ sentences produced by talkers differing in age, gender, and dialect. Masker: six-talker babble | Performance is typically assessed at a fixed SNR. Keywords are scored. Each list contains 18 sentences and a total of 76 keywords. |

* Rothauser et al (1969).
† Bench et al (1979).
‡ Garofolo et al (1993).


Stimuli were played from a CD, with the target in one channel and the masker in the other, and routed to a real-time processor (RP2; Tucker-Davis Technologies, Alachua, FL) running a custom circuit (RPvds; Tucker-Davis Technologies) at 24414 Hz. This circuit scaled the masker as required to produce the desired SNR. In the real spatial manipulation conditions, the circuit routed the target and masker to the appropriate speaker(s). In the colocated condition, both stimuli were presented from the front speaker; in the spatial separation condition, the target was presented from the front speaker and the masker was presented from the right speaker (at 90°). In the perceived spatial manipulation condition, both the target and masker were routed to both speakers, but the circuit applied a 4-msec delay to the target in one speaker and the masker in one speaker. In the colocated condition, the undelayed streams of the target and masker were presented from the front (delayed streams from the right); in the spatial separation condition, the undelayed target stream was presented from the front (delayed target from the right), and the undelayed masker stream was presented from the right (delayed masker from the front). Stimuli were amplified (Crown D-75, Los Angeles, CA) and played out from two loudspeakers (1700-2002; Grason-Stadler, Eden Prairie, MN) mounted in adjacent corners of a sound booth (6′ × 6′3″). As illustrated in [Figure 1], participants were seated approximately 3′7″ from the grill of each speaker, with one positioned directly in front (0°) and the other at the right (90°).

Fig. 1 Diagram of the test environment. Participants sat equidistant from the two speakers, facing the speaker in the left corner.
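The routing used for the perceived spatial manipulation can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the RPvds circuit used in the study: the function name `perceived_routing`, its arguments, and the use of plain Python lists for audio samples are all hypothetical, and the 4-msec delay is rounded to a whole number of samples at the 24414-Hz rate given above.

```python
def perceived_routing(target, masker, separated, fs_hz=24414, delay_ms=4.0):
    """Return (front, right) speaker feeds for the perceived spatial manipulation.

    target, masker: equal-length sequences of samples (hypothetical inputs).
    separated: False for colocated, True for perceived separation.
    """
    if len(target) != len(masker):
        raise ValueError("target and masker must have the same length")

    # Delay in whole samples (98 samples for 4 msec at 24414 Hz).
    n = round(fs_hz * delay_ms / 1000.0)

    def lead(x):
        # Undelayed stream, zero-padded at the end to keep lengths equal.
        return list(x) + [0.0] * n

    def lag(x):
        # Delayed stream, zero-padded at the start.
        return [0.0] * n + list(x)

    # The undelayed target always comes from the front speaker.
    front_t, right_t = lead(target), lag(target)

    if separated:
        # Undelayed masker from the right, delayed masker from the front.
        front_m, right_m = lag(masker), lead(masker)
    else:
        # Undelayed masker from the front, delayed masker from the right.
        front_m, right_m = lead(masker), lag(masker)

    front = [t + m for t, m in zip(front_t, front_m)]
    right = [t + m for t, m in zip(right_t, right_m)]
    return front, right


# Usage: unit impulses make the lead/lag relationship easy to inspect.
target = [1.0] + [0.0] * 9
masker = [2.0] + [0.0] * 9
front, right = perceived_routing(target, masker, separated=True)
```

In this sketch both speakers always carry both stimuli, so the long-term level at each speaker is the same in the colocated and separated cases; only the lead/lag relationship of the masker changes.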

The target and masker for each test were calibrated separately using noise samples matched to the long-term spectral shape of the test stimuli. Calibration in the sound field was performed using a microphone placed at the location of the listener's head and the sound level meter set to A weighting (800B; Larson Davis, Provo, UT). Target sentences were presented at 50 dBA.

Procedure

Participants were divided into two groups of 11 each: one group completed testing with real spatial manipulations, and the other completed testing with perceived spatial manipulations. Corpora were presented in random order. Within each corpus, the selection and order of sentence lists were randomized for each participant. Participants were instructed to listen to the target sentences and repeat them back, guessing when necessary. A lapel microphone affixed to the participant's collar picked up their responses and routed them to a headset worn by a tester seated outside the booth. The tester scored responses in real time. Participants completed this study in one two-hour visit with a short break.

Three of the corpora were designed for presentation at a fixed SNR (AzBio, Pediatric AzBio, and PRESTO), and two were designed for presentation at a series of descending SNRs (QuickSIN and BKB-SIN). Participants heard three practice sentences at a clearly suprathreshold level before data collection for each corpus and spatial condition. For both types of stimuli, data collection entailed measuring performance for SNRs associated with 30–70% words correct, with SNRs adjusted for individual participants as necessary to achieve the desired distribution of scores; the selection of SNRs was guided by pilot data and results of previous participants. For the fixed SNR tests, at least three lists were completed for each condition (colocated and spatially separated). Testing continued until scores included at least one value between 20% and 50%, and at least one value between 50% and 80%. Three lists contain a mean of 426 words for AzBio, a mean of 414 words for Pediatric AzBio, and 228 words for PRESTO. For the descending SNR tests, participants heard four sentences at each SNR in each condition; this was achieved by running four lists for QuickSIN (120 keywords) and two lists for the BKB-SIN (124 keywords) for each condition. Testing continued until at least one data point was ≤30% and one was >70%. The colocated condition was tested before the separated condition for every corpus.



Analysis

Word-level data were fitted with a logit function by minimizing the sum of squared error. The mean of the function represents the speech reception threshold (SRT) associated with 50% correct. Data were analyzed using a repeated measures analysis of variance (rmANOVA) to determine whether there were significant differences in the SRT between the tests in each condition; corpus and spatial position were within-subject variables, and the spatial manipulation (real versus perceived) was a between-subject variable. Greenhouse-Geisser corrections were applied as indicated. The SRM, computed as the difference in SRTs between colocated and separated conditions, was also evaluated.
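As an illustration of this analysis, the sketch below fits a psychometric function to proportion-correct scores by minimizing the sum of squared error, then takes the SRM as the difference between the colocated and separated SRTs. Several details are assumptions rather than the procedure used here: the function is taken to be a two-parameter logistic with asymptotes at 0 and 1, a coarse-to-fine grid search stands in for whatever optimizer was actually used, and all function names are hypothetical.

```python
import math


def logistic(snr_db, midpoint, scale):
    """Two-parameter logistic; midpoint is the SNR giving 50% correct."""
    z = (snr_db - midpoint) / scale
    z = max(-50.0, min(50.0, z))  # clamp to avoid overflow in exp()
    return 1.0 / (1.0 + math.exp(-z))


def sum_squared_error(midpoint, scale, snrs, props):
    return sum((logistic(x, midpoint, scale) - p) ** 2
               for x, p in zip(snrs, props))


def fit_srt(snrs, props, passes=6, steps=40):
    """Return (srt, scale) minimizing squared error via a shrinking grid."""
    m_lo, m_hi = min(snrs) - 10.0, max(snrs) + 10.0
    s_lo, s_hi = 0.1, 10.0
    best = (m_lo, s_lo)
    for _ in range(passes):
        ms = [m_lo + i * (m_hi - m_lo) / steps for i in range(steps + 1)]
        ss = [s_lo + i * (s_hi - s_lo) / steps for i in range(steps + 1)]
        best = min(((m, s) for m in ms for s in ss),
                   key=lambda p: sum_squared_error(p[0], p[1], snrs, props))
        # Shrink the search window around the current best estimate.
        dm, ds = (m_hi - m_lo) / 10.0, (s_hi - s_lo) / 10.0
        m_lo, m_hi = best[0] - dm, best[0] + dm
        s_lo, s_hi = max(0.05, best[1] - ds), best[1] + ds
    return best


def spatial_release(srt_colocated, srt_separated):
    """SRM in dB: positive values mean separation lowered the SRT."""
    return srt_colocated - srt_separated


# Usage with made-up scores (not data from this study):
snrs_db = [-9.0, -6.0, -3.0, 0.0, 3.0]
scores = [0.05, 0.20, 0.52, 0.85, 0.97]
srt, scale = fit_srt(snrs_db, scores)
```

A grid search is slow compared with a gradient-based optimizer, but it needs no external libraries and makes the squared-error criterion explicit.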



Results

Logit functions characterized the data well, with an average r² value of 0.96 in both the real and perceived separation conditions, and a range from 0.53 to >0.99 across all participants and stimulus conditions. Mean r² values across corpora ranged from 0.92 (PRESTO) to 0.99 (QuickSIN). For the corpora administered at a fixed SNR (AzBio, Pediatric AzBio, and PRESTO), the criterion performance distribution was achieved with three lists for most listeners in most conditions; only 5% of cases required a fourth or fifth list. Because of experimenter error, one listener heard only two lists in one condition. For the corpora administered at descending SNRs, the criterion performance distribution was achieved with four lists for QuickSIN and two lists for the BKB-SIN in all cases.

[Figure 2] shows distributions of SRTs in dB SNR for the colocated and spatially separated conditions of each speech corpus; lower SRTs represent better performance. Data collected using the real spatial manipulation are shown in the left panel, and those for perceived spatial manipulation are shown in the right panel. The order of conditions on the abscissa was determined by the mean SRT for participants tested using real spatial manipulation. When the target and masker were colocated, SRTs were similar for the real and perceived spatial manipulations. In both cases, mean SRTs ranged from approximately −5 dB SNR (BKB-SIN) to 1 dB SNR (PRESTO). When the target and masker were spatially separated, SRTs improved, but this difference was larger for the real than the perceived spatial manipulation conditions. There were also consistent differences in the SRT across corpora that were evident for both the colocated and separated conditions.

Fig. 2 Distribution of SRTs in dB SNR. Circles indicate data for individual participants in the colocated and separated conditions (open and filled, respectively). Results are shown for the real spatial manipulation (left) and the perceived spatial manipulation (right). Horizontal lines indicate the median, boxes span the 25th to 75th percentiles, and vertical lines span the 10th to 90th percentiles.

These observations were confirmed using an rmANOVA with two levels of spatial manipulation (real and perceived), five levels of corpus (each of the five corpora), and two levels of spatial position of the masker relative to the target (colocated and separated). As indicated in [Table 2], all main effects and interactions were significant (p ≤ 0.006), including the three-way interaction. Simple main effects testing was carried out to compare SRTs for the colocated condition with the real and perceived spatial manipulations; the only corpus associated with a significant effect of spatial manipulation for the colocated condition was PRESTO, where scores were 1.0 dB higher for the perceived than the real spatial manipulation (p = 0.021). For the separated conditions, scores were clearly lower for the real than the perceived spatial manipulation, with difference scores ranging from 5.6 dB (PRESTO) to 8.9 dB (QuickSIN). A pair of follow-up rmANOVAs, with data for real and perceived spatial separation analyzed separately, confirmed that the interaction between corpus and spatial position of the masker was significant for both the real and perceived spatial manipulation (p ≤ 0.045). Simple main effects testing confirmed that the difference between SRTs in the colocated and separated conditions (the SRM) was significantly greater than zero for all five corpora for both real and perceived spatial manipulation (p < 0.001). It also showed that SRTs differed significantly across corpora within condition with the following exceptions: SRTs did not differ between BKB-SIN and Pediatric AzBio corpora for real target and masker separation (p = 0.164), or for QuickSIN and AzBio (p ≥ 0.067) in any condition.

Table 2

Results of an rmANOVA, with SRT as the Dependent Variable and Independent Variables of Manipulation (Real and Perceived), Corpus (BKB-SIN, Pediatric AzBio, QuickSIN, AzBio, and PRESTO), and Position (Colocated and Separated)

| Effect | F | df | p | ηp² |
|---|---|---|---|---|
| Manipulation | 63.16 | 1,20 | <0.001 | 0.76 |
| Corpus | 413.37 | 2.8,55.4 | <0.001 | 0.95 |
| Position | 309.04 | 1,20 | <0.001 | 0.94 |
| Manipulation:corpus | 4.72 | 2.8,55.4 | 0.006 | 0.19 |
| Manipulation:position | 112.73 | 1,20 | <0.001 | 0.85 |
| Corpus:position | 15.14 | 2.7,54.6 | <0.001 | 0.43 |
| Manipulation:corpus:position | 8.78 | 2.7,54.6 | <0.001 | 0.31 |
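The final column of [Table 2] carries no legible label in this copy, but its values are consistent with partial eta squared, which can be recovered from each F ratio and its degrees of freedom as F × df_effect / (F × df_effect + df_error). The short check below is a sketch under that assumption; the function name and the row list are illustrative only, with the numbers copied from the table.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared implied by an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# (effect, F, df_effect, df_error, tabulated effect size) from Table 2.
TABLE2_ROWS = [
    ("Manipulation", 63.16, 1, 20, 0.76),
    ("Corpus", 413.37, 2.8, 55.4, 0.95),
    ("Position", 309.04, 1, 20, 0.94),
    ("Manipulation:corpus", 4.72, 2.8, 55.4, 0.19),
    ("Manipulation:position", 112.73, 1, 20, 0.85),
    ("Corpus:position", 15.14, 2.7, 54.6, 0.43),
    ("Manipulation:corpus:position", 8.78, 2.7, 54.6, 0.31),
]

for name, f_value, df1, df2, reported in TABLE2_ROWS:
    implied = partial_eta_squared(f_value, df1, df2)
    print(f"{name}: implied {implied:.3f}, tabulated {reported:.2f}")
```

Because the Greenhouse-Geisser degrees of freedom are reported to one decimal place, the implied values can differ from the tabulated ones by up to about 0.01.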

The significant interaction between corpus and spatial position of the masker for the two spatial manipulations is illustrated in [Figure 3]. The distributions of SRM are shown for each speech corpus. Participants achieved greater SRM with the real than the perceived spatial manipulation for all corpora. There also appear to be larger individual differences in SRM for the real than the perceived spatial manipulation, an effect that is due to greater variability in SRTs for the real spatial separation. Levene's test for equality of variance provides some support for this observation, with results ranging from p = 0.001 (BKB-SIN) to p = 0.124 (QuickSIN). For the real spatial manipulation, there is a trend for a negative association between SRM and SRTs in the colocated condition; lower SRTs in the baseline condition are associated with a greater masking release when the target and masker are separated. Simple main effects testing indicates that the SRM is larger for BKB-SIN than PRESTO for the real spatial manipulation (p < 0.001), but SRM does not differ significantly across corpora for the perceived spatial manipulation (p ≥ 0.869).

Fig. 3 Distribution of SRM in dB, following the plotting conventions of [Figure 2]. Symbol style reflects the spatial manipulation, as defined in the legend.


Discussion

The primary objective of this study was to determine the feasibility of assessing spatial hearing abilities in the free field with clinically available sentence materials and speech maskers. The present dataset provides preliminary evidence that a reliable SRM can be measured in adults with normal hearing using any of the five speech-masked sentence recognition materials evaluated. Although experimental hardware was used to collect these data, the real spatial manipulation could be implemented with a CD player and a two-channel audiometer, routing the two channels to the same speaker (colocated) or different speakers (separated). Implementing the perceived spatial manipulation is not currently feasible with clinically available materials, but such materials could be developed by recording the target and masker on both channels of a CD with different delays and SNRs. Recall that improvements in performance associated with perceived spatial separation reflect the use of binaural timing cues, whereas head shadow cues are present with real separation. The present study evaluated the effect of separating the masker to the right, but reorienting the participant to face the right speaker would accommodate testing for a target in the front and a masker on the left.

Another objective of the present study was to evaluate how measures of SRM differ across commercially available tests of speech-in-speech recognition. Among the five test corpora, mean SRTs were lowest for the BKB-SIN and highest for PRESTO for both the colocated and separated conditions, using both the real and perceived spatial manipulation. Variability across stimuli is not unexpected (Calandruccio et al, 2017[6]), and low speaker predictability would tend to increase the amount of acoustic information necessary to recognize sentences from the PRESTO corpus compared with the BKB-SIN corpus (Mullennix et al, 1989[16]). One might predict that high SRTs in the colocated condition would be associated with greater informational masking such that spatial separation might have a larger beneficial effect (i.e., larger SRM). However, the opposite trend was observed: SRM was smaller for stimuli with higher SRTs in the colocated condition.

Studies have shown that binaural cues are important for listening in complex backgrounds (Vannson et al, 2015[23]), so it makes sense that adding an evaluation of spatial separation for speech-in-speech recognition would increase our ability to predict real-world hearing difficulties in populations with hearing impairments. However, further study is needed to establish protocols for administering and interpreting the results of binaural testing in populations with hearing loss. For example, it is unclear which spatial manipulation (real or perceived) is most relevant to predicting functional hearing outcomes and which corpus is most sensitive to effects of hearing loss.



Abbreviations

AzBio: adult speech recognition test developed at Arizona State University
BKB-SIN: speech-in-noise test using sentences developed by Bamford, Kowal, and Bench
Pediatric AzBio: pediatric speech recognition test developed at Arizona State University
PRESTO: Perceptually Robust English Sentence Test Open-set
QuickSIN: quick speech-in-noise test
rmANOVA: repeated measures analysis of variance
SNR: signal-to-noise ratio
SRM: spatial release from masking
SRT: speech reception threshold


No conflict of interest has been declared by the author(s).

Notes

Portions of this article were presented at the 44th Annual Scientific and Technology Meeting of the American Auditory Society, Scottsdale, AZ, March 2–4, 2017.


