Keywords
social determinants of health - geographic mapping - census tract - personalized medicine
Background and Significance
The environments in which individuals live, work, and socialize greatly influence
health and well-being.[1][2] These factors, known as social determinants of health (SDOH), are upstream from
specific disease processes but influence a person's chances of being healthy.[3][4] SDOH are key contributors to many health disparities, which are partially responsible
for disproportionate trends in morbidity and mortality at the population level.[5][6][7] Disadvantaging SDOH such as low education, poverty, limited access to health care,
and social isolation are associated with both an increased risk of developing, and
worse outcomes from, disease states such as diabetes, cardiovascular disease, and
kidney disease.[8][9][10][11][12][13] Identifying patients' SDOH can inform research and support patient-specific health
care needs and interventions, although it is important to consider issues of data
quality, spatial ambiguity, and population fallacy when relying on neighborhood-level
estimates.
Despite the importance of SDOH to patient health and well-being, electronic health
records (EHRs) seldom capture structured data about SDOH.[14][15][16][17] Factors that contribute to this issue include a lack of universally agreed-upon SDOH,
a lack of structured fields within the EHR, and increased workload for health care workers
who collect and input these data.[14][16]
One approach to inferring SDOH is to estimate based on where the patient lives.[18] Organizations such as the Agency for Healthcare Research and Quality and the Centers
for Disease Control and Prevention routinely publish SDOH data delineated by census
boundaries.[19][20][21] Measuring SDOH by census boundaries, most commonly census tract, allows for granular
calculations that closely represent the community. Obtaining boundary details requires
calculations using addresses that are available in the EHR. There is tremendous heterogeneity
in size and population among ZIP codes.[22] The U.S. Census Bureau defines smaller units, such as census tracts and block
groups, that are more uniform in size and population.[22][23]
Geocoding, or converting addresses into geographical coordinates, allows researchers
to obtain neighborhood-level estimates of SDOH.[24] Geocoding is performed via two methods: offline geocoding and geocoding through
an online service. Offline geocoding software such as DeGAUSS, Nominatim, EaserGeocoder,
SAS Geocoder, ESRI ArcGIS, QGIS, and the PostGIS TIGER geocoder[25][26][27][28][29][30] has been available for several years, but these tools often require an expensive license
or come with steep learning curves. Online tools, such as Google Maps, require sharing
addresses with the service, which raises privacy concerns. Under the Health Insurance
Portability and Accountability Act (HIPAA) privacy rule, addresses and census-level
data are considered protected health information.[31] Per-address fee structures also prove costly when geocoding large datasets. Both
methods require the additional step of mapping from geographic coordinates to SDOH
to be performed separately.
Objectives
Despite the importance of SDOH for research and operational use, there remains a critical
need for a local, HIPAA-compliant geocoding platform that can be easily deployed across
an organization and available to researchers and at the point of care. We developed
POINT: an interactive, web-based, containerized application for geocoding addresses
that can be deployed offline and made available to multiple users with minimal technical
expertise. Our application supports use through both a graphical user interface (GUI)
and an application programming interface (API) client, so that users can query geographic
variables by census tract and across census years without deploying their own solution or
exposing sensitive patient data. Integrating SDOH databases into the geocoding workflow streamlines
the process and allows for customizability to fit user needs. POINT serves as a low-cost
and scalable alternative to using a web service.
Methods
Technical Design
POINT uses Topologically Integrated Geographic Encoding and Referencing (TIGER)/Line
files.[32] TIGER/Line files are maintained by the U.S. Census Bureau and contain coordinate
boundaries down to street and street number. Every census geographic area is identified
by a unique Federal Information Processing Standard (FIPS) code. The Census TIGER/Line
files are organized into key components: census-designated places
or incorporated places, county subdivisions, census tracts, census block groups, topological
faces, names of each line/geographic area, line coordinates, and address ranges. A
separate set of shapefiles exists for each county.[32] Each file is downloaded, programmatically transformed, and imported into a PostgreSQL
database for address-level mapping.[33]
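The FIPS codes described above are hierarchical: a 12-digit block group identifier concatenates the state (2 digits), county (3 digits), tract (6 digits), and block group (1 digit) codes. The following is a minimal sketch of that nesting; the function name `split_block_group_fips` is hypothetical and is not part of POINT, and the example identifier in the usage note is illustrative only.

```python
def split_block_group_fips(geoid: str) -> dict:
    """Split a 12-digit census block group identifier into its nested parts.

    Identifiers are kept as strings so that leading zeros are preserved.
    """
    if len(geoid) != 12 or not geoid.isdigit():
        raise ValueError("expected a 12-digit block group FIPS code")
    return {
        "state": geoid[:2],      # 2-digit state code
        "county": geoid[:5],     # state + 3-digit county code
        "tract": geoid[:11],     # county + 6-digit tract code
        "block_group": geoid,    # tract + 1-digit block group code
    }
```

For example, `split_block_group_fips("470370165001")` yields a county code of `47037` (Davidson County, Tennessee); the tract and block group digits in this example are made up for illustration.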
An overview of the system architecture is displayed in [Fig. 1]. We loaded census boundaries into PostGIS, a geographic information system (GIS)
enabled database, to support address-level mapping into geographic coordinates, which
are then converted into census boundaries using structured query language (SQL). PostGIS,
the spatial database extension for PostgreSQL, provides robust functionality to standardize
and geocode address strings.[28] The address standardization process uses regular expressions to determine the
type of address, identify address components (such as ZIP code or street name), and
parse the address into a standard data structure with each component clearly delineated.
Our geocoder platform and supporting files are available on our GitHub repository.[34]
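To illustrate the general idea of regular-expression-based address parsing, the sketch below splits a simple one-line U.S. address into labeled components. It is a deliberately simplified stand-in, not the PostGIS standardization logic: the pattern, the field names, and the function `parse_address` are assumptions made for this example, and real addresses (units, PO boxes, missing commas) need far more rules.

```python
import re

# Hypothetical pattern for addresses shaped like "123 Main St, Nashville, TN 37203".
ADDRESS_PATTERN = re.compile(
    r"^\s*(?P<number>\d+)\s+"          # house number
    r"(?P<street>[^,]+?)\s*,\s*"       # street name and type
    r"(?P<city>[^,]+?)\s*,\s*"         # city
    r"(?P<state>[A-Za-z]{2})\s+"       # two-letter state abbreviation
    r"(?P<zip>\d{5})(?:-\d{4})?\s*$"   # 5-digit ZIP code, optional +4 suffix
)


def parse_address(text: str):
    """Return a dict of address components, or None if the pattern does not match."""
    match = ADDRESS_PATTERN.match(text)
    if match is None:
        return None
    parts = match.groupdict()
    parts["state"] = parts["state"].upper()  # normalize state casing
    return parts
```

Inputs that do not fit the pattern return `None`, which mirrors the need to flag unparseable strings for review rather than guess at their structure.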
Fig. 1 System architecture diagram showing interactions between each component of the application.
To package our software, we created a containerized system consisting of two images:
one for the database and one for our Python-based uvicorn web server.[35] We deploy the containers using Docker, a virtualization platform that facilitates
portability and reproducibility across systems and organizations.[36] A Python script is included that assists with the process of importing data from
common SDOH databases, including PLACES: Local Data for Better Health, Agency for
Healthcare Research and Quality SDOH Database, CDC Social Vulnerability Index, Food
Environment Atlas, Community Resilience Estimates, Area Deprivation Index, and United
States Department of Agriculture rural–urban commuting area (RUCA) codes.[19][21][37][38][39][40][41][42] Users may add SDOH mappings using an included Python script that loads a character-delimited
file with variable values for each FIPS code (county, tract, or block group). Future
census boundaries, or other types of spatial data (such as Health Resources and Services
Administration shortage areas), can be imported by running the included PostGIS functions
or importing the shape (.shp) file(s).
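As a rough illustration of the kind of character-delimited input described above, the sketch below reads one row per FIPS code and infers the census level from the code length (5 digits for county, 11 for tract, 12 for block group). The column name `fips`, the function `read_geovariables`, and the sample values are assumptions for this example; they do not describe the file layout POINT's own script expects.

```python
import csv
import io

# Census level implied by the number of digits in the FIPS code.
FIPS_LEVELS = {5: "county", 11: "tract", 12: "block_group"}


def read_geovariables(handle, delimiter=","):
    """Read rows of (FIPS code, variable values) and tag each with its census level."""
    rows = []
    for record in csv.DictReader(handle, delimiter=delimiter):
        fips = record.pop("fips").strip()  # keep as text to preserve leading zeros
        level = FIPS_LEVELS.get(len(fips))
        if level is None:
            raise ValueError(f"unrecognized FIPS length: {fips!r}")
        rows.append({"fips": fips, "level": level, "values": record})
    return rows


# Illustrative pipe-delimited input; the variable name and values are made up.
sample = io.StringIO("fips|poverty_rate\n47037|14.2\n47037016500|21.0\n")
rows = read_geovariables(sample, delimiter="|")
```

Reading FIPS codes as text rather than integers matters in practice, because states with codes below 10 have identifiers that begin with a zero.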
To support multiple users, each geocoding job, by default, is identified with an integer
number and password to maintain privacy and security when processing sensitive patient
data. The password and job number are required to access results. Some organizations
implementing POINT may wish to disable this feature and integrate the tool with local
security resources. Addresses and a user-defined identifier are saved in the database
for the duration of the geocoding job and are deleted automatically 72 hours after
job completion or 1 hour after download. Temporary files are generated during a download,
then immediately deleted.
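The retention rule above can be expressed as a small calculation. The sketch below assumes that whichever deadline comes first applies (72 hours after completion, or 1 hour after a download); that reading, along with the function name `deletion_time`, is an assumption for illustration and not a description of POINT's internal scheduler.

```python
from datetime import datetime, timedelta
from typing import Optional

RETAIN_AFTER_COMPLETION = timedelta(hours=72)
RETAIN_AFTER_DOWNLOAD = timedelta(hours=1)


def deletion_time(completed_at: datetime,
                  downloaded_at: Optional[datetime] = None) -> datetime:
    """Return when a job's stored addresses become eligible for deletion.

    Assumes the earlier of the two deadlines applies.
    """
    expiry = completed_at + RETAIN_AFTER_COMPLETION
    if downloaded_at is not None:
        expiry = min(expiry, downloaded_at + RETAIN_AFTER_DOWNLOAD)
    return expiry
```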
We provide a docker-compose configuration file that instructs Docker to pull
the Docker images from the Docker repository and deploy the application. Also included
is a shell script that will load the full 2021 Census TIGER dataset along with 2010
block group boundaries. It can take several hours to download and import all data,
depending on the system and download speeds, and will take up about 100 to 130 GB
of disk space. To reduce disk space, scripts are available to download data for only
a subset of states.
User Interface and System Functionality
The user interface supports two modes of access: a GUI and an API to support programmatic
queries. The API conforms to representational state transfer architecture and OpenAPI
specifications.[43] Our platform provides both geocoding (converting address to Census FIPS code) and
geovariable mapping (looking up values of variables based on Census FIPS code) functions
that can be performed either together or separately ([Fig. 2]).
Fig. 2 Workflow diagram of core functionality. Each user workflow (initiating a batch geocode
job, mapping geovariables, and viewing/downloading results) is highlighted in a different
color. Nodes with gray background represent system processes.
To geocode, the user inputs a character-delimited file with columns corresponding
to an address or individual address fields. The application outputs coordinates (longitude/latitude),
census block group FIPS codes, and geocoding scores, or ratings (an estimate of geocoding
accuracy/resolution ranging from 0 to 138, with 0 being an exact match). Based on our experiments, we set the
default threshold for successful geocoding to a rating of 25 or below, but thresholds
may be adjusted for different geographic precision. After an input file is uploaded,
the user defines a password, and a “job” is created with a unique identifier so that
the user can return to check progress or download results.
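A rating cutoff like the one described above can be applied to geocoder output with a simple filter. The sketch below assumes each result is a dictionary with a numeric `rating` key, where lower is better and a missing rating means no match; the record layout and the function `split_by_rating` are hypothetical and chosen only for illustration.

```python
DEFAULT_RATING_CUTOFF = 25  # default threshold reported in the text


def split_by_rating(results, cutoff=DEFAULT_RATING_CUTOFF):
    """Separate geocoding results into accepted rows and rows needing review."""
    accepted, review = [], []
    for row in results:
        rating = row.get("rating")
        if rating is not None and rating <= cutoff:
            accepted.append(row)   # rating at or below the cutoff
        else:
            review.append(row)     # poor rating or no match at all
    return accepted, review
```

Lowering `cutoff` trades hit rate for geographic precision, which is why the appropriate value depends on the task.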
The web application provides support to map geocoded addresses to a list of geographic
variables. The user can select geographic variables from a list of available measures,
based on the SDOH sources loaded into the application database. Target geographic
variables can be selected prior to geocoding as a part of the batch geocoding job
creation workflow or using files that were previously geocoded ([Fig. 2]).
System Evaluation
The PostGIS Tiger Geocoder was previously validated against a subset of the Open Addresses
dataset[44] using the bench4gis geocoding benchmarking framework with a reported 99% hit rate
(successful geocoding to geographic coordinates) and 65% accuracy within 100 m and
90% accuracy within 1 mile.[45]
We evaluated our application's performance using two address datasets. The first dataset
contained 1,000,000 nationally representative addresses sampled from Open Addresses,
which we have published online.[46] Open Addresses is a public database of street addresses and reference coordinates
collected from authoritative sources such as local GIS departments or postal services.[44] Random addresses were sampled in a population-weighted manner such that the distribution
of states in the dataset would match state populations as of the 2020 census. The
second dataset contained 3,096 patient addresses from Vanderbilt University Medical
Center (VUMC) that were previously geocoded with the official Census.gov geocoder,
which we took as gold standard.[47]
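One way to make a sample's state distribution match state populations is to fix the number of draws per state in proportion to population and then sample randomly within each state. The sketch below shows only the allocation step, using largest-remainder rounding so that the counts sum exactly to the requested total. It is an illustrative approach, not necessarily the procedure used in this study; the function name `allocate_sample` is hypothetical, and any populations passed in are supplied by the caller.

```python
def allocate_sample(populations: dict, total: int) -> dict:
    """Allocate `total` draws across groups in proportion to population.

    Uses largest-remainder rounding so the allocated counts sum to `total`.
    """
    grand_total = sum(populations.values())
    exact = {k: total * v / grand_total for k, v in populations.items()}
    counts = {k: int(x) for k, x in exact.items()}  # round down first
    leftover = total - sum(counts.values())
    # Hand the remaining draws to the groups with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - counts[k], reverse=True)
    for k in by_remainder[:leftover]:
        counts[k] += 1
    return counts
```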
First, we geocoded both datasets with the POINT geocoder and the DeGAUSS geocoder
to evaluate overall hit rate as a function of rating. For the Open Addresses dataset,
we used our platform's multithreading feature to improve efficiency (4 threads). No
equivalent feature was available for the DeGAUSS geocoder. To evaluate geocoder accuracy,
we compared concordance in assigned census block group, tract, and county between
output from the POINT geocoder and reference coordinates. We also evaluated geocoder
accuracy as a function of rating. We defined error as the geodesic distance
between coordinates returned by the POINT geocoder and reference coordinates. We visualized
the difference between calculated geocodes and reference coordinates using choropleth
maps generated using the plotly package in Python version 3.9.[48] To compare geocoder accuracy rating cutoffs, we computed planar census tract areas
and compared average tract areas between urban and rural tracts, defining rural tracts
as those with RUCA codes 8, 9, or 10.
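For readers who want to reproduce a distance-based error measure, the sketch below computes the great-circle (haversine) distance between two coordinate pairs. Note the assumption: this treats the Earth as a sphere and is therefore an approximation of, not a substitute for, a true ellipsoidal geodesic distance; dedicated geodesy libraries compute the latter. The function name `haversine_m` is chosen for this example.

```python
import math

EARTH_RADIUS_M = 6_371_008.8  # mean Earth radius in meters (spherical approximation)


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))
```

At the distances relevant to geocoding error (tens of meters to a few kilometers), the spherical approximation typically differs from the ellipsoidal result by well under 1%.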
Results
Our sample of the Open Addresses dataset consisted of 1,000,000 addresses from 49
of 50 U.S. States. Open Addresses does not contain addresses in New Hampshire, so
these were not represented in our sample. The VUMC addresses dataset had 3,096 total
addresses, consisting of 2,588 (83.5%) addresses from Tennessee, 249 (8.0%) addresses
from Kentucky, and 124 (4.0%) addresses from Alabama. [Table 1] compares geocoding statistics between POINT and DeGAUSS. Compared with DeGAUSS,
POINT mapped 30,474 more addresses from the Open Addresses dataset in 30% of the time
(31 vs. 103 h). Among successful mappings across both geocoders, performance was similar
with a median (interquartile range) distance of 52.5 (26.5–119.4) and 14.5 (10.9–24.6)
m from reference for the Open Addresses and VUMC datasets, respectively.
Table 1
Comparison of geocoder accuracy and runtimes
Dataset | Measure | POINT geocoder | DeGAUSS
Open Addresses | Successful geocodes, n (%) | 994,146 (99.4) | 963,672 (96.4)
Open Addresses | Median error, m (IQR) | 52.5 (26.5–119.4) | 54.7 (29.8–113.5)
Open Addresses | Runtime | 31 h | 103 h
VUMC | Successful geocodes, n (%) | 3,089 (99.8) | 3,058 (98.8)
VUMC | Median error, m (IQR) | 14.5 (10.9–24.6) | 15.5 (9.6–30.1)
VUMC | Runtime | 17 min | 21 min
Abbreviations: IQR, interquartile range; VUMC, Vanderbilt University Medical Center.
Out of the addresses in the VUMC dataset, 2,907 (93.4%), 2,942 (95.0%), and 3,034
(98.0%) were concordant between the Census.gov and POINT results for census block
group, tract, and county levels respectively. [Table 2] provides a breakdown of accuracy at the census block group, tract, and county levels
by RUCA codes for the Open Addresses dataset, where reference coordinates are available.
Among successfully geocoded addresses that could be mapped to RUCA codes, POINT geocoded
888,192 (89.4%), 903,256 (90.9%), and 965,955 (97.2%) addresses to the same census
block group, tract, and county levels, respectively. We visualize the difference in
census tract concordance as a function of county in [Fig. 3]. [Table 3] provides detailed accuracy metrics for the POINT geocoder across both datasets.
Our geocoder achieved the best-possible accuracy rating of 0, which corresponds to
an exact match, in 53.7% of addresses across both datasets. A total of 921,992 (91.8%)
addresses were successfully geocoded within our default rating cutoff of 25. Similarly,
at a rating cutoff of 25, 63.2% of Open Addresses and 72.4% of VUMC addresses were
within 500 m of the reference coordinates. Hit rates across levels of geographic precision
are available in [Supplementary Table S1] (available in the online version). We include a comparison of geocodes calculated
from DeGAUSS and POINT in [Supplementary Table S2] (available in the online version). Among all successfully geocoded addresses, 31,081
(3.1%) from the Open Addresses dataset and 175 (5.7%) from the VUMC address dataset
were identified as rural residences. Specific census tract areas and average tract
areas by county are included in [Supplementary Table S3] (available in the online version).
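Because census identifiers are nested, concordance at the county, tract, and block group levels can be checked by comparing identifier prefixes of length 5, 11, and 12. The sketch below assumes each address has a predicted and a reference 12-digit block group identifier; the function `concordance` and the input layout are illustrative assumptions rather than the evaluation code used in this study.

```python
# Number of leading digits that identify each census level.
LEVEL_PREFIX = {"county": 5, "tract": 11, "block_group": 12}


def concordance(pairs):
    """Fraction of (predicted, reference) identifier pairs that agree at each level."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("at least one pair is required")
    return {
        level: sum(pred[:n] == ref[:n] for pred, ref in pairs) / len(pairs)
        for level, n in LEVEL_PREFIX.items()
    }
```

By construction, concordance can only stay the same or fall as the level becomes finer, which matches the ordering of the county, tract, and block group figures reported above.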
Table 2
Geocoding accuracy at various census divisions for Open Addresses dataset by rural–urban
commuting area codes
RUCA code | Frequency, n (%) | Correct county, n (%) | Correct tract, n (%) | Correct block group, n (%)
Overall | 993,614 | 965,955 (97.2) | 903,256 (90.9) | 888,192 (89.4)
1 (metropolitan area core) | 791,851 (79.7) | 778,554 (98.3) | 728,920 (92.1) | 718,924 (90.8)
2 (high commuting to metropolitan area) | 88,423 (8.9) | 82,663 (93.5) | 77,186 (87.3) | 75,225 (85.1)
3 (low commuting to metropolitan area) | 6,968 (0.7) | 6,434 (92.3) | 5,929 (85.1) | 5,729 (82.2)
4 (micropolitan area core) | 39,868 (4.0) | 37,883 (95.0) | 34,754 (87.2) | 34,011 (85.3)
5 (high commuting to micropolitan area) | 14,242 (1.4) | 13,111 (92.1) | 12,437 (87.3) | 12,006 (84.3)
6 (low commuting to micropolitan area) | 2,685 (0.3) | 2,373 (88.4) | 2,243 (83.5) | 2,174 (81.0)
7 (small town core) | 18,496 (1.9) | 16,868 (91.2) | 15,467 (83.6) | 14,833 (80.2)
8 (high commuting to small town) | 5,126 (0.5) | 4,634 (90.4) | 4,365 (85.2) | 4,217 (82.3)
9 (low commuting to small town) | 2,300 (0.2) | 2,049 (89.1) | 1,969 (85.6) | 1,923 (83.6)
10 (rural areas) | 23,655 (2.4) | 21,386 (90.4) | 19,986 (84.5) | 19,150 (81.0)
Abbreviation: RUCA, rural–urban commuting area.
Notes: Reference based on published geographic coordinates. A total of 6,386 (0.64%)
addresses were excluded due to inability to map to RUCA code.
Fig. 3 Choropleth maps showing POINT accuracy at the census tract level (compared with reference)
in each county for the (A) Open Addresses Dataset and (B) VUMC addresses (only Tennessee and Kentucky counties). Counties without at least
one address geocoded are indicated in gray. VUMC, Vanderbilt University Medical Center.
Table 3
Distances in meters between the POINT geocoded coordinates and published Open Addresses
coordinates or Census.gov geocoder coordinates for each rating bin
Rating | Addresses, n (%) | Distance from reference, m, median (IQR) | ≤50 m (%) | ≤100 m (%) | ≤500 m (%) | ≤1,000 m (%)
Open Addresses dataset
0 | 538,349 (54.2) | 41.9 (23.2–76.8) | 57.6 | 83.0 | 98.9 | 99.2
5 | 75,255 (7.6) | 58.1 (28.0–134.3) | 44.4 | 68.0 | 91.7 | 95.5
10 | 165,396 (16.6) | 58.2 (29.5–132.8) | 44.5 | 67.8 | 90.8 | 93.7
15 | 86,548 (8.7) | 57.3 (28.6–126.8) | 44.8 | 69.0 | 90.8 | 93.6
20 | 35,397 (3.6) | 84.2 (35.4–357.9) | 34.8 | 54.4 | 77.8 | 82.5
25 | 18,119 (1.8) | 158.0 (49.2–2,571.3) | 25.4 | 41.4 | 63.2 | 68.1
50 | 31,670 (3.2) | 4,013.1 (148.9–36,845.5) | 12.1 | 20.7 | 34.8 | 39.1
100 | 40,955 (4.1) | 8,039.5 (3,008.8–23,945.2) | 4.2 | 6.8 | 11.2 | 13.7
150 | 2,457 (0.3) | 153,082.5 (12,745.8–293,679.8) | 0.0 | 0.2 | 1.4 | 2.8
VUMC addresses dataset
0 | 1,095 (35.4) | 12.7 (10.4–17.8) | 97.1 | 99.4 | 99.8 | 99.8
5 | 136 (4.4) | 19.8 (13.2–57.7) | 73.5 | 79.4 | 88.2 | 91.9
10 | 1,210 (39.2) | 14.4 (10.9–23.6) | 89.8 | 94.5 | 96.6 | 96.9
15 | 327 (10.6) | 17.5 (11.8–30.5) | 85.3 | 94.8 | 98.2 | 98.5
20 | 102 (3.3) | 22.0 (12.8–39.3) | 80.4 | 84.3 | 89.2 | 90.2
25 | 58 (1.9) | 22.2 (12.6–1,141.7) | 56.9 | 69.0 | 72.4 | 74.1
50 | 48 (1.6) | 38.9 (13.9–9,757.7) | 52.1 | 54.2 | 58.3 | 58.3
100 | 99 (3.2) | 3,420.1 (17.2–63,914.2) | 42.2 | 42.2 | 43.4 | 43.4
150 | 14 (0.5) | 7,985.8 (4,898.3–9,689.9) | 0.0 | 0.0 | 0.0 | 0.0
Abbreviations: IQR, interquartile range; VUMC, Vanderbilt University Medical Center.
Notes: Threshold columns report the proportion of distances below each threshold. Percentages
in the Addresses column are reported out of total hits (994,146 and 3,089, respectively).
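The per-bin summaries in Table 3 (median, interquartile range, and share of distances under each threshold) can be computed with the Python standard library. The sketch below is illustrative: the quartile convention (`method="inclusive"`) is an assumption, since the text does not state which convention was used, and the function name `summarize_distances` is hypothetical.

```python
import statistics

THRESHOLDS_M = (50, 100, 500, 1000)  # distance thresholds used in Table 3


def summarize_distances(distances_m):
    """Median, IQR, and percentage of distances at or below each threshold."""
    q1, median, q3 = statistics.quantiles(distances_m, n=4, method="inclusive")
    n = len(distances_m)
    pct_within = {
        t: 100 * sum(d <= t for d in distances_m) / n for t in THRESHOLDS_M
    }
    return {"median": median, "iqr": (q1, q3), "pct_within": pct_within}
```

Reporting the median rather than the mean is useful here because a small number of centroid-level matches can be tens of kilometers from the reference and would otherwise dominate the summary.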
Discussion
We developed a web-based application to enable offline, HIPAA-compliant geocoding
and downstream mapping to neighborhood-level variables. The POINT geocoder includes
both a GUI and API to support users across a range of technical expertise. The application
supports mapping to multiple census years and sources of neighborhood-level data,
and we have integrated a robust pipeline that allows users to incorporate additional
datasets as they become available. Our results demonstrate that POINT offers an improved
hit rate with similar accuracy to existing solutions, including DeGAUSS and the U.S.
Census Bureau's official geocoder.
Understanding community- or neighborhood-level variation is essential to evaluating
SDOH and reducing disparity in health and health care.[4][16] For example, community vital signs (aggregate measures of SDOH) have been proposed
as a way to integrate community-level social determinants into clinical decision support
tools.[18][49] These community vital signs could identify patients who may benefit from targeted
interventions, such as sending informational material on quick and easy healthy recipes
for patients who live in food deserts. They can also be incorporated into predictive
risk modeling at a population level for provider reimbursement adjustments or community-level
initiatives.[49][50][51] Integrating individual patient SDOH into the EHR can support clinical work and improve
patient engagement. Using coarsened geocodes such as census division instead of exact
patient addresses also serves to preserve individual patient privacy in research.
The POINT geocoder offers several advantages over existing geocoding applications.
First, the POINT geocoder was designed to provide free robust geocoding and SDOH mapping
capabilities to multiple users across an organization. Existing tools offer free offline
services to single users or online services to multiple users. POINT serves as an
important intermediate solution between fully offline software packages that each
user must configure on their own and an online cloud-based solution that requires
exposing sensitive data to a third party. Second, POINT provides access through both
GUI and API. Other offline tools often only support a single type of access, most
commonly through command line interface. Users with technical expertise can access
the tool programmatically and integrate it into established analytic pipelines, whereas
users who prefer a graphical interface can perform all tasks through their web browser.
At our institution, we are exploring approaches to integrate geocoding into the EHR
using the POINT API. One initiative involves geocoding addresses for patients in the
emergency department to identify opportunities for convenient follow-up close to home.
POINT provides a single robust pipeline to geocode addresses and map geocodes to SDOH
measures. Existing solutions commonly offer geocoding functionality but rely on users
to perform additional mapping to SDOH metrics. Providing geocoding and SDOH mapping
functionality in a single pipeline supports users without requiring additional technical
expertise to curate, transform, and link SDOH data. At our institution, we are experimenting
with opportunities to integrate the POINT SDOH pipeline in the EHR as part of decision
support to identify patients who may need additional support during telehealth visits.
POINT also supports scalability to multiple datasets. By default, POINT incorporates
data from the 2010 and 2020 census and multiple commonly referenced SDOH databases.
However, census boundaries change every 10 years, and new SDOH datasets are consistently
published or updated. POINT includes functionality to import new census years and
SDOH datasets.
Our experiments suggest that POINT offers performance that is consistent or superior
to existing tools. We were able to corroborate reported benchmark hit rates of 99%
with POINT yielding a hit rate of 99.4%.[45] Across census block group, tract, and county, POINT was greater than 93% concordant
for addresses in the VUMC dataset. Based on reference coordinates from Open Addresses,
we were able to obtain concordant assignments of 89.4, 90.9, and 97.2% at the census
block group, tract, and county levels respectively, with expected declines for addresses
in areas with decreased population density. Even at the most precise census division
(block group), the worst percent concordance was still above 80% in low population
density areas. The slightly worse performance for the Open Addresses dataset may reflect
lower-quality reference coordinates due to the heterogeneity of address sources in
the Open Addresses dataset. Concordance between output coordinates suggests that POINT
offers similar accuracy to other geocoders (median distances of 14.5 and 5.9 m vs.
Census.gov reference and the DeGAUSS geocoder, respectively). We hypothesize that
the difference in hit rate between POINT and DeGAUSS may reflect differences in prefiltering
of poor-quality geocodes. On the geographically diverse and nationally representative
Open Addresses dataset, concordance with published coordinates was similar with a
median distance of 52.5 m. Common reasons for failure include typos in the address
string and incorrectly positioned apartment numbers. Future work that advances address
string standardization beyond PostGIS functions to better detect and correct typographical
errors and ensure consistent formatting prior to geocoding may improve geocoding performance.
We recommend that users consider standardizing address strings, such as with CASS-certified
software, before using them as input for the POINT geocoder.
Geocoding with online services, such as Google Maps and OpenStreetMap (OSM), has
been evaluated with similar methods.[52][53] Hit rates of 93 and 82% and median distances from reference coordinates of 9 and
175.8 m were previously reported for Google Maps and OSM, respectively.[52] Google Maps yields a slightly better median distance from reference (9 vs. 14.5
m) than POINT. However, the nationwide mean census tract area based on 2020 census
boundaries is 116.8 km²; metropolitan area cores had a mean tract area of 8.0 km². It is unlikely that the difference in median distance from reference between Google Maps and POINT
yields significantly different tract-level results. Use of Google Maps requires exposing
addresses to a third-party server.
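A back-of-the-envelope calculation helps put these distances in context. The sketch below converts a tract area into the side length of a square of equal area and compares it with the gap in median error. Treating a tract as a square is a simplifying assumption made only for scale; real tracts are irregular, and addresses near a boundary can still be assigned to the wrong tract even when the positional error is small.

```python
import math


def equivalent_square_side_m(area_km2: float) -> float:
    """Side length, in meters, of a square with the given area in square kilometers."""
    return math.sqrt(area_km2) * 1000.0


# Mean tract area for metropolitan area cores reported in the text (8.0 km²).
side_m = equivalent_square_side_m(8.0)

# Difference between the reported median errors (14.5 m vs. 9 m).
median_gap_m = 14.5 - 9.0

# The gap expressed as a fraction of the equivalent tract side length.
gap_fraction = median_gap_m / side_m
```

Under this simplification the equivalent side length is roughly 2.8 km, so a difference of a few meters in median error is small relative to the typical extent of even the most compact tracts.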
Spatial uncertainty and data quality are two key considerations in geocoding addresses.
One source of spatial uncertainty stems from ambiguous road network data, in that
positions for specific street/house numbers are often interpolated based on address
ranges when they are not always uniformly distributed across a given street.[54] Our analyses relied on two large datasets that separately provided addresses corresponding
to a robust national representation and detailed local representation. However, these
datasets suffer from a lack of “ground truth” geocodes and inconsistent data quality.
To address this limitation, we assessed concordance between multiple existing geocoders
that have been applied widely. In creating our evaluation dataset from Open Addresses,
we used a weighted sampling approach that sampled in proportion to the population of
each state. While this approach yielded a nationally representative sample, county
representation in some states was incomplete or poor. This was likely due to how data
were collated to create the Open Addresses dataset, which used a large variety of
local sources, some of which provided incomplete data or improperly labeled
address segments that could not be included in the evaluation set. Analysis of accuracy may also
differ significantly between established and new communities, especially those whose
street names are new, and we did not have a good method to systematically identify
newer addresses. Finally, a limitation of using overall hit rates is that a reported
successful geocode does not necessarily imply accuracy. The PostGIS geocoder, for
example, will return successful geocodes at geographic centroids of census-designated
places or ZIP codes if street number/name cannot be matched to one in the database.
With every geocoding attempt, the PostGIS geocoder returns a rating score based on
confidence.[28] While we propose 25 as a potential threshold for accuracy, alternative thresholds
may be more appropriate for different datasets, tasks, or research questions. Users
may wish to investigate appropriate cutoffs for their specific projects. For instance,
geocoding error rate increases as population density decreases.[55] This is an observation that we have redemonstrated in [Table 2].
There are several limitations to geocoding in research and operational settings. Firstly,
it is important to consider the risk of ecological fallacy when using geocoding as
a tool to estimate individual patient characteristics. Aggregate SDOH characteristics
based on home addresses may not yield representative traits of individuals. Additionally,
many SDOH measures are based on sampling of all residents of a census division, but
populations accessing health care may differ significantly from the rest of the individuals
living in their neighborhood by virtue of needing health care. Address sources themselves
can also serve as sources of spatial uncertainty. Patient addresses may simply be
incorrect. This can be due to inaccurate transcription, ambiguous addresses, or out-of-date
address records.[56] Finally, while our software package and scripts do not include non-U.S. geographic
boundaries, if GIS data are available, they can be imported programmatically into
our tool.
Conclusion
We developed an interactive, offline, web-based application to support address geocoding
and mapping geocodes to neighborhood-level variables. POINT offers a HIPAA-compliant
approach that can be easily scaled to multiple users with minimal technical expertise
on a single installation. POINT successfully geocoded a greater percentage of addresses
than existing geocoding tools. Among addresses that were successfully geocoded, we
noted concordant mappings between systems, which suggests accuracy. As health systems
and researchers continue to explore and improve health equity, it is essential to
obtain, and moreover, integrate into the EHR, accurate neighborhood-level variables
in a HIPAA-compliant way.
Clinical Relevance Statement
POINT is an offline geocoding solution that can support multiple users and integrates
downstream mapping to neighborhood-level variables with a pipeline that allows users
to incorporate additional datasets as they become available while protecting patient
privacy. Geocoding at the patient level can enable targeted interventions that account
for individual patient needs and circumstances based on the communities in which they
live.
Multiple-Choice Questions
1. What does it mean to geocode an address?
a. Rewrite an address in a standardized form
b. Convert the address into precise geographic coordinates
c. Transfer an address into an electronic database
d. Plot an address on a map
Correct Answer: The correct answer is option b. Geocoding refers to the process of converting addresses
from a text format (consisting of street number, street name, city, ZIP code, and
state) into precise geographic coordinates (such as longitude and latitude).
2. What are community vital signs?
a. Average heart rate, blood pressure, temperature, and respiratory rate of members of
a given community
b. Individual patient factors such as income or occupation
c. Aggregate measures of SDOH in a community
d. Average distance from a health care facility
Correct Answer: The correct answer is option c. Community vital signs are measures of SDOH derived
from neighborhood-level data. Like traditional vital signs, community vital signs
provide clinicians with key information about the social environment in which patients
live.
Supplementary Table S1
Hit rates on Vanderbilt University Medical Center dataset at different levels of geographic
precision
Rating | Addresses, interval (%) | Addresses, overall (%) | Street number, interval (%) | Street number, overall (%) | Street, interval (%) | Street, overall (%) | City, interval (%) | City, overall (%)
0 | 32.0 | 32.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0 | 100.0
5 | 3.6 | 35.6 | 77.9 | 97.8 | 100.0 | 100.0 | 100.0 | 100.0
10 | 38.2 | 73.8 | 96.6 | 97.2 | 100.0 | 100.0 | 100.0 | 100.0
15 | 9.3 | 83.0 | 90.6 | 96.5 | 100.0 | 100.0 | 100.0 | 100.0
20 | 3.1 | 86.2 | 84.1 | 96.0 | 100.0 | 100.0 | 100.0 | 100.0
25 | 2.1 | 88.2 | 76.1 | 95.5 | 100.0 | 100.0 | 100.0 | 100.0
30 | 1.0 | 89.2 | 62.9 | 95.2 | 100.0 | 100.0 | 100.0 | 100.0
35 | 0.7 | 89.9 | 47.0 | 94.8 | 100.0 | 100.0 | 100.0 | 100.0
40 | 0.6 | 90.6 | 35.9 | 94.4 | 100.0 | 100.0 | 100.0 | 100.0
45 | 0.5 | 91.0 | 38.8 | 94.1 | 100.0 | 100.0 | 100.0 | 100.0
50 | 0.4 | 91.4 | 58.9 | 93.9 | 100.0 | 100.0 | 100.0 | 100.0
75 | 0.3 | 94.5 | 70.0 | 93.2 | 100.0 | 100.0 | 100.0 | 100.0
100 | 0.4 | 98.6 | 12.2 | 89.8 | 27.8 | 97.0 | 100.0 | 100.0
150 | 0.1 | 100.0 | 0.0 | 88.5 | 0.0 | 95.6 | 100.0 | 100.0
Notes: Percentages reported out of total hits (422,162/423,722 = 99.6%). Maximum rating
in the dataset was 144.
Supplementary Table S2
Distances in meters between DeGAUSS geocoder coordinates and PostGIS Tiger geocoder
coordinates for each rating bin for both datasets
Rating | Addresses, n (%) | Distance from reference, m, median (IQR) | ≤50 m (%) | ≤100 m (%) | ≤500 m (%) | ≤1,000 m (%)
Open Addresses dataset
0 | 534,482 (55.7) | 4.2 (2.9–10.1) | 93.5 | 95.8 | 98.9 | 99.1
5 | 72,946 (7.6) | 9.8 (3.5–96.1) | 70.2 | 75.3 | 90.3 | 93.0
10 | 161,975 (16.9) | 6.3 (3.2–27.3) | 79.3 | 83.3 | 92.6 | 94.1
15 | 84,153 (8.8) | 5.1 (2.8–23.0) | 79.8 | 83.4 | 92.4 | 93.6
20 | 32,672 (3.4) | 15.0 (3.6–481.6) | 58.7 | 64.0 | 79.3 | 82.2
25 | 16,241 (1.7) | 96.6 (5.0–2,622.4) | 45.8 | 50.3 | 66.5 | 70.6
50 | 26,105 (2.7) | 1,972.2 (159.3–76,177.1) | 21.1 | 23.2 | 28.7 | 31.0
100 | 29,898 (3.1) | 8,149.3 (2,349.1–29,899.7) | 10.4 | 11.5 | 15.3 | 17.9
150 | 1,609 (0.2) | 145,468.3 (8,387.6–292,102.4) | 0.0 | 0.0 | 1.1 | 2.2
VUMC addresses dataset
0 | 1,115 (30.3) | 3.8 (2.7–10.0) | 92.1 | 94.1 | 97.8 | 98.3
5 | 168 (4.6) | 10.1 (3.3–181.8) | 66.7 | 70.2 | 84.5 | 88.1
10 | 1,281 (34.8) | 4.3 (2.7–10.6) | 86.6 | 89.5 | 94.9 | 96.4
15 | 366 (9.9) | 4.5 (2.9–18.3) | 81.7 | 85.8 | 94.0 | 95.1
20 | 143 (3.9) | 10.0 (3.2–309.1) | 64.3 | 68.5 | 77.6 | 80.4
25 | 94 (2.6) | 85.3 (3.6–2,119.4) | 48.9 | 51.1 | 63.8 | 71.3
50 | 96 (2.6) | 132.8 (3.7–57,151.1) | 43.8 | 47.9 | 56.3 | 58.3
100 | 276 (7.5) | 3,395.3 (989.3–12,438.2) | 17.3 | 18.1 | 21.7 | 25.4
150 | 144 (3.9) | 5,372.7 (3,490.5–10,588.5) | 0.0 | 0.0 | 0.0 | 0.0
Abbreviations: IQR, interquartile range; VUMC, Vanderbilt University Medical Center.
Notes: Threshold columns report the proportion of distances below each threshold. Percentages
in the Addresses column are reported out of total overlapping addresses (960,061 and 3,683).
Supplementary Table S3
Mean (2020) census tract areas by rural–urban commuting area code
RUCA code (description) | Mean area, km²
Nationwide | 116.8
1 (Metropolitan area core) | 8.0
2 (High commuting to metropolitan area) | 212.5
3 (Low commuting to metropolitan area) | 273.5
4 (Micropolitan area core) | 60.3
5 (High commuting to micropolitan area) | 424.7
6 (Low commuting to micropolitan area) | 348.7
7 (Small town core) | 218.1
8 (High commuting to small town) | 772.6
9 (Low commuting to small town) | 390.2
10 (Rural areas) | 1,363.1
Abbreviation: RUCA, rural–urban commuting area.