Let’s Talk About Bias: A Solution-Oriented Approach to Representativeness in Mobility Data

By Brennan Lake / 7 minutes


A cross-sector need for transparency

As the public and private sectors increasingly turn to geospatial data for empirical insights on human mobility at scale, decision makers are rightfully asking an important question: does mobility data provide the full picture of a population’s movements, or is it biased towards certain segments of society?

Without a clear understanding of data representativeness, decision makers may lack confidence in relying on geospatial insights to drive business strategy in the private sector, and to set public policies and investment decisions in the public sector. These issues are heightened during times of crisis—such as the global COVID-19 pandemic—which not only led to a rapid increase in the demand for real-time data on human mobility, but also raised the stakes on the importance of ensuring that data accurately reflects reality.

With our core values rooted in Transparency, Accountability, and Innovation, Cuebiq is committed to providing the information and tools needed for decision makers to confidently identify, quantify, and rectify issues of bias in mobility data. First, we provide a comprehensive overview of how selection bias is inherently introduced in mobility data collection. Second, we demonstrate how Cuebiq users can utilize on-platform tools to assess data representativeness, while also highlighting independent, peer-reviewed assessments of our data. Finally, we explore innovative approaches to course-correcting for bias in mobility data.

Identifying Bias

As with any big-data asset, passively collected mobility data is subject to selection bias. For location-based service (LBS) data collected exclusively from smartphone applications, issues of selection bias are readily apparent. Since LBS data is, by definition, collected exclusively from opted-in smartphone users, such data sources do not capture the mobility patterns of people who do not own smartphones, whether because they use “feature” phones (flip phones) or do not own a mobile device at all.

Practically speaking, this means that certain socio-demographic groups are underrepresented in smartphone-derived mobility datasets. Take senior citizens, for example. While smartphone adoption among the elderly is rapidly increasing, just 61% of Americans over 65 own a smartphone, compared to 95%-96% of Americans aged 18-49.

Conversely, while smartphone ownership is high among minors, Cuebiq intentionally does not collect data from minors, going so far as to employ methodologies for identifying and removing devices suspected to belong to minors from its panel. While such a measure introduces bias into the dataset, it represents an uncompromising commitment to privacy.

Although less acute than age, there is also a correlation between smartphone ownership and income. As of 2021, smartphone ownership in the United States among those making more than $75,000 annually was 96%, while ownership among those making less than $30,000 was 76%. While 76% still represents a relatively high penetration rate, there will still be a slight selection bias towards wealthier users. In lower-income countries, where feature phones are ubiquitous but smartphone penetration remains lower, LBS data is less representative of lower-income users, as compared to middle- and higher-income users. As smartphone ownership continues to grow across income groups, however, such biases are likely to decline.

Quantifying Bias

For discerning data analysts, Cuebiq provides an on-platform application to quantify bias in mobility data. By correlating the density of inferred home areas in mobility data with publicly available census data, analysts can measure data representativeness across subsamples of the data—such as urban vs. rural or census block group vs. county—over time. In performing such analyses, platform users are able to more confidently rely on highly representative results, while also taking steps to account for caveats and even correct for bias within less representative results.
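The correlation described above can be sketched in a few lines of code. This is a minimal, hedged illustration, not a reproduction of Cuebiq's on-platform tool: the county names are anonymized and both the inferred home counts and the census populations are hypothetical values chosen for demonstration.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Inferred home-area device counts per county from the mobility panel
# (hypothetical values for illustration).
panel_devices = [1200, 450, 3100, 800, 95]

# Census population for the same counties (hypothetical values).
census_pop = [60000, 22000, 150000, 41000, 5000]

r = pearson_r(panel_devices, census_pop)
print(f"Panel-vs-census correlation: r = {r:.3f}")
```

A correlation close to 1 suggests the panel's geographic distribution tracks the census closely at this level of aggregation; a weaker correlation in a given subsample (say, rural block groups) would flag that subsample for the caveats and corrections discussed below.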

In addition to our own analyses, independent academic researchers have worked extensively with Cuebiq data through our Social Impact program. In the process, they have generated multiple peer-reviewed journal publications, which include sections on data representativeness.

Rectifying Bias

Once bias has been identified and quantified within a subsample of mobility data, analysts may want to take steps towards correcting such biases. For example, if lower-income users are underrepresented within a given sample of data, analysts can employ a number of methods to determine whether results for underrepresented segments are valid. These include:

  • Post-stratification: Once results have been obtained for various subsegments of a data panel, multilevel regression with post-stratification techniques can be used to apply weights to different segments based on how representative they are. As a result, post-stratification uses observable, representative portions of a data panel to estimate and reconstruct the less representative portions of the population.
  • Data fusion: LBS data is a highly sought-after data source due to its precision and accuracy, but as we know, it comes with certain selection biases due to its reliance on smartphone adoption. Alternative data sources, such as telco Call Detail Records (CDR), may be less granular, but they make up for this deficit with the sheer volume and representativeness of devices within their panel, including feature phones. By training machine learning models on highly representative CDR data, such models can predict how less represented LBS data segments would behave, thereby filling gaps in LBS data.
  • Synthetic populations: When certain segments of society, such as children, are completely absent from a dataset for privacy and ethical reasons, there may be valid use cases that necessitate an understanding of how these user segments interact with the population, such as measuring the effect of natural disasters on children. To solve these challenges, analysts can develop synthetic populations using agent-based modeling to infer mobility patterns of such user segments by relying on publicly available census data, which indicates the proportion of households in given areas with children.
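The simplest of these corrections, post-stratification weighting, can be illustrated with a short sketch. All numbers below are hypothetical: the income-segment shares and the per-segment metric (average daily trips) are invented for demonstration, not drawn from Cuebiq's panel.

```python
# Share of each income segment in the mobility panel vs. the census
# (hypothetical figures; a real analysis would derive these from the
# panel and from public census tables).
panel_share = {"under_30k": 0.15, "30k_75k": 0.40, "over_75k": 0.45}
census_share = {"under_30k": 0.25, "30k_75k": 0.45, "over_75k": 0.30}

# Example metric measured per segment: average daily trips (hypothetical).
avg_trips = {"under_30k": 2.1, "30k_75k": 2.8, "over_75k": 3.4}

# The naive panel estimate over-counts wealthier users, who are
# overrepresented in the panel relative to the census.
unweighted = sum(panel_share[s] * avg_trips[s] for s in avg_trips)

# Post-stratified estimate: each segment is reweighted by
# census_share / panel_share, which is equivalent to averaging the
# per-segment results with census shares directly.
weighted = sum(census_share[s] * avg_trips[s] for s in avg_trips)

print(f"unweighted: {unweighted:.3f} trips/day")
print(f"post-stratified: {weighted:.3f} trips/day")
```

In this toy example the post-stratified estimate comes out lower than the naive one, because the correction shifts weight toward the lower-income segment that the panel underrepresents. A full MRP workflow adds a multilevel regression step to stabilize estimates for sparse segments before reweighting.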

In addition to developing tools for our data clean room users, Cuebiq is proactively working with leading academic researchers to develop innovative methods for identifying, quantifying, and rectifying bias in mobility data. Stay tuned for additional research publications from our Social Impact community.

About the Author

Brennan Lake, VP of Social Impact and Enterprise Partnerships

As head of Cuebiq's Data for Good program, Brennan works with researchers and non-profits to improve lives through the novel use of location data. Brennan's background includes leading an international development NGO, and co-founding a SaaS platform for small businesses in Latin America.