FirstRepository

Methodical Investigation

Aravind Surumpudi

Word Count: 2030

Introduction

Solving the poverty crisis in Eastern Africa through geospatial and satellite data is a lofty, yet achievable goal. The amount of technology and data at our fingertips has paved the way for data scientists to predict and measure poverty in ways we have never seen before. However, a few salient harms come with all this data and capability. The harms I would like to discuss are security/privacy, costs, and bad data/analytics.

Security and privacy concerns have long been associated with data science. For data science to solve big problems such as poverty, we need large datasets. These datasets can include private information that is vulnerable to identity and data theft. If ordinary people are hurt in the effort to solve the poverty crisis, they will hesitate to share their data, halting the progress of data science before it even starts. Another harm is cost: storing and mining data is enormously expensive. The money would most likely come from government funding, which could reduce the budget for other areas of development such as schools or hospitals. Of course, this is more of a financial issue than a data science issue, but it is a harm nonetheless and can create a “one step forward and two steps back” scenario.

My central research question, “Can these predictive models and mapping work in regions where poverty is not as apparent or outright?”, sheds light on the biggest harm, fixation (bad data/analytics). Big data and these models have shown their prowess in predicting poverty in areas where poverty is already highly concentrated, but what about areas where pockets of poverty are integrated into an urban setting? Data science relies on large datasets to construct predictive models, but in these areas there is not much data focusing on the impoverished. These people end up lost in the large sea of data and risk being forgotten.

When people think of poverty, they think of Africa, but Africa accounts for only 40% of global poverty. The other 60% also needs financial aid and our focus, and in this paper we will assess whether the two geospatial data science methods (satellite imagery/RS data and CDR) are capable of achieving this. To review, satellite imagery has been very useful in producing predictive models and has improved high-frequency surveys. Datasets can include night lights and the condition of roads and homes, which are used to produce “heat maps” of poverty. CDR (Call Detail Records), on the other hand, uses information from cell towers, such as the number of texts sent and received or app usage. Both methods produce predictive models of poverty, which are used to direct financial and humanitarian aid efficiently.

The Inquiry

My research question is an exploratory inquiry, which means I am trying to find out what is happening through an initial exploration of this poverty phenomenon and am generating questions for further investigation. We know that the two geospatial data science methods are effective in predicting and mapping poverty in areas where poverty is highly concentrated, but do they have the same capability in areas where poverty is less concentrated? This inquiry is also classified as a development puzzle, in that I am exploring how CDR and satellite imagery work and how they can be refined to suit particular datasets. The scope of my research question is narrower (though still broad) than when I was exploring poverty in all of Eastern Africa. To answer this question, we are going to explore three sub-questions to truly understand the capability of CDR and satellite imagery. How do CDR and satellite imagery work? How have they been used in past studies? Is there room for improvement and refinement? In this paper we will answer all of these questions and dissect these geospatial methods.

Satellite Imagery/RS data

The use of satellite imagery and RS (remote sensing) data has played a crucial role in producing poverty maps. Satellites can collect data such as nightlights and the condition of roads and homes. I would like to focus on a study of Rwanda conducted by Asmi Kumar. Kumar used neural networks and satellite imagery to predict asset wealth. The study started by downloading Demographic and Health Surveys (DHS), which provide household data on health, nutrition, and assets. Next, daytime images were obtained from Google Maps, which show the landscape (the condition of roads and homes) and activity. Finally, nightlights were extracted from satellite imagery based on the locations provided in the DHS dataset. These two components (daytime images and nightlights) were merged with the DHS data in order to create cluster maps.
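The merge step described above can be illustrated with a minimal sketch. It is not Kumar's code: the raster, the cluster records, the grid coordinates, and the function names are all hypothetical, and a real pipeline would read georeferenced imagery rather than a small list of lists. The idea shown is only that each survey cluster gets the average nightlight value of the cells around its location.

```python
from statistics import mean

# Hypothetical nightlight raster: rows are latitude steps, columns are longitude steps.
NIGHTLIGHTS = [
    [0, 0, 1, 2],
    [0, 1, 3, 5],
    [1, 2, 8, 9],
    [0, 1, 6, 7],
]

# Hypothetical DHS-style clusters: a grid position plus a survey-derived asset score.
CLUSTERS = [
    {"id": "c1", "row": 0, "col": 0, "asset_score": -1.2},
    {"id": "c2", "row": 2, "col": 2, "asset_score": 1.5},
]


def mean_luminosity(raster, row, col, radius=1):
    """Average the raster cells within `radius` cells of (row, col), clipped at the edges."""
    rows = range(max(0, row - radius), min(len(raster), row + radius + 1))
    cols = range(max(0, col - radius), min(len(raster[0]), col + radius + 1))
    return mean(raster[r][c] for r in rows for c in cols)


def attach_nightlights(clusters, raster, radius=1):
    """Return a copy of each cluster record with its average nightlight value added."""
    return [
        {**c, "nightlight": mean_luminosity(raster, c["row"], c["col"], radius)}
        for c in clusters
    ]


merged = attach_nightlights(CLUSTERS, NIGHTLIGHTS)
for record in merged:
    print(record["id"], record["asset_score"], record["nightlight"])
```

Each merged record now pairs a survey value with an image-derived value, which is the form a regression or cluster map needs.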

To the left we have a map of asset scores modeled by nightlight luminosity. The dots signify cluster wealth (red dots mean low cluster wealth and bright blue dots mean high cluster wealth). To the right we have a linear regression model, and we can see that the fit is tight when average nighttime luminosity is low, but the points are scattered widely as luminosity increases. Nighttime luminosity is therefore a useful indicator at the low end of asset wealth, but not at the higher end. When switching to daytime imagery and merging the data, we can see a large increase in the R-squared value (the model's predictive capability).

Here we have three linear regression models (blue, green, red). The blue model has an R-squared value of 0.558 when using only basic daytime features. This value is subpar at best, so a pre-trained CNN (Convolutional Neural Network) was used in order to leverage a dataset of both daytime and nighttime features. After doing this, the R-squared value rose to 0.689. In an attempt to improve model accuracy further, Kumar implemented a transfer learning step. Rather than directly using the image features extracted by the CNN, the model was retrained to predict nightlights from daytime imagery. By repurposing the model to estimate nightlight intensities, it was able to predict asset wealth indirectly. This raised the R-squared value to 0.718. A normalized confusion matrix was produced to show that transfer learning improves accuracy, in that nightlights alone fail to characterize low-brightness areas well.

The variables assessed in this study were nightlights and daytime features. It was interesting to see that the data collected had both temporal and spatial dimensions, which I feel contributed to the success of the predictive models. An array of methods was used, including satellite imagery, neural networks, and clustering. Surveys were used as groundwork to localize where the poverty imaging took place and to work with the daytime and nightlight images.

This study relates directly to my research question. It showed that researchers can use CNNs with daytime and nighttime imagery, combined with existing survey data, to pinpoint and predict poverty in specific places. This method is both scalable and cost-efficient, which addresses many of the salient harms I introduced at the beginning of this paper. The only gap I was able to critique is that the study used Rwanda. It showed that the models perform well in specific areas, so it would be valuable to see them tested in states such as California or New York, where very small clusters of poverty are hidden in wealthy areas.
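Since the comparison above rests on R-squared values, a short sketch of how that number is computed may help. The code fits a one-variable least-squares line and reports R-squared as 1 minus the ratio of residual to total variation. The feature values and asset scores are invented for illustration and are not taken from Kumar's study, which used image features from a CNN rather than a single input.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: return (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(ys, predictions):
    """Share of the variation in ys explained by the predictions."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical image-derived feature and survey asset score for five clusters.
feature = [1.0, 2.0, 3.0, 4.0, 5.0]
asset_score = [1.1, 1.9, 3.2, 3.8, 5.0]

slope, intercept = fit_line(feature, asset_score)
predicted = [slope * x + intercept for x in feature]
print("R-squared:", round(r_squared(asset_score, predicted), 3))
```

An R-squared of 1 means the predictions match the survey values exactly, and 0 means the model does no better than always predicting the average, so the rise from 0.558 to 0.718 reflects a larger share of the variation in asset wealth being explained.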

CDR Data

Now we come to CDR (Call Detail Records), a geospatial data science method that operates in a similar manner to nightlight and daytime imagery. While satellite imagery and RS (remote sensing) data rely on physical properties, CDR relies on data produced by phone usage and cell towers. Phone usage can indirectly indicate access to financial resources, and movement of the phones themselves can signal individual migration toward better economic opportunities. Many people regard data collection from phones as an invasion of privacy; however, CDR data preserves individuals’ privacy by deriving data at the physical level of cell towers. Since privacy is one of the salient harms of data science, it is interesting to see how Jessica Steele (the lead author of “Mapping poverty using mobile phone and satellite data”) addresses this potential harm. Steele uses hierarchical Bayesian geostatistical models (BGMs) to produce high-resolution maps of poverty. A BGM is a model that uses probability to represent uncertainty within a given model. This uncertainty can concern both input and output data, which once again addresses the salient harm of bad data/analytics. Steele started out by approximating mobile tower coverage using Voronoi polygons, which can capture spatial detail in both urban and rural areas.
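A Voronoi polygon around a tower is simply the set of all locations that are closer to that tower than to any other, so assigning data to polygons reduces to a nearest-tower lookup. The sketch below illustrates that idea only. The tower coordinates, household records, and function names are hypothetical, plain Euclidean distance stands in for proper geographic distance, and none of it is drawn from Steele's data or code.

```python
from collections import defaultdict
from math import hypot

# Hypothetical tower positions on a flat x/y plane.
TOWERS = {"tower_a": (0.0, 0.0), "tower_b": (10.0, 0.0), "tower_c": (5.0, 8.0)}

# Hypothetical survey households with a wealth index value.
HOUSEHOLDS = [
    {"x": 1.0, "y": 1.0, "wealth_index": -0.5},
    {"x": 2.0, "y": 0.0, "wealth_index": -1.5},
    {"x": 9.0, "y": 1.0, "wealth_index": 0.8},
    {"x": 5.0, "y": 7.0, "wealth_index": 0.2},
    {"x": 6.0, "y": 9.0, "wealth_index": 0.6},
]


def nearest_tower(point, towers):
    """Return the name of the tower whose Voronoi cell contains `point`."""
    px, py = point
    return min(towers, key=lambda name: hypot(px - towers[name][0], py - towers[name][1]))


def average_by_cell(households, towers):
    """Average the wealth index of the households falling in each tower's cell."""
    cells = defaultdict(list)
    for h in households:
        cells[nearest_tower((h["x"], h["y"]), towers)].append(h["wealth_index"])
    return {name: sum(values) / len(values) for name, values in cells.items()}


cell_scores = average_by_cell(HOUSEHOLDS, TOWERS)
for name in sorted(cell_scores):
    print(name, round(cell_scores[name], 2))
```

Because towers are packed more densely in cities, the cells there are smaller, which is one way to see why this layout can carry finer spatial detail in urban areas.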

In this image we see the spatial structure of Voronoi polygons based on mobile towers in Bangladesh; the zoomed-in window shows Dhaka. Each polygon was assigned RS and CDR values alongside data collected from PPI, DHS, and income surveys. PPI (the Progress out of Poverty Index) is a tool that measures household characteristics and asset wealth and uses satellite imagery to assess whether a household is living above or below the poverty line. The DHS wealth index once again uses assets and physical characteristics to assign a WI (wealth index) score. Finally, income surveys provided individual information such as household income, age, profession, education, and phone usage. All this information combined provides a value or score for each Voronoi polygon, giving information for each sector of land in Bangladesh. The DHS wealth index, when modeled with both CDR and RS data, actually obtained a higher R-squared value in urban areas than in rural areas. This is exactly the answer to my research question that I was looking for: when CDR data is combined with RS data, it can predict poverty in cities where poverty is not as openly apparent. Though CDR data does not perform as well on its own, it is still very interesting to see its effect. CDR data is updated more frequently, which may explain its positive effect on the results.

Here we see a cross-validation based on the data generated by the Bayesian geostatistical model. We can clearly see that the DHS WI with CDR-RS data performs the best. To the right we can see the predicted proportion of the population living below the poverty line. Unlike the satellite imagery study conducted by Asmi Kumar, this study included Dhaka, a dense urban area. Dhaka is the capital of Bangladesh, and this is the first study I came across that looks at poverty in an urban setting. This addresses my central research question: “Can these predictive models and mapping work in regions where poverty is not as apparent or outright?” It was very interesting to see how the BGMs produce accurate high-resolution poverty maps in LMICs (low- and middle-income countries) in a way that is complementary to normal census and survey information. Of course, CDR data produced the best results when combined with RS data; however, this study still showed the significance of CDR.
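Cross-validation checks a model on data it was not fitted to. The sketch below shows the general k-fold idea with a simple one-variable linear model. It is only an illustration: the data and function names are hypothetical, the folds are contiguous rather than shuffled, and the model is far simpler than the Bayesian geostatistical models in Steele's paper.

```python
from math import sqrt


def fit_line(xs, ys):
    """Ordinary least squares for one feature: return (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def k_fold_indices(n_items, k):
    """Split range(n_items) into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n_items, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validated_rmse(xs, ys, k):
    """Fit on k-1 folds, predict the held-out fold, and pool the errors into one RMSE."""
    squared_errors = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        slope, intercept = fit_line([xs[i] for i in train], [ys[i] for i in train])
        squared_errors.extend((ys[i] - (slope * xs[i] + intercept)) ** 2 for i in held_out)
    return sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical covariate and wealth index values for eight polygons.
covariate = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
wealth_index = [0.9, 2.1, 2.9, 4.2, 4.8, 6.1, 7.0, 7.9]
print("Cross-validated RMSE:", round(cross_validated_rmse(covariate, wealth_index, k=4), 3))
```

A lower out-of-sample error suggests the model generalizes to places it has not seen, which is the property that matters when a poverty map is extended to unsurveyed areas.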

Conclusion

Witnessing the capability of these two geospatial data science methods was very impressive. Although I feel that satellite imagery produces better poverty maps given the amount of information it can collect, I still feel CDR data is important. The results show real promise, and the next step is to evaluate countries or states where poverty is not apparent. After seeing how these methods operate and work with each other to produce efficient poverty maps, I doubt that this will be an issue. These methods have shown that they can address the salient harms that come with data science: fixation, cost, security/privacy, and bad data/analytics. The data and findings addressed the essential elements of my research question. It gives me great comfort that researchers are constantly revising and improving their data collection and analysis. They have come a long way, and I believe that in a few decades poverty will simply be a part of the distant past.

Sources

Castelan, C. R. (2019, July 9). Making a better poverty map. Retrieved March 22, 2021, from https://blogs.worldbank.org/opendata/making-better-poverty-map

Horton, M. (n.d.). Stanford scientists COMBINE satellite data, machine learning to map poverty. Retrieved March 22, 2021, from https://pangea.stanford.edu/news/stanford-scientists-combine-satellite-data-machine-learning-map-poverty

Kumar, A. (2020, July 06). How to understand global poverty from outer space. Retrieved March 22, 2021, from https://towardsdatascience.com/how-to-understand-global-poverty-from-outer-space-442e2a5c3666

Martinez, A., Jr. (2021, March 11). Using machine learning on satellite images to map poverty. Retrieved March 22, 2021, from https://development.asia/insight/using-machine-learning-satellite-images-map-poverty

Pape, Utz; Parisotto, Luca. 2019. Estimating Poverty in a Fragile Context: The High Frequency Survey in South Sudan. Policy Research Working Paper No. 8722. World Bank, Washington, DC. © World Bank. https://openknowledge.worldbank.org/handle/10986/31190

Pape, Utz; Wollburg, Philip. 2019. Estimation of Poverty in Somalia Using Innovative Methodologies. Policy Research Working Paper No. 8735. World Bank, Washington, DC. © World Bank. https://openknowledge.worldbank.org/handle/10986/31267

Steele, J. (2017, February 19). Mobile phones can create high-resolution poverty map. Retrieved February 23, 2021, from https://www.indiatoday.in/technology/news/story/mobile-phones-can-create-high-reolution-poverty-map-959791-2017-02-09

Steele JE et al. 2017 Mapping poverty using mobile phone and satellite data. J. R. Soc. Interface 14: 20160690. http://dx.doi.org/10.1098/rsif.2016.0690