D A T B L O G: 2015

Thursday, April 2, 2015

Day or Night?

I stumbled upon a really cool website for programming practice: HackerRank

[Disclaimer: I have no affiliation with them!]

I like HackerRank because they have compact questions that challenge pure topics in programming. There are contests, but not like ChallengePost or TopCoder, where competitions are coupled with a company trying to promote its newest tool/platform/etc. HackerRank has questions that fall within domains, such as algorithms, machine learning, and much more. You can write your solutions in any language. Submit and wait for your scores. That's it. The problems are compact enough to be solved in about an hour, which makes them good for interview practice. So check it out.

So with that said, I had a go at the machine learning question: determine if an image is night or day. That sounded fun. Try it yourself.

What I want this post to be is like a stream of consciousness of going about this problem. I think this will be interesting also because I Googled something along the lines of "determine day or night image" and nothing immediate came up!

So my first thought was: blue!

When a movie has a nighttime scene, oftentimes it is filmed during the day and given a blue tint to make it look like night. (See "Blue Tint" in this Wiki article.) So my thinking was that if I broke down a picture and saw more "blue", it must be a nighttime picture. (I will explain some flaws in just a bit.)

Credit: Hejl. License. No changes.

So we know we can extract the RGB values from the pictures. In short, every color in a picture is a combination of 3 fundamental colors: red, green, and blue. When we digitize it, we can quantify the amount of each using a value between 0 and 255 (for a typical 8-bit image), where 0 means the color is not present and 255 means the color is fully present. So we are able to identify how much "blue" is in a picture!

But how do we summarize the "blue-ness" of a picture? Should we sum the blue value in every pixel? No, that wouldn't be good: bigger pictures have more pixels, so they would look more "night-like" just because of their size. I think the average blue value would be a better summary. So I did that. And since this is a machine-learning challenge, I went looking for pictures for a training set. Check out the set here!
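Here is a minimal sketch of the averaging idea in Python. It is not the code used for this project: the function name `mean_rgb` and the tiny pixel lists are made up for illustration, and it assumes the pixels have already been read into a list of (r, g, b) tuples (for example, by an image library such as Pillow).

```python
def mean_rgb(pixels):
    """Return the average (R, G, B) of an iterable of (r, g, b) tuples."""
    total_r = total_g = total_b = 0
    count = 0
    for r, g, b in pixels:
        total_r += r
        total_g += g
        total_b += b
        count += 1
    if count == 0:
        raise ValueError("image has no pixels")
    return (total_r / count, total_g / count, total_b / count)


# Two made-up "images" with the same colors but very different sizes.
small = [(10, 20, 60), (30, 40, 80)]
large = small * 100

print(mean_rgb(small))
print(mean_rgb(large))
```

Both calls print the same averages, which is the point: unlike a sum, the mean does not grow with the number of pixels.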

I wrote some Python code to summarize the RGB and I've plotted them against each other:

Look at all the night points: they all clump towards the bottom left. All of them are low in value! Blue by itself isn't a good metric. Going by what I said earlier, a giant picture of a blue flower in broad daylight would be considered a nighttime image! The better metric would be that all color values are low.

So what should I choose to classify these data points? Linear regression would not make sense here. I need to establish that photos with low RGB values are nighttime images. I chose the very simple Naive Bayes. (Are all features really independent though? To be discussed...)
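To make the idea concrete, here is a rough sketch of a Gaussian Naive Bayes classifier written from scratch. It is an assumption-laden illustration, not this project's code or results: the function names `fit_gaussian_nb` and `predict` are hypothetical, each color channel is assumed to be normally distributed within a class, and the training values are invented toy numbers rather than the real training set.

```python
import math


def fit_gaussian_nb(samples, labels):
    """Estimate a prior plus per-feature mean and variance for each class."""
    model = {}
    for label in set(labels):
        rows = [s for s, lab in zip(samples, labels) if lab == label]
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small floor keeps the variance away from zero.
        variances = [
            max(sum((x - m) ** 2 for x in col) / n, 1e-6)
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(samples), means, variances)
    return model


def predict(model, sample):
    """Return the label with the highest log posterior."""
    best_label, best_score = None, -math.inf
    for label, (prior, means, variances) in model.items():
        score = math.log(prior)
        # "Naive" step: features are treated as independent,
        # so their log likelihoods are simply added together.
        for x, m, v in zip(sample, means, variances):
            score += -0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented mean-RGB values, for illustration only.
train_x = [(20, 22, 35), (30, 28, 40), (15, 18, 30),
           (150, 160, 170), (120, 140, 180), (180, 170, 160)]
train_y = ["night", "night", "night", "day", "day", "day"]

model = fit_gaussian_nb(train_x, train_y)
print(predict(model, (25, 25, 38)))
print(predict(model, (140, 150, 165)))
```

The inner loop is where the independence question shows up: red, green, and blue are scored separately even though, in real photos, they tend to rise and fall together.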

This project will be continually updated. I'll reveal the outcome of Naive Bayes next time. And then we'll tackle the proper way to do image recognition: convolutional networks.

Sunday, March 1, 2015

LA Food Deserts

Instructions
Desktop version

iPad version
(I do not recommend viewing the visual on mobile)

Click on a zipcode.

The selected zipcode remains and other zipcodes should be grayed out.
Hovering over a polygon will give you the zipcode, city name, and median income for the area.

The "Grocer Yelp Ratings", "Access Points", and "Adjusted Gross Income" information should update with data for the selected zipcode.

To use the filter box, type in the zipcode and press "Enter". All zipcodes should disappear and the matching zipcode will remain (if any). No information will update until that zipcode is clicked on.

Multiple zipcodes can be selected also.

Background

When I worked in South LA, it seemed like there were no grocery stores nearby. Sure, there were plenty of fast food places around for burgers and fries. But when I wanted to buy groceries to leave at my desk for the work week, it seemed like I had to look far and wide. Why was it so hard to find a grocery store for food when a McDonald's was right around the corner? I was beginning to believe I worked in the middle of a food desert.

A food desert is a geographic area where residents lack access to healthy food options. And lack of access could mean many things: distance, convenience, or low-income. All these things could potentially affect a resident's ability to acquire healthy food options.

I looked on USDA's Food Access Research Atlas to see if I worked, according to the US government, in a food desert. To my surprise, the area where I worked was not a food desert! A grocer may have been within a mile, but it felt so inconvenient when all these other food options were so common. There was a casual bakery/cafe place known for having healthy sandwiches, but they were pretty pricey. It would add up to a serious expense if I ate there on a daily basis! So I was experiencing a lack of access and I didn't even live in the area! I just worked there!

Map Features

This inspired me to make the visualization you see here. I'm proud to provide a visualization of food deserts that captures more of the factors that define the issue. Lack of access is a multi-faceted problem; it's more than just the distance to a grocery store. Poverty and income come into play. Do the grocery stores provide an economically viable source of food for residents? The USDA used a population-weighted centroid to compare distances of census tracts to grocery stores. I was able to find a dataset for Los Angeles that provided all residential zones. So distances are calculated from the residences themselves!

The density of grocery stores vs. other food options is important. Especially in urban areas, where distances are much shorter than in rural areas, fast food places could be an easy choice just by sheer abundance. McDonald's is only used as a metric to quantify convenience. McDonald's is not the cause of food deserts. It makes a great metric because its restaurants are affordable and very commonplace.

The Yelp ratings should also help identify whether a grocer is an affordable option. What would make sense is seeing more upscale grocers near higher-income areas. If there were upscale grocers closer to poor areas, it would be unlikely that the residents could afford them. And if they can't afford them, they are more likely to get their food elsewhere.

The income distribution is there to give a viewer a better summary of the socioeconomics of the area.

With the Yelp Grocer Ratings and socioeconomic distribution included in the data set, I hope to provide a higher fidelity to the food desert issue in Los Angeles in addition to the distances and abundance of grocers.

Why are you picking on McDonald's?
McDonald's is constantly under the public lens when the topic is public health and nutrition. McDonald's is included not because it is the cause of food deserts but because it is such a great metric. I believe that convenient and cheap will always win, and McDonald's embodies both. Its distribution in a zipcode can be a great sample of the access a citizen has to fast food. After all, McDonald's isn't the only fast food option available. There are many others. So to represent the alternative choice a citizen would have as opposed to grocery stores, I chose McDonald's.

But if you believe convenient and cheap will always win, what chance do grocery stores and healthy eating have?

This is where I believe education plays a strong role, specifically knowledge of the effects of consuming too much fast food too often. Education and knowledge would be a great addition to this dataset since I try to capture all facets of the problem. However, it would be very hard to identify something to quantify education... (email if you think of something!)

photo credit: Karen Chu

Analysis

Distance to a grocer alone is not a good indicator of poor access. By examining a zipcode like Malibu 90265, you can see that the median distance to a grocer is ~2.5 miles. And 50% of residents drive further than that! However, the median income for the area is $200,000 or more! Those with more disposable income can afford the longer trip. So food deserts should be identified with more than one metric. At the very least, the definition should be based on income and distance to a grocery store. So let's examine zipcode 90810, an area within Long Beach. It is identified as a low-income and low-access area. There are 3 grocers and if the median distance is a mile, 2 grocers are within a mile! All the while, there are 2 McDonald's within half a mile! Looking at the Yelp ratings, there is only one rating and it's two dollar signs ('$$')! Surely, a low-income area would benefit from a grocer that is considered a little more affordable (one dollar sign '$').

The Long Beach zipcode 90810 can be seen to have both ends of the income spectrum. There are people throughout the entire income ranges. This isn't particularly the norm. It's possible, like in zipcode 90059, to have nobody earning over $200,000 in the area. So in a diverse area such as Long Beach, there are high-income and low-income neighborhoods. If there were further investigations, what kind of neighborhood is close to the grocer? The high-income or the low-income neighborhood? What would it mean if the grocer serves one type of neighborhood instead of another? So it seems that even within a zipcode with grocery stores, access can still be poor for certain neighborhoods.

A low-income or low-access area follows the definition provided by the USDA:

1. They qualify as "low-income communities", based on having: a) a poverty rate of 20 percent or greater, OR b) a median family income at or below 80 percent of the area median family income;
2. They qualify as "low-access communities", based on the determination that at least 500 persons and/or at least 33% of the census tract's population live more than one mile from a supermarket or large grocery store (10 miles, in the case of non-metropolitan census tracts).
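The two criteria above can be sketched as a pair of small checks. This is only an illustration of the quoted definition, not the USDA's or this project's actual code. The function and parameter names are hypothetical, rates are assumed to be fractions between 0 and 1, and `people_far` is assumed to already count residents beyond the one-mile (or ten-mile, for non-metropolitan tracts) threshold.

```python
def is_low_income(poverty_rate, tract_median_income, area_median_income):
    """Criterion 1: poverty rate >= 20% OR median family income <= 80% of the area's."""
    return (poverty_rate >= 0.20
            or tract_median_income <= 0.80 * area_median_income)


def is_low_access(people_far, population):
    """Criterion 2: at least 500 people OR at least 33% of the tract live far from a store."""
    if population <= 0:
        raise ValueError("population must be positive")
    return people_far >= 500 or people_far / population >= 0.33


# Made-up tract numbers, for illustration only.
print(is_low_income(0.25, 60000, 60000))  # high poverty rate
print(is_low_income(0.10, 55000, 60000))  # neither condition met
print(is_low_access(600, 10000))          # at least 500 people are far away
print(is_low_access(100, 5000))           # neither condition met
```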

I hope that this provides a higher fidelity to the issue but I don't claim this will solve it. It should be used as a tool to identify the areas and then those who can investigate at the ground level, politicians and health advocates, can determine what solutions would best apply to that case.

Impact

My dataset can be useful to politicians resolving the issue of food deserts. Los Angeles dictates incentives for grocers to open their doors in South LA [link]. This visualization can help them determine more specifically which regions would benefit most from having a local grocer. The grocery stores themselves can determine whether their branch is suited, price-wise, to open in a region. Public health advocates can use it to identify locations that would greatly benefit from educational efforts on health awareness.

Update Awesome video about "South Los Angeles" and Food Deserts.

Tools and Data

Data sources

IRS SOI Tax Stats for Adjusted Gross Income by Zipcode
LA County GIS Portal: Zipcode Boundaries and Residential Zoning
McDonald's and Grocery Store locations are from Yelp, crawled by import.io

Tools

Crawler: import.io
QGIS: distances and geographic-based joins
Viz: Tableau Public (beta)
IBM Analytics for Hadoop Bluemix: data wrangling

Thursday, February 26, 2015

Lunar New Year Winning (?)

I gambled for the first time in my life at the age of 5. I also learned that I was a sore loser. I cried watching my money get snatched away. It wasn't even my money. It was my dad's.

I was playing the classic Vietnamese game bầu cua cá cọp. Commonplace during the Lunar New Year, this dice game has 6 squares, each containing one of the following images: stag, crab, fish, prawn, rooster, and a fruit (calabash squash)! Place money, any amount, on any square, on as many squares as you want. 3 dice are rolled. If one die faces up with an image that matches a square you have placed money on, you win the amount you've bet. If 2 dice face up with an image that matches a square you have placed money on, you win 2x the amount you've bet. If all 3 dice face up with an image that matches a square you have placed money on, you win 3x the amount you've bet! Any money on losing squares, squares whose image does not appear face up on any of the dice, gets gathered up by the "dealer" (the person running the game, most likely the owner of the board and dice, and the bowl and plate they're shaken in).

So let's say I put $1 on the squash.

If one die comes up as "squash", I receive $1 from the dealer.

The probability of this happening is the chance one die faces up with the squash and the others do not face up with the squash.

Scenario 1

Dice 1 = "squash" AND Dice 2 = "NOT squash" AND Dice 3 = "NOT squash"

P(dice 1 is squash) * P(dice 2 is not squash) * P(dice 3 is not squash)

1/6 * 5/6 * 5/6 = 25/216

But Dice 2 could face up with squash, while the others do not, and you can still win $1!

Scenario 2

P(dice 1 is not squash) * P(dice 2 is squash) * P(dice 3 is not squash)

5/6 * 1/6 * 5/6 = 25/216

And don't forget Dice 3!

Scenario 3

P(dice 1 is not squash) * P(dice 2 is not squash) * P(dice 3 is squash)

5/6 * 5/6 * 1/6 = 25/216

So the probability that you win $1 at all is if Scenario 1 OR Scenario 2 OR Scenario 3.

P(Scenario 1) + P(Scenario 2) + P(Scenario 3)

25/216 + 25/216 + 25/216 = 75/216

What if we get lucky? Consider the chances of having 2 dice come up with the squash.

Dice 1 = "squash" AND Dice 2 = "squash" AND Dice 3 = "NOT squash"

P(dice 1 is squash) * P(dice 2 is squash) * P(dice 3 is not squash)

1/6 * 1/6 * 5/6 = 5/216

And don't forget the other combinations of dice:

P(dice 1 is squash) * P(dice 2 is not squash) * P(dice 3 is squash) = 1/6 * 5/6 * 1/6 = 5/216

P(dice 1 is not squash) * P(dice 2 is squash) * P(dice 3 is squash) = 5/6 * 1/6 * 1/6 = 5/216

The probability that you win 2x the amount you bet is the sum of the 3 combinations:

P(2 dice face up with your image) = 5/216 + 5/216 + 5/216 = 15/216.

What if we got REALLY lucky? To win 3x the amount you bet, all 3 dice have to face up with your square's image:

P(dice 1 is squash) * P(dice 2 is squash) * P(dice 3 is squash) = 1/6 * 1/6 * 1/6 = 1/216.

Winning 3x the amount you bet is rare. 1/216 means there is less than half of a percent chance to win!

What are your chances that AT LEAST one die comes up with your square? That's the probability that one die comes up OR 2 dice come up OR 3 dice come up!

P(one die matches) + P(2 dice matches) + P(3 dice matches) = 75/216 + 15/216 + 1/216 = 91/216
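The cases above are the binomial distribution with 3 trials and a 1/6 chance of success per die, so they can be double-checked with exact fractions. This is a small sketch assuming fair, independent dice; the helper name `p_matches` is made up for illustration.

```python
from fractions import Fraction
from math import comb


def p_matches(k, dice=3, faces=6):
    """Probability that exactly k of the dice show the chosen face."""
    p = Fraction(1, faces)
    return comb(dice, k) * p ** k * (1 - p) ** (dice - k)


for k in range(4):
    print(k, p_matches(k))

# At least one match is the complement of no matches.
p_at_least_one = 1 - p_matches(0)
print(p_at_least_one, float(p_at_least_one))
```

Note that `Fraction` reduces to lowest terms, so 75/216 is displayed as 25/72 and 15/216 as 5/72.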

That's a 42.13% chance that you win any money at all! That's pretty close to 50%! That's not bad! Now what is your average payout?

For that, we multiply each probability with the profit of each scenario:

($Profit for a 1 dice match)*P(1 dice matches) + ($Profit for a 2 dice match)*P(2 dice matches) + ($Profit for a 3 dice match)*P(3 dice matches) + ($Profit for no dice match)*P(no dice matches)

($1)*(75/216) + ($2)*(15/216) + ($3)*(1/216) + (-$1)*(125/216) = -$0.0787
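The same expected-value sum can be checked in a few lines of Python with exact fractions. This is just a sketch of the arithmetic above for a $1 bet; the variable names are arbitrary.

```python
from fractions import Fraction

# Profit in dollars for a $1 bet -> probability of that outcome.
outcomes = {
    1: Fraction(75, 216),     # exactly one die matches
    2: Fraction(15, 216),     # exactly two dice match
    3: Fraction(1, 216),      # all three dice match
    -1: Fraction(125, 216),   # no die matches, the bet is lost
}

# The probabilities must cover every possible roll.
assert sum(outcomes.values()) == 1

expected = sum(profit * p for profit, p in outcomes.items())
print(expected, round(float(expected), 4))
```

The exact value is -17/216 of a dollar per $1 bet.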

On average, you lose about 8 cents a game! That doesn't sound too bad in the long run. But what if you're the dealer? And there are multiple players? And they are betting on multiple squares? That can really add up :)
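To get a feel for how it adds up on the dealer's side, here is a rough sketch. It assumes fair dice and the payout rules described above, in which case every dollar bet on any square is worth 17/216 of a dollar to the dealer on average, no matter how the bets are spread. The function name and the example table of players are hypothetical.

```python
from fractions import Fraction

# Dealer's expected gain per $1 bet (the player's expected loss).
HOUSE_EDGE = Fraction(17, 216)


def dealer_expected_take(bets):
    """Expected dealer profit for a list of individual bet amounts in dollars."""
    return sum(bets) * HOUSE_EDGE


# Made-up example: 5 players each put $1 on 2 squares, for 100 rounds.
bets = [1] * (5 * 2 * 100)
take = dealer_expected_take(bets)
print(take, round(float(take), 2))
```

This is an average over many rounds, not a guarantee: on any single roll the dealer can still lose money.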