Part 1: Building Citrics, an App that Facilitates Your Moving Process
Let’s face it: if you’ve gone through the process of relocating, you know what a daunting task it can be. Will I be able to find work in my field in this new city? I love the weather and atmosphere there during the summer, but can I afford the rent? I hate to leave behind my friends and family; will it be easy for me to meet new people in this city? These are common questions you might find yourself asking when it comes time to take the next step in your life. Although this process is anxiety-provoking, and justifiably so, we’re working alongside Citrics to ease that stress. Citrics is an organization that understands finding a new place to live is hard and has a passion for helping you find the right city.
Citrics is an app that allows you to select a city, or compare multiple cities, and view some of their most important metrics: historical population, age demographics, seasonal weather averages, rental prices, unemployment rates, and each city’s overall job-industry breakdown. It’s a collection of city metrics that addresses many of the need-to-know worries a person faces during the moving process. As of our first release, you can view and compare all of these metrics for 100 of the top cities in the United States. Coming into this project, I was worried that we wouldn’t be able to collect accurate data across all of the metrics for every city we wanted to include. But as I stand here… well, sit here, writing this, I can happily say that we’ve come together and successfully deployed our first release.
Despite being eager to start tackling the task at hand, as a team we first sat down and drew up our plan of attack for the first release. I can’t stress enough how important it is to fully think through and understand the problem you’re faced with before diving in. At Lambda School we have a mantra, or more accurately a problem-solving framework, that many of us adhere to when facing just about any obstacle. This framework is known as UPER, an acronym that stands for Understand, Plan, Execute, and Review. While it may be tempting to jump straight to the Execute portion of the framework, the Understand and Plan portions are arguably the most important. Our team’s first step in understanding and planning was to break the product down into user stories and then place those stories on a Trello board. User stories are short, simple descriptions of a feature told from the perspective of the person who wants the new capability. In our case, an example user story would be, “As a user, I can view the historical population of a city.” Once we have a user story, we add it to Trello, a visual tool for organizing work among teams, and start to collaborate and brainstorm all the possible ways to address it.
As you can see, the organization of our team’s thoughts and the user-story breakdown leave a brief, concise game plan that any member of our team could follow. Not only is this good for communication and keeping everyone on the same page, but it also helps me stay focused on what needs to be done, instead of just implementing my own ideas that may not serve the team’s overall mission. A common aphorism states that time is money, and when you have a month to release your first fully running app, those words could never ring truer. Breaking down the product this way has been instrumental in completing our tasks on time, keeping communication open across our teams, and facilitating new ideas as we constantly develop the best app we can imagine.
How I Learned to Stop Worrying and Love the Data
Over the last month, our team has overcome a lot of obstacles to build a working app that will help countless nomads and others who frequently find themselves looking for a new place to live. It has given me a great sense of satisfaction, but it didn’t come without its fair share of both technical and team challenges. I’ve learned a lot along the way, persevering and overcoming these challenges, and I’d like to share some of them with you. Let’s start by taking a look at my main contributions to the project.
Historical Population and City Demographics
Historical population and city demographics were my first main contributions to the project. I was tasked with cleaning, analyzing, and combining nine datasets, covering city information from 2010–2018, into two datasets. The first holds only the current information, in case the user is looking for the latest statistics for these metrics, while the other holds all nine years of information for each city, which will be used for time-series visualizations and predictions come release two. After getting the data together and having it look exactly how I wanted it to, I was able to start putting together some API endpoints to bring these dreams to life. After a few syntax errors, we were finally able to deploy. After pulling up our website and searching for San Francisco, California, this is what we see.
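The combining step can be sketched roughly like this: stack the nine yearly frames into one historical table, then slice off the latest year as the “current” table. This is a minimal, hypothetical version; the column names are illustrative, not the project’s exact code.

```python
import pandas as pd

def build_tables(frames_by_year):
    """Combine per-year frames into a historical table and a current table."""
    # Tag each yearly frame with its year, then stack them all.
    historical = pd.concat(
        [df.assign(year=year) for year, df in frames_by_year.items()],
        ignore_index=True,
    )
    # The "current" table is simply the most recent year's slice.
    current = historical[historical["year"] == historical["year"].max()]
    return historical, current.reset_index(drop=True)
```

Keeping both tables derived from one stacked frame means the current-stats endpoint and the time-series work can never drift out of sync.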
Ahh, even looking at it now, it still brings me a great sense of pride and satisfaction. Though it may not look like much yet, it took many hours of cleaning and analyzing to produce these results. This leads me to my biggest technical challenge yet.
Sourcing, Cleaning, and Analyzing Years of Demographic Data
Although historical population and city demographic metrics may seem to go hand in hand, it wasn’t easy to source and clean the data in a way that would make it easy to work with. This was the first technical problem I encountered, as I spent many hours diving into the United States Census Bureau’s API to see exactly what data could be pulled from their database. After an exhaustive evaluation of their API (no offense, US Census, but please put a data scientist in your data acquisition department), I was able to query a table with information on population, sex demographics (male/female), and age demographics (teens, young adults, adults, etc.). Now, you may be sitting there thinking, “C’mon Zack… that’s not too much info, I could do that in no time!” Well, I’ll never claim to be the best or fastest data cleaner this side of the Mississippi, but I will say that I put in a lot of effort. When I opened my first CSV in a pandas DataFrame, this is what my eyes were met with.
So many challenges pop right off the page as I look at it, but to the untrained eye, this may look like any other dataset. Let’s break this DataFrame down piece by piece. First, all of the headers are rather ambiguous, with names such as “S0101_C01_001E” and “S0101_C01_005M”. Funnily enough, the row right underneath those headers seems to provide more accurate descriptions of what the columns really are. Those descriptions should probably become the headers, but even then, they present issues of their own. They’re sort-of camelCased, seemingly capitalizing the beginnings of certain words, or capitalizing entire words. As data scientists, we can handle these issues relatively easily, but one hopes the people entering this data understand our plight and could make it easier on us. They’ve also included many exclamation marks, for example “Total!!Estimate!!Total population”. They seem to be using exclamation marks in place of spacing, but only sometimes, because in other places they still use spaces. It’s wild; I don’t mind the challenge, but I just don’t comprehend what they’re trying to accomplish with this sort of data entry. Here’s a nicer look at that same DataFrame after being cleaned.
Again, this may not look like much to you, but a lot went on behind the scenes to produce these results. First, I analyzed the city column and decided to remove redundant information while also splitting the column into two separate ones, city and state. This change was made mainly for accessibility, as it’s hard to query cities like “Boise City city, Idaho” (yes, the dataset legitimately appended “city” to names that already include the word; that was not a typo). Splitting these columns is also good for aesthetics and functionality. Another change I made, one that’s easy to overlook unless I point it out directly, is feature engineering the age columns. If you look at the original DataFrame above, and can comprehend what any of the columns are telling you, you’ll see that the age demographics come in many different age groups. I was able to rewrite the age-group column headers and then take it a step further, engineering them into features that will be more relatable to our users.
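Staying on the city column for a moment, the split described above can be sketched like this (a hypothetical version; the real cleaning steps may differ): separate “Boise City city, Idaho” into city and state columns while dropping the redundant trailing “city”.

```python
import pandas as pd

def split_city_state(df, col="city"):
    """Split 'Boise City city, Idaho' into city and state columns."""
    parts = df[col].str.split(",", n=1, expand=True)
    df = df.copy()
    # Drop the census's redundant trailing " city" from the place name.
    df["city"] = parts[0].str.replace(r"\s+city$", "", regex=True).str.strip()
    df["state"] = parts[1].str.strip()
    return df
```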
To break this down again, I had to rename their columns so that I could understand how they broke down age to begin with. Then I engineered more relatable, less cluttered columns for users to eventually interact with, and finally went back and deleted my intermediate column changes because they were no longer necessary and just made the DataFrame harder to read and understand. It was definitely a lengthy and thought-provoking process, but after breaking down the problem and reaching a clean conclusion, I feel proud and successful knowing that my work will facilitate the work of my teammates, in turn giving the user a better experience with our app.
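The age feature engineering can be sketched as summing the census’s fine-grained age-group columns into broader, user-friendly buckets and then dropping the intermediates, exactly as described above. The column names and bucket choices here are illustrative assumptions, not the project’s actual groupings.

```python
import pandas as pd

# Hypothetical bucket mapping from narrow census age groups to
# user-facing features.
AGE_BUCKETS = {
    "teens": ["age_10_to_14", "age_15_to_19"],
    "young_adults": ["age_20_to_24", "age_25_to_34"],
}

def engineer_age_buckets(df, buckets=AGE_BUCKETS):
    """Sum narrow age-group columns into broader buckets, then drop them."""
    df = df.copy()
    for name, cols in buckets.items():
        df[name] = df[cols].sum(axis=1)   # combine the narrow groups
        df = df.drop(columns=cols)        # delete the intermediate columns
    return df
```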
A challenger appears
Outside of these technical challenges, our team also had to face a challenge that was arguably more difficult. A few weeks into release one we lost a great guy, teammate, and data scientist in Karl Manalo. Karl is unfortunately going through a lot in his personal life at the moment, and after speaking with us about his hardships, he broke the sad news that he’d be going on hiatus until he can get himself into a better spot, mentally and physically. I fully respect his decision; there are priorities in life, and in my opinion he’s putting his best foot forward and doing what he needs to feel safe and secure. He was a bright spot on our team and a creative problem-solver, the kind of person you definitely notice missing from your data science team. The rest of the team quickly came together to remedy the situation. Luckily we have a superstar team and were able to do just that: we split the portion Karl was working on, with Ekram taking over job industries while I focused on collecting unemployment data. Despite coming from a different source (the Bureau of Labor Statistics rather than the US Census), the unemployment data was similar in shape to my population data. This made the cleaning relatively easy, because I had already laid the blueprint for cleaning such data. Through persistence and hard work, we got the endpoints up just in time for our first release, an objective I honestly didn’t think we’d be able to achieve. It’s amazing what teamwork and perseverance can accomplish. Here’s a look at my unemployment endpoint and a rough visualization that I’m passing on to the web team.
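The lookup behind an endpoint like this can be sketched in a framework-agnostic way: filter the cleaned table by city and state and return the matching record. The data, column names, and figures below are illustrative placeholders, not the project’s real numbers.

```python
import pandas as pd

# Placeholder table standing in for the cleaned BLS unemployment data.
unemployment = pd.DataFrame({
    "city": ["San Francisco", "Boise City"],
    "state": ["California", "Idaho"],
    "unemployment_rate": [3.1, 2.7],
})

def get_unemployment(city, state):
    """Return the unemployment record for a city, or an error payload."""
    row = unemployment[(unemployment["city"] == city)
                       & (unemployment["state"] == state)]
    if row.empty:
        return {"error": f"{city}, {state} not found"}
    return row.iloc[0].to_dict()
```

In the deployed app, a function like this would be wired into an API route so the web team can fetch the record as JSON.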
Where are you going, where do you go?
It’s been a long journey so far, or at least it’s felt like one. In a month’s time, we were able to fully plan and prepare our project, break it down into user stories, tackle those stories checklist by checklist, and talk through our decisions in a way that has allowed us to build a beautiful app that displays pertinent information to the user and has left the team proud, with a great sense of camaraderie. We’ve implemented all of the metrics we set out to in the beginning: historical population, sex demographics, age demographics, rental pricing, average seasonal weather, the unemployment rate, and the top job industries of every city. The web team has taken that information and built a beautiful, user-friendly app, one that I think will really resonate with our users. The user can either select and view the metrics of one city, or choose up to three and compare them through both numbers and visualizations. Here are some screenshots of our overall product at release one.
I’m excited heading into our second month of product work. This month we’ll be using models to make time-series predictions, my favorite part of the workload. Before we get there, though, we’re going to have to move away from CSV files and start populating our own DS database. I have a lot of experience using relational databases like Postgres, but have not done so through AWS, which is an opportunity I look forward to tackling. After we get the database set up, we can start working on the bread and butter of release two: the predictions. Ideally, I’m going to want to go back and source more data. For release one, I just took data that was pertinent and easy to find, knowing all I needed to show was about five to ten years’ worth of data for my metrics. However, when it comes to making predictions, I want my models to produce the most accurate results, and gathering as much data as possible on these metrics will put me in the best place to succeed.

Once we all feel comfortable with our datasets, we have to add our tables to the Postgres database and start designating primary keys to link them all together. This is where I foresee some challenges, as we will most likely all have different years of data, even if we can keep the structure of our city list consistent. I have a decent amount of SQL experience under my belt, but this is beyond the scope of anything I’ve ever accomplished. While this is a challenge, I’m excited to get there and tackle it. Anything I can do to further my knowledge and better myself as a data scientist, I’ll jump on immediately. I don’t have an answer yet, but I’ve been researching this topic and have found encouragement in the fact that other teams have tackled this problem before.
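One plausible way to handle the linking problem is a shared cities table that supplies the primary key, with each metric table referencing it through a composite (city_id, year) key, so tables covering different year ranges still join cleanly on city_id. This is an illustrative sketch only, with SQLite standing in for the Postgres database we’ll actually use.

```python
import sqlite3

# In-memory SQLite database standing in for the planned Postgres setup.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE cities (
    city_id INTEGER PRIMARY KEY,
    city    TEXT NOT NULL,
    state   TEXT NOT NULL
);
CREATE TABLE population (
    city_id    INTEGER REFERENCES cities(city_id),
    year       INTEGER,
    population INTEGER,
    PRIMARY KEY (city_id, year)      -- one row per city per year
);
CREATE TABLE unemployment (
    city_id INTEGER REFERENCES cities(city_id),
    year    INTEGER,
    rate    REAL,
    PRIMARY KEY (city_id, year)
);
""")
```

With this layout, a metric table missing some years simply contributes fewer rows to a join; nothing about the city list itself has to change.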
This project has been highly effective at brushing up some critical data science skills that had been on the back burner for a few months. The data sourcing and cleaning was a long process in the beginning, but once I took over the unemployment data, I was able to cut that process time almost in half. Another key takeaway is creating a wrangle function. For my first notebook, I brute-forced my way through the cleaning and reorganizing of nine separate DataFrames. While this worked, it wasn’t time efficient, and anyone looking at my notebook would be appalled, and probably lost trying to keep up with what I was doing. Karl noted in his peer feedback that I was performing quality work but could be much cleaner and more efficient with a wrangle function, so I implemented one in my unemployment notebook, and the difference is night and day. Speaking of feedback, most of mine was positive, which didn’t come as much of a surprise, as my two primary data science teammates, Karl and Ekram, had been partnered with me on projects before, so we had already established a good rapport. With that said, one of my biggest flaws is my communication and keeping in close contact with my teammates. When our review came in, I felt relieved and extremely proud that they didn’t echo those sentiments at all, but instead assured me that I’ve been doing a great job communicating, offering my insight, and being helpful in any way I can. It’s taken a lot of hard work to get to the point where I’m receiving such praise, and it’s hard to put into words how much it means that they said such kind things about me.
Moving into the job market, I think this experience has been highly beneficial. From my improvements in communication, to working closely and openly as a teammate, to thinking critically while sourcing, cleaning, and analyzing, to working cross-functionally to set up working endpoints, I feel like I’ve improved so much as a data scientist and a person during this month. Well, not just this month, but my whole time here at Lambda. I legitimately feel like a new person, full of energy and striving to learn every little thing that can help me become a better data scientist. With this newfound sense of ambition and excitement, I look forward to carrying this into my job interviews and ultimately getting hired. Every step of the way, I learn that there’s so much more for me to know. But that’s a good thing; you just have to keep an open and positive mindset, and go into every day hoping to come out with another piece of information you didn’t know before. This project has also given me plenty of interview talking points: from deploying locally with Docker, to deploying through AWS Elastic Beanstalk, to breaking the problem down into user stories and using team organizational tools like Trello, I have no shortage of experience to draw my answers from. This has been an amazing adventure so far, and I’m excited to see what we can achieve with another month of hard work.