Part 2: Building Citrics, the App that Facilitates Your Moving Process
Hi all you cool cats and kittens… wait, that’s not my opening… hey guys and girls, we’re back at it again, this time with a fresh update on our Labs project as my time comes to an end here at Lambda School. If you missed the first part of this blog series, follow the link here to get a better look at what this project’s all about and the obstacles our group has tackled along the way. With my brief introduction out of the way, let’s get down to the nitty-gritty of what my team’s accomplished in our last month of production, shall we?
There is a light and it never goes out
I know, I know… lyrics from The Smiths? But these words are very apropos and epitomize our team’s experience in this second and final month of project work. Being down a data science teammate definitely made an impact this time around; we had so many big plans, and even with the extra hours we put in, we didn’t come close to completing everything. My main contributions this time around were sourcing additional historical population demographic data, performing time-series forecasting on the newly updated dataset, combining all of our current city metric data into one table, populating a Postgres database with all of our data, and creating and consolidating DS endpoints to pass data to our back-end web team. It’s a lot to cover, so let’s break right into it, shall we?
Predicting future population
This was the part of the project I was most excited to get to: predictive modeling. Before I could start, though, I needed to obtain as much data as possible in order to return the most accurate predictions. I managed to pull the new 2019 population demographic data from the Census, which turned out to be very lucky, because reaching further back in time proved fruitless. I was using the US Census Bureau’s ACS 1-year surveys as my data source; they date all the way back to 2005, but after searching tirelessly for the 2005–2009 releases, I had to accept that I wasn’t going to get that data. Very unfortunate, since that left me only 10 years of data to train on, but I continued to push forward. I started experimenting with models, forecasting with linear regression (my baseline, which actually returned pretty accurate predictions, roughly 80%), a random forest regressor, and XGBRegressor. Unfortunately, the more complex models were overfitting on the small sample size, and I was forced to move in another direction. After speaking with our DS manager in Labs, Ryan Herr, he gave us the idea to look into Facebook Prophet, a forecasting model that handles not only small datasets well, but also large gaps in data. I was skeptical at first, but the model actually performed really well and gave pretty accurate predictions for the first 5 years.
The model does require you to feed it two specific columns: one named ‘ds’ for the observation date and one named ‘y’ for the target you’re predicting. Other than that, the fbprophet model is very intuitive and easy to learn. That said, there are many more features and intricacies to the model that I wasn’t able to explore in this month of work, and I encourage you to try some time-series forecasting yourself.
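Here’s a minimal sketch of that workflow. The yearly population figures below are made up for illustration (our real inputs were the ACS 1-year estimates), and the Prophet import is wrapped so the data prep still runs if the library isn’t installed:

```python
import pandas as pd

# Hypothetical 10 years of annual population estimates for one city
df = pd.DataFrame({
    "year": list(range(2010, 2020)),
    "population": [8_175_000 + 40_000 * i for i in range(10)],
})

# Prophet wants exactly two columns: 'ds' (datestamp) and 'y' (target)
ts = pd.DataFrame({
    "ds": pd.to_datetime(df["year"].astype(str), format="%Y"),
    "y": df["population"],
})

try:
    from prophet import Prophet  # older installs: from fbprophet import Prophet

    model = Prophet()
    model.fit(ts)
    # Forecast 5 yearly steps past the training data
    future = model.make_future_dataframe(periods=5, freq="YS")
    forecast = model.predict(future)
    print(forecast[["ds", "yhat"]].tail())
except ImportError:
    print(ts.head())  # Prophet not installed; just show the prepared frame
```

The nice part is that everything model-specific lives in those last few lines; the rest is ordinary pandas.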
Postgres and chill
My next big contribution to our team’s project was to join all of our current city metrics together into one table and populate an AWS RDS Postgres database. Joining the current metric data was rather simple, since it’s only one year (the most recent) of data, and I could be confident I wasn’t introducing NaNs into our dataset and possibly breaking some of our features.
Originally I wanted to expand my SQL knowledge and combine these datasets using queries, but I needed to get this data to our team as quickly as possible, so I decided to merge them using pandas in a Google Colab notebook, a process I’m much more familiar with. With every merge, the second dataset would bring in redundant columns, so after the first one, I decided to drop those columns before merging and the process became a lot more streamlined.

With all the data collected, and all of the combinable data combined, I needed to populate the Postgres database that I set up on AWS RDS. Setting up the database is a feat of its own, and I don’t have enough time to chronicle that endeavor, but I’ll leave this link here, which has documentation on setting up your very own database through AWS RDS. When it came time to populate the database, I used a very cool and useful program called TablePlus. TablePlus is extremely intuitive for both connecting to the database and building tables: you can connect to your database via URL, which is a very handy feature, or you can manually enter your DB credentials and connect the old-fashioned way. Once you’ve successfully connected, building tables is as simple as right-clicking and importing your CSVs.
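In pandas terms, that drop-then-merge pattern looks something like this. The tables and columns here are made up stand-ins for our metric datasets:

```python
import pandas as pd

# Hypothetical current-year metric tables keyed on a shared city column
population = pd.DataFrame({"city": ["New York", "Los Angeles"],
                           "population": [8_336_817, 3_979_576]})
rental = pd.DataFrame({"city": ["New York", "Los Angeles"],
                       "state": ["NY", "CA"],
                       "median_rent": [2900, 2200]})
weather = pd.DataFrame({"city": ["New York", "Los Angeles"],
                        "state": ["NY", "CA"],  # duplicated across tables
                        "avg_temp_f": [55.1, 66.2]})

# First merge brings in 'state' once
merged = population.merge(rental, on="city", how="inner")

# Drop the redundant column from the next table before merging
weather = weather.drop(columns=["state"])
merged = merged.merge(weather, on="city", how="inner")
print(merged)
```

Dropping the duplicated columns up front avoids the `state_x`/`state_y` suffixes pandas would otherwise create.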
Before I knew it, I had our database populated and ready for querying. Unfortunately, I ran out of time to wire that database into our DS endpoints, but it’s a project I look forward to tackling after achieving endorsement.
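If you’d rather load tables programmatically than through TablePlus, a pandas-plus-SQL sketch might look like this. The RDS URL in the comment is a placeholder, and sqlite3 stands in here so the example runs without a live database:

```python
import sqlite3
import pandas as pd

# For the real thing you'd point pandas at Postgres via an SQLAlchemy engine, e.g.
#   create_engine("postgresql://user:password@<your-rds-host>:5432/citrics")
# sqlite3 stands in so this sketch runs anywhere.
conn = sqlite3.connect(":memory:")

# Hypothetical slice of the merged current-metrics table
metrics = pd.DataFrame({
    "city": ["New York", "Los Angeles"],
    "population": [8_336_817, 3_979_576],
})
metrics.to_sql("current_metrics", conn, if_exists="replace", index=False)

# Query it back, the same way the DS endpoints eventually would
out = pd.read_sql("SELECT * FROM current_metrics", conn)
print(out)
```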
This is the end…points section
Let me start off by apologizing for that horrific pun, but sometimes I can’t help myself. Prior to this project, I had only set up endpoints a few times in prior cross-functional team exercises, and this is the part of the project where I feel I really grew and gained the most understanding. At the start of the month, we had a separate endpoint returning data for every metric we’ve covered (population, unemployment, weather, rental, job market). Even though this process worked, it wasn’t very efficient: the endpoints were cluttered and messy, and the query times were climbing, making our web application less responsive. To remedy this, I took the merged dataset with all of the current city metrics in it and used it to consolidate our endpoints into one. Unfortunately, jobs didn’t make it into the list; being short a DS teammate, we weren’t able to wrangle the jobs data into a format that would merge cleanly with the other datasets. I don’t have a picture to show you how cluttered the endpoints were before, you’re just going to have to believe me, but this is what they look like now after being consolidated.
Ahhh, it fills my heart with joy to see the endpoints so clean and precise. On top of the aesthetic benefits, we created and stuck to a naming convention that made the endpoints easy for our front end to use, and consolidating them also sped up the loading times on our web app… just an overall thing of beauty. It was a lot of hard work, but it really paid off here at the end of our Labs experience.
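To give a rough idea of the consolidation, here’s a hedged sketch of the single-lookup pattern. The data and the function name are made up, and in the real app this logic sits behind one route in our API framework rather than one route per metric:

```python
import pandas as pd

# Hypothetical slice of the merged current-metrics table
current_metrics = pd.DataFrame({
    "city": ["New York", "Los Angeles"],
    "population": [8_336_817, 3_979_576],
    "unemployment_rate": [4.3, 4.7],
    "median_rent": [2900, 2200],
})

def get_current_metrics(city: str) -> dict:
    """Return every current metric for one city as a single JSON-ready dict."""
    row = current_metrics.loc[current_metrics["city"] == city]
    if row.empty:
        return {"error": f"{city} not found"}
    return row.iloc[0].to_dict()

print(get_current_metrics("New York"))
```

One lookup against the merged table replaces five separate round trips, which is where the responsiveness win came from.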
Coming to a close
With our Labs experience and time here at Lambda coming to a close, it’s my pleasure to share with you our project in its current state. In part 1, you saw that our web app only let you search and compare cities by name, but this month we added an advanced search feature. Our Labs TL and fellow data scientist, Bhavani, took on the challenge of creating an advanced search endpoint, and it works amazingly. Now users can set certain metrics to their liking and be returned a list of cities that meet their criteria.
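The filtering behind such an endpoint can be sketched in a few lines. The cities, values, and function name below are made up for illustration:

```python
import pandas as pd

# Hypothetical metrics table; the real endpoint filters the full Citrics dataset
cities = pd.DataFrame({
    "city": ["New York", "Los Angeles", "Austin", "Boise"],
    "population": [8_336_817, 3_979_576, 978_908, 228_959],
    "median_rent": [2900, 2200, 1500, 1100],
})

def advanced_search(max_rent=None, min_population=None):
    """Return cities matching whichever criteria the user supplies."""
    result = cities
    if max_rent is not None:
        result = result[result["median_rent"] <= max_rent]
    if min_population is not None:
        result = result[result["population"] >= min_population]
    return result["city"].tolist()

print(advanced_search(max_rent=1600, min_population=500_000))  # -> ['Austin']
```

Each criterion is optional, so users only constrain the metrics they actually care about.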
This is a critical addition to the project and one that will be highly sought after by future users. A user is likely to have specific preferences about where they want to live, and advanced search makes finding a match dramatically easier. Another huge addition since my first blog post is our population and rental predictions, which give users a visual glimpse into how their desired city is trending over the next 5 years. Ekram tackled the rental predictions, while I covered the population and population density forecasts. As I alluded to earlier, we both used Facebook Prophet to model our predictions, and it provided results we’re very happy with.
These predictions help users see how their desired city is trending, invaluable information that gives someone who’s about to move confidence that they’re making the right choice. These additions have been instrumental in solidifying our web application, and implementing them with my teammates has been a fun but challenging learning experience.
So… your time at Lambda’s come to a close… what next?
Such a good question, me, thanks for asking! Even though our time here at Lambda School is coming to an end, my team and I are still excited to continue building and improving this project as we move into the job search. I still have so many ideas that I didn’t have time to implement in these two months. Right now our endpoints derive their data from CSVs, which, while it still works effectively, isn’t as optimal or production-friendly as reading from the Postgres database. I’d also like to modify the job data (or scrap it and source it fresh) so it can join our current city metrics table; it’s the only metric that isn’t consolidated, and having it shaped like our other city metric datasets would make predictions a lot easier. Speaking of predictions, I’d also like to source more data and try to improve on our current forecasts. Through exploration, I found that the fbprophet model deals pretty well with spread-out data, so for the population metrics I may go back to the decennial census to find what I need and see how it performs. Lastly, I’d like to look into a clustering approach, probably K-Nearest Neighbors, to fit a model that predicts cities similar to one chosen by the user. For example, maybe John has just graduated from college and is currently living in New York City. John loves New York but has lived on the east coast his whole life, so he’s looking to move somewhere similar to what he already knows and loves. This model would let him see other cities with similar demographics and metrics, instead of going through the advanced search and constantly tweaking his preferences. In terms of Lambda, this project is complete, but our team will continue churning out good work and improving our baby… I mean, web app.
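That similar-cities idea could be prototyped in a few lines with scikit-learn. Everything below, cities, features, and values alike, is made up for illustration, and the features are standardized first so no single metric dominates the distance:

```python
import pandas as pd
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Hypothetical per-city features; the real model would use our full metrics table
cities = pd.DataFrame({
    "city": ["New York", "Boston", "Philadelphia", "Phoenix", "Houston"],
    "population_density": [27000, 13900, 11700, 3100, 3600],
    "median_rent": [2900, 2500, 1600, 1300, 1200],
    "avg_temp_f": [55, 51, 55, 75, 70],
}).set_index("city")

# Scale first so density (tens of thousands) doesn't swamp temperature
X = StandardScaler().fit_transform(cities)
nn = NearestNeighbors(n_neighbors=3).fit(X)

# Cities most similar to New York (the first neighbor is New York itself)
_, idx = nn.kneighbors(X[[0]])
print(cities.index[idx[0]].tolist())
```

Unlike the advanced search, the user only has to pick a city they already like; the model does the preference-tweaking for them.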
This project has been a lot of fun and has been everything I hoped for throughout Lambda; Labs has been an amazing experience. It’s very hard to duplicate a ‘working in the real world’ setting, but Labs comes as close as you can get. Before this project, I didn’t really know how to work effectively in a cross-functional team and was very hesitant to speak up or ask for help. Now I feel much more comfortable in my role, in what is expected of me, and in communicating effectively with our front-end and DS teammates to derive solutions in a timely manner. My technical skills have drastically sharpened as well: I gained exposure to technologies I hadn’t worked with before (AWS Elastic Beanstalk, AWS RDS, pgAdmin) and developed skills I was still shaky on (Docker, Facebook Prophet and time-series forecasting in general, building and consolidating endpoints). This project has been instrumental in shaping me as a data scientist and preparing me for the next big step in my journey. I’m very proud of my teammates and what we’ve accomplished in two short months; I cherish the time we’ve spent together and look forward to updating you guys in the future with our finished product!