I thought this would be just another small, few-week project like the others I have done, but I was wrong. It has become one of the most exciting projects I have ever worked on. It not only let me solve a problem in an interesting way at zero cost (no cost for hosting, database, or storage), but it is also a good example for beginners to follow when building a portfolio for internship and job applications.
If you are just curious about what problem I solved and how I built it with zero operating cost, go straight to the PROBLEMS TO BE SOLVED section. If you are curious about why this project is useful for job hunting and portfolio building, keep reading.
Disclaimer: I was an intern at Amazon and at Microsoft, and I have never worked for Google or IBM. The technologies chosen for this project were simply the most convenient for me to get the project done.
JOB HUNTING BENEFITS
+ Create a real web application that is accessible worldwide through IBM's cloud computing service, IBM Bluemix.
+ Gain exposure to various aspects of the Java programming language and the Java ecosystem, such as the Maven build automation system and the Spring web framework.
+ Work with Amazon DynamoDB (a fully managed NoSQL database run by Amazon) at a time when many companies, big and small, are adopting NoSQL technologies in their stacks.
+ Use Google APIs, specifically the Google Drive API and the Google Sheets API, to demonstrate your ability to work with complex systems and to increase the likelihood of catching the attention of employers who already use these APIs internally.
+ Use AWS Lambda for serverless architecture design, which has a lot of benefits for employers.
+ Practice DevOps through continuous integration and by adding alarms for internal server errors, database issues, and API issues with Amazon CloudWatch and Amazon SNS.
+ Create a framework to expand your portfolio quickly.
PROBLEMS TO BE SOLVED
Problem 1: There is a long list of company names, and the same company can appear multiple times. Unfortunately, since the list is human generated, a single company might appear under several name variations (abbreviation, legal name, common name, spelling mistakes, etc.).
Solution: I submit company names to a search engine and use the top links as my grouping criteria.
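To make the idea concrete, here is a minimal sketch of the grouping step, not the actual code from the project. The search engine call is replaced by a hypothetical `topLinkLookup` function, and reducing each link to its host name is my assumption about what makes a reasonable grouping key; the class and method names are made up for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;

class CompanyGrouper {
    /**
     * Groups raw company names by the top search link found for each name.
     * topLinkLookup is a hypothetical stand-in for the real search engine call.
     */
    static Map<String, List<String>> groupByTopLink(
            List<String> names, Function<String, String> topLinkLookup) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String name : names) {
            String key = normalize(topLinkLookup.apply(name));
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(name);
        }
        return groups;
    }

    /** Assumption: reduce a link to its host so different pages of one site match. */
    static String normalize(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            return url.toLowerCase(Locale.ROOT);
        }
        host = host.toLowerCase(Locale.ROOT);
        return host.startsWith("www.") ? host.substring(4) : host;
    }
}
```

With this sketch, "IBM" and "International Business Machines" end up in the same group as long as the search engine returns a link on the same site for both of them.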
Problem 2: Problem 1 was solved with a small Java application and a few lines of Bash script. However, that approach is not very user friendly, portable, or convenient.
Solution: I created a web application instead.
Problem 3: Getting the data from users turned out to be quite problematic, because the data file is usually large:
+ Uploading a file over a single connection is unreliable and has a high error rate.
+ Uploading a file through a resumable session requires a lot of engineering effort.
+ The server must have enough disk space to store both the original data file and the cleaned data file, which increases the cost of this project.
Solution: Google Sheets is used as the medium for getting user data and storing output data. All the data lives on Google Drive, so I can outsource the storage, reliability, and cost issues to Google.
A user starts by sharing a document with the service account, then goes to the web application page and submits a form with the user’s email and the URL of a source document on Google Drive. The backend server running on IBM Bluemix reads the given source document, queries Bing, and writes the cleaned data to a destination document. When the job is done, the backend server asks Gmail to send the user a notification email with an access link to the destination document.
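The flow above can be sketched roughly as follows. This is only an illustration under my own assumptions: `SheetStore` and `Notifier` are hypothetical interfaces standing in for the Google Sheets and Gmail calls, `topLinkLookup` stands in for the Bing query, and the shape of the output rows (the original name next to its top link) is a guess rather than the real output format.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Function;

class CleaningJob {
    /** Hypothetical wrapper around the Google Sheets calls. */
    interface SheetStore {
        List<String> readNames(String sourceUrl);

        /** Writes the rows to a destination document and returns its URL. */
        String writeRows(List<List<String>> rows);
    }

    /** Hypothetical wrapper around the Gmail notification. */
    interface Notifier {
        void notifyUser(String email, String destinationUrl);
    }

    private final SheetStore sheets;
    private final Function<String, String> topLinkLookup; // stand-in for the Bing query
    private final Notifier notifier;

    CleaningJob(SheetStore sheets, Function<String, String> topLinkLookup, Notifier notifier) {
        this.sheets = sheets;
        this.topLinkLookup = topLinkLookup;
        this.notifier = notifier;
    }

    /** Runs one job end to end and returns the destination document URL. */
    String run(String userEmail, String sourceUrl) {
        List<List<String>> rows = new ArrayList<>();
        for (String name : sheets.readNames(sourceUrl)) {
            // Assumed output shape: the original name next to its top search link.
            rows.add(Arrays.asList(name, topLinkLookup.apply(name)));
        }
        String destinationUrl = sheets.writeRows(rows);
        notifier.notifyUser(userEmail, destinationUrl);
        return destinationUrl;
    }
}
```

Keeping the external services behind small interfaces like these makes it possible to test the flow with in-memory fakes, without touching Google or Bing.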
+ A Java application running on IBM Bluemix
+ Spring as the web framework
+ Maven to build the project and manage dependencies
+ DynamoDB for metadata storage
+ Amazon CloudWatch to run AWS Lambda on a schedule, monitor DynamoDB usage, and alert on application issues
+ Google Drive and Google Sheets to receive user-provided data and to store the output data
+ Google Apps Script as a thin wrapper around some Google Sheets and Gmail features
There are two main data flows. One is numbered in the figure below and the other is lettered.
The numbered flow is the main data flow and the lettered one is the data cleaning flow.
API DESIGN – Check https://g2minhle.github.io/BingDataCleanerAPIDesign.html
TOTAL COST = $0
As of September 23, 2016
IBM Bluemix
+ You get 375 GB-hours of free usage after the trial period.
+ I chose it over Heroku because a free Heroku app is put to sleep after 30 minutes of inactivity.
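To get a feel for what 375 GB-hours buys, here is a rough calculation. It assumes the allowance is counted per month and that usage is measured as instance memory multiplied by running time; both are my assumptions, and the 512 MB instance size is only an example.

```java
class FreeTierBudget {
    /** GB-hours used by `instances` instances of `memoryMb` MB each, running for `hours` hours. */
    static double gbHours(int memoryMb, int instances, double hours) {
        return (memoryMb / 1024.0) * instances * hours;
    }

    /** True when the usage fits within the given GB-hour allowance. */
    static boolean fits(double usedGbHours, double allowanceGbHours) {
        return usedGbHours <= allowanceGbHours;
    }

    public static void main(String[] args) {
        double hoursIn31Days = 24 * 31; // 744 hours in the longest month
        double used = gbHours(512, 1, hoursIn31Days);
        // Prints: 372.0 of 375 GB-hours, fits: true
        System.out.println(used + " of 375 GB-hours, fits: " + fits(used, 375));
    }
}
```

Under these assumptions, a single 512 MB instance can stay up around the clock for a full month, while a 1 GB instance (744 GB-hours) cannot.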
Google APIs
+ No cost as long as you stay within the quota on the number of calls per 100 seconds.
Amazon Web Services: Free Tier, https://aws.amazon.com/free/
+ AWS Lambda: does not expire at the end of your 12-month AWS Free Tier term.
+ Amazon DynamoDB: does not expire at the end of your 12-month AWS Free Tier term.
+ Amazon SNS: does not expire at the end of your 12-month AWS Free Tier term.
+ Amazon CloudWatch: does not expire at the end of your 12-month AWS Free Tier term.