Bing Data Cleaner – My Zero Cost Stack – A Way for Beginners to Bootstrap Their Portfolio.

I thought this project would be just another small, few-week project like the others I have done, but I was wrong. It has become one of the most exciting projects I have ever worked on. It not only let me solve a problem in an interesting way at zero cost (no cost for hosting, database, or storage) but also serves as a good example for beginners to follow when building a portfolio for internship and job hunting.

If you are just curious about what problem I solved and how I built it with zero operating cost, skip straight to the PROBLEMS TO BE SOLVED section. If you are curious about why this project is beneficial for job hunting and portfolio building, keep reading as is.

Disclaimer: I was an Amazon intern and a Microsoft intern, and I have never worked for Google or IBM. The technologies chosen for this project were simply the most convenient ones for getting it done.


+ Create a real web application that is accessible worldwide through IBM's cloud computing service, IBM Bluemix.

+ Gain exposure to various aspects of the Java programming language and its ecosystem, such as the Maven build automation system and the Spring web framework.

+ Work with Amazon DynamoDB, a fully managed NoSQL database run by Amazon, at a time when companies large and small are adopting NoSQL technologies into their stacks.

+ Use Google APIs, specifically the Google Drive API and the Google Sheets API, to demonstrate your ability to work with complex systems and to increase the likelihood of catching the attention of employers already using such APIs internally.

+ Use AWS Lambda for a serverless architecture design, which many employers value.

+ Practice DevOps through continuous integration and by adding alarms for internal server errors, database issues, and API issues with Amazon CloudWatch and Amazon SNS.

+ Create a framework to expand your portfolio quickly.


Problem 1: There is a long list of company names, and the same company can appear multiple times. Unfortunately, since the list is human generated, a single company might appear under a number of name variations (abbreviations, legal names, common names, spelling mistakes, etc.).

Solution: I submit each company name to a search engine and use the top result links as the grouping criteria.
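A minimal Java sketch of the grouping idea: the search call below is a hypothetical stand-in (the real project queries Bing), but the bucketing logic is the same.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CompanyGrouper {

    // Hypothetical stand-in for a real search call; the actual project sends
    // the name to Bing and takes the URL of the top result.
    static String topResultUrl(String companyName) {
        // Crude normalization for illustration only; a real search engine
        // resolves name variants far better, which is the whole point of the trick.
        return "https://example.com/" + companyName.toLowerCase().replaceAll("[^a-z]", "");
    }

    // Bucket the name variants by the top link they resolve to.
    static Map<String, List<String>> groupByTopLink(List<String> names) {
        Map<String, List<String>> groups = new HashMap<>();
        for (String name : names) {
            groups.computeIfAbsent(topResultUrl(name), k -> new ArrayList<>()).add(name);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("Acme Inc", "ACME INC", "acme-inc");
        System.out.println(groupByTopLink(names));
    }
}
```

In practice, variants that no naive normalizer could match still land in the same bucket because the search engine returns the same top link for them.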

Problem 2: Problem 1 was solved with a small Java application and a few lines of bash script. However, that approach was not very user friendly, portable, or convenient.

Solution: I created a web application instead.

Problem 3: It turns out that getting the data from users is quite problematic when a data file is large, which is often the case:

Uploading a file over a single connection is unreliable, with a high error rate.

Uploading a file with a resumable session requires a lot of engineering effort.

The server must have enough disk space to store both the original data file and the cleaned data file, which increases the cost of the project.

Solution: Google Sheets is used as the medium to receive user data and store output data. All the data lives on Google Drive, so I can outsource the storage, reliability, and cost issues to Google.


A user starts by sharing a document with the service account, then visits the web application page and submits a form with the user's email and the URL of a source document on Google Drive. The backend server running on IBM Bluemix reads the given source document, queries Bing, and writes the cleaned data to a destination document. When the job is done, the backend server asks Gmail to send the user a notification email with an access link to the destination document.
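The flow above can be sketched as a small orchestration class. The SheetClient and Mailer interfaces below are hypothetical stand-ins for the Google Sheets API wrapper and the Gmail notification step; only the pipeline order is taken from the description.

```java
import java.util.ArrayList;
import java.util.List;

public class CleaningJob {

    // Hypothetical wrapper around the Google Sheets API.
    interface SheetClient {
        List<String> readColumn(String sourceUrl);
        String writeColumn(String sourceUrl, List<String> rows); // returns destination URL
    }

    // Hypothetical wrapper around the Gmail notification step.
    interface Mailer {
        void send(String to, String subject, String body);
    }

    private final SheetClient sheets;
    private final Mailer mailer;

    public CleaningJob(SheetClient sheets, Mailer mailer) {
        this.sheets = sheets;
        this.mailer = mailer;
    }

    // Stand-in for the Bing-backed cleaning step described earlier.
    String clean(String companyName) {
        return companyName.trim();
    }

    // Read the source document, clean every row, write the destination
    // document, then email the user a link to it.
    public void run(String userEmail, String sourceUrl) {
        List<String> cleaned = new ArrayList<>();
        for (String row : sheets.readColumn(sourceUrl)) {
            cleaned.add(clean(row));
        }
        String destUrl = sheets.writeColumn(sourceUrl, cleaned);
        mailer.send(userEmail, "Your cleaned data is ready", "Result: " + destUrl);
    }
}
```

Keeping the storage and mail steps behind interfaces is also what makes the pipeline testable without touching Google's services.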


Java application on IBM Bluemix

Spring as the web framework

Maven to build and manage dependencies

DynamoDB as metadata storage

Amazon CloudWatch to run AWS Lambda on a schedule, monitor DynamoDB usage, and alert on application issues

Google Drive and Google Sheets to receive user-provided data and output the result data

Google Apps Script as a thin wrapper around some Google Sheets and Gmail features


There are two main data flows in the figure below: the numbered flow is the main data flow, and the lettered flow is the data cleaning flow.





As of September 23, 2016

IBM Bluemix

You will have 375 GB-hours after the trial period.

I chose this over Heroku because, with Heroku, your app is shut down after 30 minutes of inactivity.

Google API

No cost, within a quota on the number of calls per 100 seconds.

Amazon Web Services: Free tier

AWS Lambda: Does not expire at the end of your 12 month AWS Free Tier term.

Amazon DynamoDB: Does not expire at the end of your 12 month AWS Free Tier term.

Amazon SNS: Does not expire at the end of your 12 month AWS Free Tier term.

Amazon CloudWatch: Does not expire at the end of your 12 month AWS Free Tier term.



Nowadays, technological developments have made our lives more comfortable than ever before. The big question of what the most significant scientific innovation or social discovery is remains challenging, and many people may answer electricity, light bulbs, computers, the Internet, or the theory of evolution. People, however, should realize that games and the creation of the game industry have made many important contributions to humanity since the very beginning. In other words, humans have gained a lot of social and technological benefits from games.

Taking social benefits into account, games have solved different types of problems challenging not only experts but also governments. First of all, games are activities mainly for entertaining people, which means that games help people release modern life's tensions, prevent stress, and increase labor productivity. In the medical sector, according to the Games for Health Conference held in Boston in 2009, games improve not only the balance of patients with neurodegenerative diseases but also the way people interact with others. With so many applications, the game industry is also a multi-million-dollar industry attracting the attention of many entertainment corporations. Additionally, as computing and graphics power increases, the number of staff and developers needed to address the ever-increasing complexity also rises. Thus, another contribution of the game industry to the community is that fewer white-collar workers are laid off during economic downturns, so governments feel relieved about unemployment rates.

With respect to technological benefits, there are many more obvious achievements thanks to the development of games. In fact, it was games in general, and dice in particular, that led to the first recorded abstract ideas about the mathematical study of probability and counting, a very important subject in modern days. However, in order to understand how the game industry has changed our lives, modern achievements must be examined carefully. Take soccer stadiums as an example: because of the dramatic increase in fans, engineers had to develop better technologies to build much larger stadiums. Surprisingly, those technologies were afterward applied to multi-story buildings around the globe to meet the demand for accommodation in megacities. Furthermore, everybody knows gambling, but few people realize that behind those games are sophisticated cheating-prevention systems, which can easily be applied as effective crime-detection systems. The game industry has also changed computer graphics technology. Computer games began with low-resolution titles such as Mario, but the high demand for better entertainment eventually led to DirectX version 11, which supports three-dimensional games with fascinating effects.

In conclusion, even though some people say that playing games is just entertainment, the truth is that games play a vital role in human development. Because of both the social and technological contributions of games in particular and the game industry in general, the answer to the big question must be games, all types of games.

M.K.M 🙂

I am posting it here on behalf of a team of 3 people.

Review Board – UCOSP 2016 – Recap


It is time for another recap! It has been an honor to be a part of the Review Board (RB) team for the last 4 months. The journey was not much longer than any other adventure I have had before, yet there are numerous things to talk about. After all, this was my very first time working on an important feature for the next release of an open source project.

Early start – early issue with an error on first trial

As an RB user myself, I was very excited to be a part of the team and tried to follow the instructions to set up the development environment during the Winter break, and that was when I discovered my first issue. RB has Djblets as a dependency, and both repositories are maintained by RB core developers. A core developer had bumped the required Djblets version in the RB repository without updating the Djblets version information, which resulted in a compilation error while building RB. Since the error message only indicated a missing dependency, I did not know which step I had missed. Also, when I was at Microsoft, no commit could land if it broke the build, so I had assumed that the code in the master branch must always be compilable. I ended up digging through the commit logs and found the cause later on. Even though the issue was resolved quickly without the need to create an issue ticket on GitHub, a nightly build and a commit-guarding system would have prevented it and made junior developers' onboarding experience happen with fewer hiccups.

Virtual machine (VM) all the way with Vagrant

There is no mandatory development environment, so I chose to use a VM because it ensures my development environment is clean and closely resembles the deployment environment. Despite those advantages, developing in a VM has its own issues when it comes to editor and IDE selection. That was when I discovered Vagrant, which enables an almost seamless Unix development experience on Windows. Vagrant gives you the ability to edit your code in Windows with a Unix terminal ready for action and makes sure your local folder is completely in sync with a folder inside the VM. As a result, you can use any editor and IDE to edit your code, as well as a terminal through PuTTY or Cygwin for shell commands. Personally, I think this is the best of both worlds: I can use all of Windows' advantages without losing any development power. Note that Vagrant is also available on Linux and Mac as a self-contained development environment. Visit the Vagrant website to learn more. Even though HashiCorp, the company behind Vagrant, is working on Vagrant's successor, Otto, Vagrant is still a battle-hardened product.

Unfortunately, there was a small issue when I started. A bug in Vagrant made it impossible to build the development environment using Ansible. It is a well-known bug in the latest release of Vagrant and will be patched in the next release. In a situation like this, the power of open source is clearly demonstrated. Following some instructions on the Internet, I went to the Vagrant installation directory and changed a few lines of Ruby code, which fixed the bug. Without the ability to dive deep into the code base of the product, I would have had to either choose a different development method or postpone the project until the next release of Vagrant.

OAuth2 protocol – From the other side

My project was to implement OAuth2 authorization support for the RB web API so that external services can use OAuth2 as a mechanism to obtain authorization and use the RB web API.

OAuth2 is an open authorization protocol that allows users to share private resources without revealing all of their identification data. OAuth2 client applications are external services or programs that want access on behalf of users. A typical use case of the OAuth2 protocol is logging in to a third-party application through a Google or Facebook account, or executing a task on a user's behalf.

Since OAuth2 is not the main topic here, I will only briefly describe the basic flow of the OAuth2 protocol in the RB context. First, a developer must register a new application on RB with a name, an ID, a secret, a client type, an authorization type, and a redirect URI. When a user wants the developer's application to execute certain tasks on the user's behalf, the user goes (or gets redirected by the developer's application) to RB with the specific client ID, redirect URI, authorization type, and a scope (which limits what the developer's app can do) and gets authenticated. RB asks the user for a final confirmation before sending the access code to the developer's application. With the access code, the client application can use its client secret to obtain a token from RB. Subsequently, the client application can use that token to act on behalf of the user.
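The flow can be illustrated by the two requests a client application constructs. This is a sketch in Java following the standard OAuth2 authorization-code grant; the endpoint path and scope name are assumptions, and RB's actual URLs may differ.

```java
public class OAuth2FlowSketch {

    // Step 1: the URL the user is sent to for authentication and confirmation.
    static String authorizationUrl(String rbBase, String clientId,
                                   String redirectUri, String scope) {
        return rbBase + "/oauth2/authorize/"
                + "?response_type=code"
                + "&client_id=" + clientId
                + "&redirect_uri=" + redirectUri
                + "&scope=" + scope;
    }

    // Step 2: the form body the client app POSTs (with its secret) to trade
    // the access code for a token it can use on the user's behalf.
    static String tokenRequestBody(String code, String clientId,
                                   String clientSecret, String redirectUri) {
        return "grant_type=authorization_code"
                + "&code=" + code
                + "&client_id=" + clientId
                + "&client_secret=" + clientSecret
                + "&redirect_uri=" + redirectUri;
    }

    public static void main(String[] args) {
        // "reviews:read" is a made-up scope for illustration.
        System.out.println(authorizationUrl("https://rb.example.com", "my-app",
                "https://app.example.com/callback", "reviews:read"));
    }
}
```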

Balance between doing it yourself and outsourcing

Of course, I did not implement the entire OAuth2 protocol from scratch, because it is obviously unwise to reinvent the wheel, especially when it comes to security features. My project relied heavily on the Django OAuth Toolkit project, which provides all of the necessary models, endpoints, forms, and logic. Although the package conveniently has its own client app listing page, registration page, and delete confirmation page, those pages are not user friendly enough for a final product. During the first three weeks of the project, I was forcing those pages onto RB, which resulted in a very bad user experience with a large number of page reloads and a confusing workflow for users. Hence, I decided to recreate all OAuth2 management pages and APIs, not only to make the feature a real part of RB but also to remove unnecessary steps and redundant page reloads. I also had to implement the authorization mechanism to map the right scope to the right resource and follow RB's original design. Fortunately, I managed to reuse the authorization logic to issue and revoke tokens. I am not 100% confident in my decision; only time can tell.

Moving away from Continuous Integration

Review Board has multiple branches being developed at the same time. A contributor makes local changes and submits code review requests. After requests are approved, the core team is responsible for merging them into branches depending on the nature of the patches. The RB core team is quite strict about what code gets landed, so all of my changes sat as pending requests for the whole 4 months. This is the complete opposite of the continuous integration (CI) mentality, and it makes merge issues unavoidable. Although RB is a website with live services, at its core it is packaged software: continuous integration does not work here at all. This was my very first time working with a new development model and truly understanding terms like "integration hell".

Demo video – Practice makes perfect

I am not new to amateur movie making, live demos, and presentations. I thought demo videos could be done easily, but without direct interaction with the audience the task is actually quite different. As my focus was on the smoothness of the demo, it was impossible to make it in one continuous run. In the end, I developed a method of recording small scenes, some even less than 10 seconds long, and then merging all scenes into one continuous demo video. With some practice, I reduced my demo-making time from 8 hours for the first demo to only 3 hours for the second one.

Too small to talk about – too big to not mention

Even though using Vagrant gives you the best of both worlds, it also introduces issues related to subtle differences between Windows and Unix-like systems. In Windows, a newline consists of 2 bytes, a carriage return (CR) followed by a line feed (LF), whereas Unix-like systems use only one newline character: LF. This subtle difference gets picked up by Git and mistaken for changes, or prevents bash from directly executing a file that starts with #!. dos2unix can be used to correct these issues, yet the root cause remains.
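The difference is easy to see in code. Here is a minimal sketch (in Java, to keep one language throughout these posts) of the normalization that dos2unix performs; in a real checkout, configuring Git's core.autocrlf or a .gitattributes file addresses the root cause instead.

```java
public class LineEndings {

    // Convert Windows CRLF ("\r\n") line endings to Unix LF ("\n"),
    // which is what dos2unix does to a file.
    static String toUnix(String text) {
        return text.replace("\r\n", "\n");
    }

    public static void main(String[] args) {
        String windows = "#!/bin/sh\r\necho hello\r\n";
        // The shebang line only works once the stray CR is gone.
        System.out.println(toUnix(windows).equals("#!/bin/sh\necho hello\n")); // prints true
    }
}
```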

Given that I have spent most of my time in the service world, the concept of a dumb front end, with all of the logic handled by models and controllers, is not new to me. However, when I was implementing the OAuth2 client management pages by trying to copy and modify the existing code base, I ended up with a Frankenstein approach that strayed far from that principle. Thanks to a discussion with a mentor, I adjusted my implementation to follow the principle and got a shorter, cleaner piece of code. From this small incident, I learned that you can sometimes get lost in a giant code base and forget the design principles that would make your life much easier.


The journey has ended for UCOSP Winter 2016, but it is just the beginning of being a part of RB in particular and the open source community in general. Once the OAuth2 authorization feature is a part of RB 2.6, it will always be under my watch. I hope I will have time for RB after my graduation so that OAuth2 authorization is not the only feature that I contribute.

Microsoft Intern Summer 2015 recap

This is the end of, probably, my last internship. The internship was just 12 weeks this time, but it gave me quite a few interesting new experiences.

A journey to versioning

To begin, I will start with a technical challenge. I was given the task of creating a new object as part of the current API, but with the object's versioning decoupled from the API versioning. There are several approaches available, and some of them are already part of the code base.

  1. The 1st option: Every version is a separate class. This is somewhat similar to the current API versioning approach. Sample C#:
    1. Pros
      1. Obviously this is the simplest model. The object can either depend on earlier versions or be completely independent.
    2. Cons
      1. I am not sure how we can support multiple versions at the same time with this approach.
      2. There is no higher abstraction over the object. Therefore, business logic code always needs to know the currently supported version, even when it does not need that information. This results in a lot of refactoring whenever a new version of the object is used.
  2. The 2nd option: Have a base class that contains the common properties, and versioned classes that are children of the base class. Sample C#:
    1. Pros
      1. Versioning can now be achieved by checking the type of the object.
      2. We now have an abstraction over the object, so the business logic code does not have to care about the object's version when it does not matter.
    2. Cons
      1. One major issue is the strong dependency on the base class. To support both the old and new versions at the same time, some unused properties from the base class may remain accessible from the later version.
      2. Another issue is that there is no structure supporting the separation of consuming logic for each version of the object (it is all ad hoc). As a result, a big if/else or switch statement is required when consuming the object. Depending on future developers' discipline, the object-consuming logic might also end up scattered around due to this lack of supporting structure.
  3. The 3rd option: Have an almost empty base class, with versioned classes as children of the base class, and embrace the visitor pattern to handle versioning. Sample C#:
    1. Pros
      1. As in the second approach, there is an abstraction over the object that simplifies the business logic code and reduces refactoring effort when a new object version arrives.
      2. Versioning can be achieved as well, but with an approach that avoids big switch statements and forces developers to keep all consuming logic close together.
      3. Similar to the 1st approach, dependencies between versions of the object are optional, which allows much more flexibility and a cleaner implementation.
    2. Cons
      1. Obviously, this approach is not very straightforward: it requires some extra effort to understand and to figure out how to use effectively.
      2. The need for a consumer class, and the limited interaction between the consumer and the outside world, is a big limitation of the approach.
      3. Writing test cases involving the consumer can be challenging.

In the end, I decided to go with the third approach because its pros support operations in the long run: the code base will be much easier to maintain, with only minor overhead when creating a consumer. Moreover, I also believe this design embraces microservice design: a consumer class can be scaled out into a service, and an object version can be scaled out into an API version.
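Since the original C# samples are not included here, a minimal Java sketch of the third option (all names invented for illustration): the versioned objects dispatch themselves to a consumer, so every piece of version-specific logic sits side by side in one class instead of in scattered type checks.

```java
public class VersionedObjectDemo {

    // Almost empty base class; versions dispatch themselves to a consumer.
    abstract static class VersionedObject {
        abstract void accept(Consumer consumer);
    }

    static class ObjectV1 extends VersionedObject {
        String name;
        void accept(Consumer c) { c.visit(this); }
    }

    static class ObjectV2 extends VersionedObject {
        String name;
        String region; // hypothetical field added in V2
        void accept(Consumer c) { c.visit(this); }
    }

    // All version-specific consuming logic lives here, side by side,
    // instead of being scattered across if/else or switch statements.
    interface Consumer {
        void visit(ObjectV1 v1);
        void visit(ObjectV2 v2);
    }

    static class Describer implements Consumer {
        String description;
        public void visit(ObjectV1 v1) { description = v1.name; }
        public void visit(ObjectV2 v2) { description = v2.name + " (" + v2.region + ")"; }
    }

    public static void main(String[] args) {
        ObjectV2 obj = new ObjectV2();
        obj.name = "sample";
        obj.region = "west";
        Describer d = new Describer();
        obj.accept(d); // double dispatch picks the V2 overload
        System.out.println(d.description); // prints "sample (west)"
    }
}
```

Adding an ObjectV3 forces a compile error in every Consumer until its logic is written, which is exactly the structure the second option lacked.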

Demo god is a bitch

I found out the hard way that Murphy's Law, "Anything that can go wrong, will go wrong," really holds. The first 10 minutes of the demo went smoothly with the slides. However, when we got to the live demo, issues started popping up. At first, signing out was enough to solve the problem, but the issues escalated to the point where we could not even access the machine remotely. That was not the only problem: after we brought the server up, my laptop could not connect to run a shared file on the system. Only after everyone had left the room did I manage to get the demo working. I guess from this point onward, a video recording of the demo is always a must as a backup.

Other minor things

Apart from the major lessons and interesting exposure, I also picked up a new way to organize all my knowledge in OneNote along the way. I learned that when you mock something, you should make the mock as realistic as possible: only very near the end did we realize that was needed to impersonate correctly, due to extra hydration. Exceptions in multithreaded code can cause confusion when debugging: you are debugging thread A, but an exception in thread B will halt thread A's execution, making you think the problem is in thread A.


I got back right before Microsoft released Windows 10: very interesting and full of time pressure, especially since my code would be in production. It really put what I had learned in the last year to the test. Despite some hiccups along the way, in the end we got the feature up and running; the internship was a success in every aspect!

Uncle Ben’s toys – 2015 Christmas Season Store Design


We are Uncle Ben's toys! Our company focuses on providing high quality toys for kids under 7 years old. The webstore is designed to be child friendly, yet trustworthy enough for parents to make purchases.



The company logo is designed to be kid friendly, so that a child who has not yet learned to read should be able to recognize the green cartoon face, even from far away. The logo uses a tetradic color scheme to attract the attention of both young children and parents, and Comic Sans MS is used for the banner above the cartoon face to imitate a child's handwriting. However, the main focus of the logo is still the cartoon face, so the cartoon face alone is still considered the company logo. The full logo is designed so that a new customer can easily grasp the main idea of the business, but the cartoon face is what we want people to be familiar with.

Similar to the logo, the navigation bar and side menu also use the tetradic color scheme. The color scheme helps the webstore achieve a unique amount of visual balance while stimulating young children's eyes. This color scheme selection plays an important role in creating a playful sensation for the store.

The list of categories consists of both text and pictures, so that children who have not yet learned to read can still recognize a category. We expect those pictures to grab kids' attention and curiosity, so that either the parents or the kids themselves will open one of those categories for further exploration. The pictures are chosen to be simple enough to make the page look clean, but complex enough to convey the idea.

Segoe UI is chosen to be the main typeface for the store page. The typeface is used to provide a friendly yet professional webstore to parents so that parents can be confident in making purchases here.

The webstore background has a gradient from high to low saturation of yellow. The goal is not only to avoid the traditional corporate feeling of a white page, but also to provide a warm feeling similar to a Christmas fireplace.

Since we are designing the store landing page for Christmas, the season for buying gifts, the spotlight image clearly indicates a 25% sale program, which should attract parents' attention when they are looking for Christmas presents and draw them further into our webstore. The dark red color of the spotlight image suits the site the most because it is the main color of the season and contrasts strongly with the light yellow.


Project Soli for recording guitar tabs


Task description: When writing new songs or coming up with a new piece of music, guitar artists want to capture what they came up with quickly, because it is easy to forget. Sometimes it is also important to record the hand gestures on the fingerboard so that not only the artist but also someone else can easily replicate the piece of music.

Objective: Quickly record guitar tabs before guitar artists forget what they came up with.

Current methods:

Taking notes

When an artist wants to record a guitar tab, the artist can grab pen and paper and note it down.

Issues: The act of grabbing pen and paper is very distracting; it can disrupt the creative flow. A guitar artist sometimes has to put the guitar down properly before being able to write anything. Therefore, taking notes might be the simplest way, but it is very time consuming and distracting.

Sound recording and sound detection software

A guitar artist can record a piece of music and then feed the recording to software. The software can then listen to the song and try to produce the sheet music.

Issues: With current technology, the accuracy of converting sound to music notes is the biggest limitation. In fact, no popular software has decent accuracy for sound-to-note conversion. This method also requires a good recording environment with little or no background noise. Also, a guitar can produce the same sound using different fingerings, so recording a guitar tab with this method is very unreliable.

Embedding sensor into guitar

A guitar can be modified to have touch sensors or buttons on fingerboard to record the tab.

Issues: This is considered a very intrusive method, since it might damage the guitar, especially ones made by high-profile guitar builders. The method also requires a large number of sensors, so a guitar that can record music tabs becomes very expensive.

Guitar structure: This can be used for reference of guitar terminology


The solution consists of:

  • Soli recording device: two Project Soli sensors built into a single device for higher accuracy.
  • Companion software to interpret the data.

How to record guitar tabs with the Soli recording device:

  1. One-time setup

    This step only needs to be done once.

    The guitar artist downloads and installs the companion software.

  2. Set up and connect the Soli recording device

    Put the Soli recording device at roughly the same height as the guitar.

    Connect the Soli recording device to the device running the software.

  3. Calibration

    Press the calibrate button on the Soli recording device.

    Pick up the guitar and sit in front of the Soli sensor so that the rosette faces one sensor.

    Wave a hand above the rosette of the guitar for the sensor to calibrate the hand position. This action lets Project Soli's sensor know where the rosette is, so that later on the Soli recording device can track the artist's hand movements at the rosette. The software will indicate when the sensor is calibrated.

    Drag a finger from the top of the fingerboard down to the body of the guitar so that the second sensor can calibrate. This action lets Project Soli's sensor know where the fingerboard is, so that later on the Soli recording device can track the artist's hand movements on the fingerboard. The software will indicate when the sensor is calibrated.

  4. Play

    Start playing. Since Project Soli's sensors can now monitor hand gestures, the Soli recording device will record all guitar tabs into the software.

  5. Save and share

    When done, save the recording to disk and share it with friends.


High temporal frequency

High positional accuracy

Lightweight and portable

Since Project Soli sensors can respond quickly to any gesture with high accuracy, the Soli recording device can correctly record all of an artist's moves while playing. Artists can record any piece of music without pausing or putting down the guitar. Hence, the Soli recording device is a non-disruptive aid to an artist's creative flow.

Because there is a calibration step and the ability to accurately detect hand location, the Soli recording device can be used with different types of guitars without damaging them. Project Soli's sensors are also small and compact, so it is easy to move the device around for different locations and purposes (ranging from studio recording to live on-stage recording).

The bad, the better, and the might-be-best of calendar apps



With today's busy lives, it is very difficult to accurately memorize all appointments and plans, so a calendar is a necessary tool for time management. A lot of people manage their time through a phone app, specifically on a smartphone. Users want a simple way to check their schedule and set up new appointments. They are also looking for a good method to plan their activities and maintain a good work-life balance.



The Microsoft Outlook app is a personal information manager running on Android that includes a calendar. The app has been downloaded more than 10 million times from the Google Play store. Since the Microsoft Outlook app is made by Microsoft Corporation, an American multinational technology company headquartered in Redmond, Washington, it is well integrated with the Microsoft Office suite. Hence, most users of the Microsoft Outlook app also use other Microsoft products.


After opening the calendar, users can choose to see the appointments in two views: in chronological order or a list of appointments with their start and end times. To switch from one view to the other, users have to tap on the top right corner and select either “Day” (chronological order) or “Agenda” (list). In the “Day” option, users can view appointments for a specific day (with default being “today”), or appointments for groups of three consecutive days (in landscape mode). The “Agenda” option shows all appointments in the calendar, but only a limited number of those can be displayed depending on the screen size.


The “Agenda” mode serves well as a laundry list of appointments. However, it is visually challenging for users to get an overview of how their schedule would look in the near future. It takes quite a lot of time (for certain users, it may be impossible) to figure out on which days they would have a lot of free time to perhaps squeeze in a few more activities, or on which days they would have to run errands. As a result, it would be inefficient (and perhaps ineffective) for users to plan their activities.



Screenshots of Microsoft Outlook app, the calendar


“Day” mode:


“Agenda” mode:




Switching from “Day” to “Agenda” mode:



Switching from “Agenda” to “Day” mode:





Google calendar app is also available for download on Google Play store. There are already more than 100 million downloads for Google calendar. The app is the default calendar app for most Android phones. Google calendar is developed by Google Inc., an American multinational company specializing in Internet-related services and products.


By default, the app shows all the appointments in the week in chronological order. Users can switch between different modes by tapping the top left corner and selecting from the list of available options. Users can view their appointments for a specific day, a group of three consecutive days, the whole week, or the whole month.


Since the app gives users much more flexibility in how they view their schedule, they can use it effectively to plan their activities. That is, they can be reminded of appointments in the near future to avoid missing any of them, while also planning ahead for activities that will not occur for weeks or months. Ultimately, they can easily visualize their schedule over an extended period of time. As a result, users know when they can squeeze in a meeting without risking missing it because of a short break between that meeting and the preceding appointment.


Screenshots of the Google Calendar app


Select display mode:



View “Schedule”:


View “Day”:



View “3 Day”:



View “Week”:

View “Month”:





Although the Google Calendar app already provides a large set of options, a smooth, continuous, and natural transition between the different modes is still missing. Users still have to open an options menu to select a view of their calendar.


The app might support users better if they could use the zooming gesture to switch between modes. Normally, a person “zooms in” to get more details or focus on a specific matter and “zooms out” to achieve the opposite effect. Therefore, users would naturally zoom in to view the detailed schedule of a specific day or group of days, and zoom out to get a bigger picture of the current week or month, without a disruptive menu selection.
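The zoom-to-switch idea could be sketched as a simple mapping from the pinch gesture’s cumulative scale factor to a view mode. This is a hypothetical sketch: the class name, thresholds, and mode names are invented for illustration and are not taken from either app.

```java
// Hypothetical sketch: map a pinch gesture's cumulative scale factor to a
// calendar view mode. Thresholds are invented; a real app would tune them
// and feed in the scale factor from its gesture detector.
public class ZoomViewSelector {

    public enum ViewMode { MONTH, WEEK, THREE_DAY, DAY }

    // scale < 1.0 means the user has zoomed out; scale > 1.0 means zoomed in.
    public static ViewMode modeForScale(double scale) {
        if (scale < 0.75) return ViewMode.MONTH;     // far zoomed out: big picture
        if (scale < 1.25) return ViewMode.WEEK;      // near neutral: default week view
        if (scale < 2.00) return ViewMode.THREE_DAY; // zoomed in a bit
        return ViewMode.DAY;                         // fully zoomed in: full details
    }
}
```

With such a mapping, zooming out walks the user continuously from Day toward Month, and zooming in does the reverse, with no menu in between.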



The design of a calendar app should focus on giving users the ability to quickly go through specific events as well as to see a high-level overview of their calendar. A calendar app should also be easy to navigate and intuitive for non-tech-savvy users.

Amazon Intern 2014-2015 recap

So, another internship is now over. It is not the end of a year; in fact, it is just the beginning of a new one, but it is the end of a journey. A journey that changed me for good.

The journey started in the last week of September, 2014. It wasn’t easy just going back to work like that; I had had way too much fun for most of September in Canada. It had to start, however; there was no way out for me.

The internship started!

Even though I was told that most of the development would be done on an Ubuntu box, I was still surprised by the fact that the real development happens inside a Red Hat distribution running in VirtualBox on that Ubuntu box. I felt trapped in such a hybrid environment; a lot of useful keyboard shortcuts were suddenly useless. “Have to suck it up” was my mentality for a very long time. But before you know it, you get used to it; after a while, it becomes normal. I still find developing inside a VM somewhat inconvenient, but it is no longer a big problem.

Testing! Unit tests! Integration tests! Tests everywhere. I believe I wrote as much test code as business code. Writing tests right after the application logic feels very weird; it is like asking yourself a question out loud every time you say something. Tests, however, are not really for the present; they are investments in the future. Those tests will be the only way for you to keep moving fast without breaking anything. Not only do you have to write tests, you have to test the right things: test what can go wrong, not what will go right. I guess this picture captures the idea perfectly:

SUCCESS: 26/26 (100%) Tests passed

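The “test what can go wrong” idea can be sketched with a tiny, invented parser: the happy path is the least interesting test, and most of the tests should target malformed input. The class and its behavior are hypothetical, purely for illustration.

```java
// Hypothetical example: a parser for durations like "90m" or "2h".
// The interesting tests are the failure cases (null, garbage, bad unit),
// not the single happy path.
public class DurationParser {

    /** Parses "90m" or "2h" into minutes; throws on anything else. */
    public static int toMinutes(String s) {
        if (s == null || s.length() < 2) {
            throw new IllegalArgumentException("too short: " + s);
        }
        char unit = s.charAt(s.length() - 1);
        int value = Integer.parseInt(s.substring(0, s.length() - 1)); // throws on garbage
        if (value < 0) {
            throw new IllegalArgumentException("negative duration: " + s);
        }
        switch (unit) {
            case 'm': return value;
            case 'h': return value * 60;
            default:  throw new IllegalArgumentException("unknown unit: " + s);
        }
    }
}
```

A test suite for this method that only checked `toMinutes("90m") == 90` would pass while the code silently mishandled bad input; the failure-case tests are what let you refactor without fear.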

Testing is not the only thing that slows down the development process. Code reviews (a.k.a. CRs) are a huge pain. You not only have to do them right for your own benefit, but also have to prepare them well for the reviewers. In the beginning, my commits and CRs were huge; I guess that was the bootstrapping period, and part of it was my own fault for trying to achieve perfection with EmberJS. I then cut my commits down to individual functionalities, and my CR queue started to grow. The queue reached 10 CRs within a single day. Together with my mentor’s busy schedule, the queue stayed big for more than a week. Surprisingly, my mentor then spent just one Friday going through all of them.

Dealing with multiple commits means multiple branches and tons of rebases; together with a week-long code review cycle, I unconsciously adopted the model of having 1 to 3 CRs per week to minimize the number of rebases I needed to do. This bad CR habit and bad planning produced one huge CR per week, with a large amount of code getting thrown away. It took almost a month to finish a single card, “get data from the backend”, because that card covered the model layer, the translation from business models to API models, and the client layer. I should have done what my mentor told me afterward: send a CR on exactly what you want reviewed, and do it small and early to avoid wasting effort and to get early feedback. I also got another recommendation: cut down CR size and forget about compilable code. It was a good idea at first (I got tons of ship-its, though the reviewers might have gone easy on me), but then it showed its weakness: if a CR needs a change, I have no idea whether the fix works, because I cannot compile the whole thing. Near the end, the project was pretty much done, so it was mostly about add-on features; my CRs started to be small and compilable. I guess I now have a good feeling for how to do good CRs.

Through this lengthy CR process, we dealt not only with code quality problems but also with architecture design problems. I learned that there is a clear separation between services dealing with machines and services dealing with humans. Even though data duplication and latency might be problematic, they should be much more acceptable than a costly one-size-fits-all solution, and those problems can be solved with messaging systems and event-driven programming. Never before had I implemented a front end (in EmberJS) completely separated from the backend: the only way for them to communicate is through the public API. DynamoDB, fully managed by the AWS team, was used to store persistent data; SQS was our messaging system, and CloudSearch handled all search operations. They are all separate services. Surprisingly, this is the micro-services model, and it fits the Unix philosophy: write programs that do one thing and do it well; write programs that work together; write programs that handle text streams, because that is a universal interface. Having small services keeps the number of components with different access patterns low, which makes scaling much easier and avoids race conditions at a large scale. Even the Java code itself is written with micro-services in mind; there are quite a number of interfaces for the main operations: model translation, authentication, access control management, and database management.
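The “small interfaces for the main operations” idea might look something like the sketch below. All names and behaviors are invented; the point is only that each concern gets its own narrow interface, so an implementation can change, or move behind a separate service, without touching the others.

```java
// Hypothetical sketch of narrow, per-concern interfaces.
// Each interface does one thing; implementations are swappable.
public class ServiceInterfaces {

    /** Translates a business-layer model into an API-layer model. */
    interface ModelTranslator<B, A> {
        A toApi(B business);
    }

    /** Decides whether a user may read a resource. */
    interface AccessControl {
        boolean mayRead(String user, String resource);
    }

    // Trivial in-memory implementations, just to show the wiring.
    static class UpperCaseTranslator implements ModelTranslator<String, String> {
        public String toApi(String business) {
            return business.toUpperCase();
        }
    }

    static class OwnerOnlyAccess implements AccessControl {
        public boolean mayRead(String user, String resource) {
            // Only the owner's own namespace is readable in this toy policy.
            return resource.startsWith(user + "/");
        }
    }
}
```

In a micro-services setup, each of these interfaces could be backed by a remote call (to a translation layer, an auth service, and so on) instead of an in-memory class, without the calling code changing.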

Talking about technology, I did have an interesting time getting started with EmberJS. At heart, EmberJS is a single-page application framework. This means there is very minimal page loading while using the app, and it seems that EmberJS loads all the templates only once at the beginning. Most of the tutorials, hence, instruct you to put every template in the index.html page, which was outrageous to me: maintaining the application would be impossible. At first, I tried to use RequireJS to load the templates asynchronously, and it worked, at least for basic static pages without a lot of interaction. Unfortunately, this creative move hit a big wall: it is too hard to follow, and it destroys the point of using Ember, convention over configuration. As soon as my first huge CR was reviewed, I got a request to simplify the model. Getting started with EmberJS cost me almost a week, and once again the keep-it-simple-stupid principle proved its value. I ended up with legacy blocking XMLHttpRequests to load all the templates when the page first loads. In essence, apart from making a few more HTTP requests and having a much more maintainable piece of code, everything should be pretty much the same with this approach. I know there is Ember CLI, which will combine all the templates into an index.html file, but Amazon’s build system has not integrated it yet. I hope in the future I can do this in a less hacky way.

Throughout my time at Amazon, besides core technical skills and processes, I got exposed to the business side of company operations. I learned that variable naming is a very hard communication problem: names need to be both concise and descriptive. I learned that technology is just an add-on for business: developing a cool new technology without business value is still a useless contribution for now. I learned that even tiny changes to a presentation can make it much more professional: uniform font sizes, and no pictures that do not say anything. I learned that a good manager pays close attention to you; a good manager treats you as an individual, not a number.

It was an amazing journey! You might notice that I talked about testing and code reviews as things that slow down the dev process, and they do! However, they are what differentiate production code from experimental code. That said, it was nowhere near perfect; I wish I had more time to set up metrics and alarms. Amazon in general, and kap-adx and my mentors in particular, gave me an operationally oriented development process. It is about maintaining and improving in the future, not just a working product at the moment.

Leaders are owners. They think long term and don’t sacrifice long-term value for short-term results. They act on behalf of the entire company, beyond just their own team. They never say “that’s not my job.”

The Ownership principle, from Amazon’s leadership principles