Welcome to Part 5 of an ongoing series with ShuttleCloud engineers discussing projects, bottlenecks, and whatever else is on their mind.
Below is an interview with Félix López, Senior Software Engineer for ShuttleCloud who lives in Salamanca, Spain.
At a Glance
Name | Félix López
Role | Senior Software Engineer
Background | Mobile games, financial software, hardware
I studied Computer Science during undergrad and received a Master's degree with a research focus on topics such as Fuzzy Logic, Neural Networks, and Multi-Agent Systems.
I started my career in web development before moving to software for the currency exchange market, where there were a lot of new security challenges. Later I spent 4 years building an IDE for developing games across hundreds of different mobile device OS variations. Before joining ShuttleCloud I spent 2 years working on applications with sensor networks, Arduino, ZigBee, and custom hardware. One example is an application that detects when major cities need street lighting based on ambient brightness.
What’s your role at ShuttleCloud?
I’m responsible for the platform, which is the core software that we use to migrate data. On a daily basis my job is to build the platform, and tasks range from integrating new endpoints with other services to designing the architecture for scalability.
How have you designed the platform?
We designed the platform so that it could work with any kind of resource.
Basically, the platform defines a protocol, and any endpoint that wants to be integrated has to fulfill it. These endpoints are responsible for retrieving data from a service and sending it to the platform. The platform, in turn, orchestrates how much data each endpoint handles and the pace of the system. In essence we can have as many endpoints as we wish; put simply, data is taken from a source endpoint and migrated to a destination endpoint.
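A minimal sketch of what such a protocol could look like. The `Endpoint` interface and method names here are illustrative, not ShuttleCloud's actual API:

```python
from abc import ABC, abstractmethod

class Endpoint(ABC):
    """Hypothetical endpoint contract: any service that integrates
    with the platform fulfills this protocol."""

    @abstractmethod
    def fetch(self, batch_size: int) -> list:
        """Return up to batch_size items from the underlying service
        (an empty list means the source is exhausted)."""

    @abstractmethod
    def store(self, items: list) -> int:
        """Write items to the underlying service; return the number
        successfully received."""

def migrate(source: Endpoint, destination: Endpoint, batch_size: int = 100) -> int:
    """The platform's job in miniature: pull batches from the source
    and hand them to the destination until the source runs dry."""
    total = 0
    while batch := source.fetch(batch_size):
        total += destination.store(batch)
    return total
```

Because both sides implement the same contract, the platform never needs to know whether an endpoint is Gmail, an IMAP server, or anything else.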
Our platform also has intelligence that detects when a destination is receiving data slowly. When this happens, the platform stops requesting data from the source until the destination can respond. After the destination confirms successful receipt, the platform takes more items from the source. We only request data as fast as the destination can accept it, so an end-user's data is always in either the source or the destination endpoint, never on our servers. This prevents a lot of security risks as well as storage expenses.
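The pull-only-on-confirmation loop described above can be sketched as follows. This is a toy model under my own assumptions (the class and function names are made up), not ShuttleCloud's implementation:

```python
import time

class SlowDestination:
    """Toy destination that rejects writes while 'busy', standing in
    for a real endpoint that responds slowly."""
    def __init__(self, reject_first_n: int):
        self.rejections_left = reject_first_n
        self.received = []

    def store(self, items) -> bool:
        if self.rejections_left > 0:
            self.rejections_left -= 1
            return False            # not ready: signal the platform to wait
        self.received.extend(items)
        return True                 # confirmed receipt

def drain(fetch_batch, destination, backoff_s=0.001):
    """Request the next batch from the source only after the destination
    confirms the previous one, so data is never parked on our side."""
    while batch := fetch_batch():
        while not destination.store(batch):
            time.sleep(backoff_s)   # stop pulling from the source until ready
```

The key property is that the outer loop (reading from the source) cannot advance while the inner loop (waiting on the destination) is stuck, which is exactly the backpressure behavior described above.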
Where are the bottlenecks?
Moving data is difficult: you need bandwidth and good throughput in every piece of the system. That means every component has to keep up with the pace, and each one needs a recovery system.
Endpoints are often the primary bottleneck because neither a source nor a destination can work as fast as our platform. In many systems storage is an important bottleneck, but not in our case. Since we only store the plan for a migration, we make very light use of servers and avoid this otherwise typical issue.
Describe a project you’re proud of.
ShuttleCloud’s platform is the most beautiful piece of software I’ve worked on in my entire career. It’s built on top of Spring Integration, which orchestrates the migration process as a whole. To do its job, the platform uses RabbitMQ for internal messaging, Redis for intermediate operations, and MySQL as a resource tracker, so we know the status of every resource at all times. This lets us stop or resume a migration, retry only the failed resources, and so on.
Our platform also supports failover: if a component goes down, the platform retries periodically until the component recovers. To make this safe, we made every operation in the platform idempotent, meaning a failed operation can be repeated without affecting existing or completed tasks. No matter how many times an operation is repeated, its effects happen only once, which is critical for avoiding duplicates. It took 4 months to add this feature to the platform and we’re very happy with the results.
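The essence of an idempotent operation is that a retry of something already completed is a no-op. A minimal sketch of that idea (the names are illustrative, and a real system would persist `completed` in durable storage such as the MySQL tracker mentioned above):

```python
def apply_once(op_id, execute, completed):
    """Run execute() at most once per op_id: retrying a finished
    operation replays its recorded result instead of re-running it,
    so a crash-and-retry can never duplicate its effects."""
    if op_id in completed:
        return completed[op_id]     # already done: replay is a no-op
    result = execute()
    completed[op_id] = result       # record success so retries short-circuit
    return result
```

With every operation wrapped this way, the failover logic can blindly re-run a whole batch after a crash; only the operations that never completed actually execute again.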
Another early bottleneck was using MySQL to store all the data. It simply wasn’t fast enough, and while we tried several approaches such as sharding, the extra complexity of backing up every shard was too much. We then tried different database managers, and finally we decided to change what data we stored. Ultimately we divided it into temporal data and permanent data, and that’s when we learned about Redis. We started using it immediately, and that change alone doubled our capacity; we are still far from exhausting the MySQL and Redis nodes. An added benefit is that we can scale the system’s capacity simply by spinning up machines, not by writing new code.
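The temporal/permanent split can be illustrated with a small in-memory model. The class and field names are my own; here a plain dict stands in for MySQL (the durable migration plan) and another for Redis (fast-changing, disposable progress state):

```python
class MigrationStore:
    """Illustrative split: the migration plan is permanent data
    (standing in for MySQL), while per-item progress is temporal data
    (standing in for Redis) that is discarded once the migration ends."""
    def __init__(self):
        self.permanent = {}   # durable: survives after the migration
        self.temporal = {}    # fast-changing: disposable when finished

    def create_plan(self, migration_id, items):
        self.permanent[migration_id] = {"items": list(items), "status": "running"}

    def mark_done(self, migration_id, item):
        # High-churn writes land in the fast store, not the database.
        self.temporal.setdefault(migration_id, set()).add(item)

    def finish(self, migration_id):
        self.permanent[migration_id]["status"] = "done"
        self.temporal.pop(migration_id, None)  # temporal state is discarded
```

Because the high-frequency writes never touch the durable store, the database only sees the plan and its final status, which is why the split relieved the MySQL bottleneck.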