Democratic Republic of Data Science: Building an enterprise-wide data science platform

Sanjeev Singh Kenwar
9 min read · Jul 5, 2020

Problem statement

It would be an understatement to say that Big Data and Data Science (predictive analytics, machine learning, etc.) are changing the world. From the domain of startups and big tech, Data Science is rapidly becoming mainstream across the board. This is helped by the fact that the typical worker is becoming more technically educated, has powerful computers, and has tools (mostly free) available to experiment with. As good as that sounds, it does pose some challenges that must be addressed without inhibiting innovation. These workers are not necessarily IT people and are not subject to the same rigor IT personnel would apply when developing software. The result can be solutions popping up all over in an uncontrolled manner, built on a wide variety of not-fit-for-purpose, non-scalable, or non-productionizable tools. This matters especially for heavily regulated industries such as financial services.
However, rather than putting restrictions on innovation, we can think ahead and proactively provide these users with scalable technologies in a controlled manner. This article can serve as a guide on how to go about it.

Technologies mentioned in this article:

HDFS, Spark, Kafka, Docker, Kubernetes, internal cloud, microservices, Oracle, Tableau, APIs, API gateways, ZooKeeper (service registry), Alteryx, Python, Jupyter Notebook, React, Databricks

Depending on how big you want to go, some technologies can be skipped. Most of these are open source and cloud ready, and the architecture should scale to thousands of users.

Our roadmap for the journey is as follows.

Vision

Our vision is that people should be able to do heavy analytics on big data while keeping data-control issues in mind. The high-level steps they will follow are: 1) make data available (manually or automatically), 2) perform computation on the data, and 3) deliver the data somewhere. Of course, life is much more complicated than this, and we have to worry about the what, why, who, where, when, and how questions around each step.

MVP: walk before you run

For the MVP we will start small, to achieve a quicker go-to-market with a workable solution. Users will develop their analytical solutions in Jupyter notebooks, and the user stories for the MVP are as follows:

  • User can create a new Jupyter Notebook project (called a Notebook henceforth) through the Jupyter interface. Upon creation of the project, a Git project (with master and development branches), a raw data input directory, a transformed data input directory, and an output directory on HDFS are also created (a provisioning sketch follows this list). The user can grant read or write access to others in her group. Each Notebook also runs in its own virtual environment.
MVP Backend mechanics
  • User brings her own data and loads it through a web interface, where she first selects her project from the list of entitled projects.
  • User should not be able to import arbitrary libraries in the Notebook, though Pandas, scikit-learn, PySpark, NumPy, and a few others will be available in the background. This is to prevent OS-level imports.
  • User can commit and push the Notebook to a Git repository under the development branch auto-created for the Notebook's Git project.
  • User is able to push data to the output directory specific to the project.
  • The output can be downloaded via a web interface.
  • User can also request custom transformations on raw data, which will be performed with Spark (triggered by the raw data load) and dropped into the transformed input directory.
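
To make the first user story concrete, here is a minimal provisioning sketch. It assumes the `hdfs` Python client (WebHDFS) and a local `git` binary; the paths, the namenode URL, and the `provision_project` function itself are illustrative, not a prescribed implementation.

```python
# A minimal sketch of the provisioning step, assuming the `hdfs` WebHDFS client
# and a local `git` binary. All names, paths, and URLs are illustrative.
import subprocess
from hdfs import InsecureClient

HDFS_URL = "http://namenode:50070"   # assumption: WebHDFS endpoint of the cluster

def provision_project(project: str, owner: str) -> None:
    # 1) Create the three project directories on HDFS.
    client = InsecureClient(HDFS_URL, user=owner)
    for sub in ("raw_input", "transformed_input", "output"):
        client.makedirs(f"/projects/{project}/{sub}")

    # 2) Create a bare Git repo with master and development branches.
    repo, work = f"/git/{project}.git", f"/tmp/{project}"
    subprocess.run(["git", "init", "--bare", repo], check=True)
    subprocess.run(["git", "clone", repo, work], check=True)
    subprocess.run(["git", "-C", work, "checkout", "-b", "master"], check=True)
    subprocess.run(["git", "-C", work, "commit", "--allow-empty",
                    "-m", "initial commit"], check=True)
    subprocess.run(["git", "-C", work, "branch", "development"], check=True)
    subprocess.run(["git", "-C", work, "push", "origin",
                    "master", "development"], check=True)

    # 3) Give the Notebook its own virtual environment.
    subprocess.run(["python", "-m", "venv", f"/envs/{project}"], check=True)
```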

MVP Architecture

Pre-Cloud MVP architecture

If you can afford a solution such as Databricks, go ahead and use it; it will take care of all the user stories. Either way, your solution will look like this:

Jupyter Notebook in Databricks community edition
User admin screen in Databricks community edition

However, for our purposes we will be using JupyterHub, with some customization.

Note that for the MVP we are not going straight to the cloud, and we will assume fewer than 100 users. We will use The Littlest JupyterHub on an Ubuntu server; the assumption here is that you are still experimenting. Alternatively, you can use the JupyterHub Docker image, which you can spin up on a local computer or your server. Once you are ready to scale, move JupyterHub to a Kubernetes cluster on the cloud in autoscaling mode. Step-by-step instructions are here. You can also define how much compute power each user or group of users gets (see the sketch below). Some starter code to automate this can be found here.
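
For those per-user compute limits, a minimal `jupyterhub_config.py` sketch might look like the following. The values are illustrative, and note that enforcement of `mem_limit`/`cpu_limit` depends on the spawner in use (they are honored by container-based spawners such as DockerSpawner or KubeSpawner, not by the default local process spawner).

```python
# jupyterhub_config.py — a minimal sketch of per-user caps; values are illustrative.
c.Spawner.mem_limit = "2G"    # cap each single-user server at ~2 GB of RAM
c.Spawner.cpu_limit = 2.0     # and roughly two CPU cores
c.Spawner.default_url = "/lab"                      # open JupyterLab by default
c.Authenticator.allowed_users = {"alice", "bob"}    # illustrative allow-list
```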

The MVP architecture is good for an initial release but needs modification to make it scalable. We want to scale to thousands of users while ensuring minimal time is spent on data preparation. The next sections describe how, and which tools you will need, but let us first address why the MVP is not enough and why we need to migrate it to the cloud:

  • The MVP can handle only a limited number of users.
  • We want to flexibly grow or shrink the resources it needs.
  • We want to use container technology to administer user sessions, making them more secure.
  • The current HDFS/Spark cluster is overloaded with input, output, and any custom Spark-based transformation needs.
  • The user is required to do a lot of data wrangling upfront, whereas most data could be obtained directly from authoritative sources, reducing the data wrangling burden and improving data governance.
  • On the output side, we are not making use of visualization tools like Tableau to provide further insight into the data.
  • Most services are clubbed into a monolithic architecture, which results in longer release cycles.
  • It provides minimal logging.

Thinking Big: Going beyond the MVP

Now we want to take our system to a whole new level. Technically, and simply speaking, we do this by identifying the relevant domains, grouping them into microservices, packaging each microservice in a Docker container, deploying them on a Kubernetes platform, and scaling as needed.

Quick word on data source onboarding: reducing the data burden
In big companies, applications have grown organically, resulting in a spaghetti of data flows. This is not an uncommon scene in a financial services organization:

source: http://hdclarity.com/wp-content/uploads/2012/06/Spaghetti1.gif

From a user's perspective, they are incentivized to use the most accessible data source for their purposes. This poses some problems: 1) it may not be an authoritative source, 2) we have no visibility into what data transformations the user has done, and 3) we do not want users to spend time on data collection but rather on data analysis.

To alleviate this, a data onboarding process should be put in place where we collect metadata about each dataset (what, when, from where, any special handling) that the user plans to feed into and out of her computation. Think of a dataset as a table, the result of a query, or a JSON file. A dataset lives in the context of a project but can be shared across projects to reduce duplication of data.

The input dataset will be checked against this metadata and will carry the same entitlements as the user. This will require some additional development work, operational procedures, and an understanding of which authoritative sources exist for the kind of data the user is requesting. Our focus here will be the system development work.

Metadata Microservice:
You can use any of these tools, or build your own tables on a relational database like Oracle, MySQL, or SQL Server; even SQLite will do. This will be a small database. An API layer (e.g., using Django) should be provided to perform CRUD operations, and the user interface will make use of this API.
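
As an illustration of what such a metadata store might capture, here is a minimal Django model sketch; the field names and choices are assumptions, not a prescribed schema. Django REST Framework (or plain Django views) can then expose CRUD endpoints over it.

```python
# models.py — an illustrative Django sketch of the dataset metadata; not a prescribed schema.
from django.db import models

class Dataset(models.Model):
    DIRECTIONS = [("IN", "Input"), ("OUT", "Output")]

    project = models.CharField(max_length=100)            # owning project
    name = models.CharField(max_length=200)                # "what"
    source_system = models.CharField(max_length=200)       # "from where"
    refresh_schedule = models.CharField(max_length=100)    # "when", e.g. a cron string
    direction = models.CharField(max_length=3, choices=DIRECTIONS)
    special_handling = models.TextField(blank=True)        # PII flags, retention, etc.
    shared = models.BooleanField(default=False)            # reusable across projects
    created_at = models.DateTimeField(auto_now_add=True)

    class Meta:
        unique_together = ("project", "name")
```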

Quick word on the API strategy:
All your APIs should sit behind an API gateway to simplify authentication; there are many vendors in this space. You will also need a service registry to register your APIs, and you can use ZooKeeper for this. This will be very useful once the system moves to a microservices architecture, as it allows endpoints to change dynamically.
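
Registering an API instance with ZooKeeper can be as simple as creating an ephemeral znode, so the entry disappears when the instance dies. The sketch below assumes the `kazoo` Python client; the znode layout and host details are illustrative.

```python
# Registering an API instance in ZooKeeper, assuming the `kazoo` client.
# The znode layout (/services/<name>/instance-*) and host details are illustrative.
import json
from kazoo.client import KazooClient

zk = KazooClient(hosts="zookeeper:2181")
zk.start()

service_path = "/services/metadata-api"
zk.ensure_path(service_path)

# Ephemeral + sequential node: it disappears automatically if this instance dies,
# so the gateway (or any consumer) always sees only live endpoints.
zk.create(
    f"{service_path}/instance-",
    json.dumps({"host": "10.0.0.12", "port": 8000}).encode("utf-8"),
    ephemeral=True,
    sequence=True,
)
```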

Quick word on how we will be using Kafka:
You can get a quick primer on Kafka here. We will use Kafka to capture and send messages to the various components in the architecture for things like triggers (e.g., availability of data), failures, user activities, etc.
PS: Pulling large data through Kafka is not recommended. However, if you want to also use this mechanism to send data, you may have to break it into chunks and adjust the message.max.bytes setting, which by default is 1 MB per message. This is an old but handy post on the topic.
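
If you do decide to push payloads through Kafka, a chunking sketch might look like this. It assumes the `kafka-python` client; the chunk size, topic, and header convention are illustrative, and the broker-side `message.max.bytes` (plus the consumer's fetch settings) must also accommodate the chosen size.

```python
# Chunking a large payload before sending it through Kafka, assuming `kafka-python`.
# Chunk size, topic, and the header convention are illustrative; broker and consumer
# settings must also allow the chosen message size.
from kafka import KafkaProducer

CHUNK_SIZE = 900_000  # stay under the default ~1 MB per-message limit

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    max_request_size=1_048_576,   # client-side cap, mirrors message.max.bytes
)

def send_in_chunks(topic: str, key: bytes, payload: bytes) -> None:
    total = (len(payload) + CHUNK_SIZE - 1) // CHUNK_SIZE
    for i in range(total):
        chunk = payload[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE]
        # Carry the chunk index/total in a header so the consumer can reassemble.
        producer.send(topic, key=key, value=chunk,
                      headers=[("chunk", f"{i + 1}/{total}".encode())])
    producer.flush()
```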

Dataset Microservices:
These will be multiple microservices, one for each dataset. Each dataset microservice will consist of Kafka producer code and code to fetch the data, as described below.

Checking data availability from the source system:
From the metadata we know what data a user wants to bring in and from where. Kafka connectors can connect to a variety of sources and can also make REST API calls. Create a Kafka topic for each dataset, which will receive data-availability messages from producer code using a Kafka connector. The producer code will apply logic around when and how frequently to pull data, which can be defined in the metadata or custom built. The messages from the Kafka topics can be exposed to the user interface.
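
A dataset's availability producer could be as simple as the polling loop below. This assumes the `kafka-python` client; `source_has_new_data()`, the topic name, and the polling interval are placeholders for whatever the metadata service prescribes.

```python
# An illustrative availability producer for one dataset, assuming `kafka-python`.
# `source_has_new_data()`, the topic name, and the polling interval are placeholders.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def source_has_new_data() -> bool:
    ...  # hypothetical: a REST call to the source system, a landing-zone check, etc.

while True:
    if source_has_new_data():
        producer.send("dataset.trades.availability", {
            "dataset": "trades",
            "project": "credit-risk",
            "event": "DATA_AVAILABLE",
            "as_of": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        })
        producer.flush()
    time.sleep(15 * 60)   # polling frequency would come from the metadata
```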

Similarly, once a project computation has run successfully, a message can be sent to Kafka, which can then be used to push out a dataset if so desired.

Pulling & persisting or pushing dataset:
The connectors to data sources can be custom built, or you can use a vendor like CData (https://www.cdata.com/), which provides a vast collection of connectors for pulling large amounts of data. The trigger to pull data and persist it on HDFS, or to push data from HDFS to another location, is received from the Kafka topic via consumer code. Once a dataset is pulled, the data can be profiled. A message is then sent back to Kafka with the data profile and confirmation that the dataset has been stored successfully (or not). Similarly, once a dataset is successfully pushed, a message is sent back to Kafka.
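
On the consuming side, a sketch of the pull-and-persist flow might look like this. It assumes `kafka-python` and PySpark; `fetch_dataframe()` is a hypothetical wrapper around whichever connector actually pulls the data, and the topic names and paths are illustrative.

```python
# The pull-and-persist side, assuming `kafka-python` and PySpark. `fetch_dataframe()`
# is a hypothetical wrapper around the connector; topics and HDFS paths are illustrative.
import json

from kafka import KafkaConsumer, KafkaProducer
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataset-trades-loader").getOrCreate()

consumer = KafkaConsumer(
    "dataset.trades.availability",
    bootstrap_servers="kafka:9092",
    group_id="trades-loader",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

for message in consumer:
    event = message.value
    df = fetch_dataframe(event)               # hypothetical: pull via the connector
    path = f"/projects/{event['project']}/raw_input/{event['dataset']}"
    df.write.mode("overwrite").parquet(path)  # persist into the project's HDFS area
    producer.send("dataset.trades.status", {
        "dataset": event["dataset"],
        "status": "STORED",
        "rows": df.count(),                   # a simple data profile
        "path": path,
    })
    producer.flush()
```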

JupyterHub Microservices: Zero to JupyterHub for Kubernetes
Deploy JupyterHub on Kubernetes and voilà, you gain resilience to failures in the system. For example, if JupyterHub fails, user sessions will not be affected (though new users will not be able to log in). When the JupyterHub process is restarted, it reconnects to the user database and the system returns to normal.
JupyterHub can be configured to talk to the HDFS/Spark clusters and to access project-specific directories. Compute resources can further be defined per user (or user group). It provides kernels for a variety of languages (the most common being Python, Julia, and R).
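
From inside a user's notebook, "talking to the HDFS/Spark cluster" might look like the PySpark snippet below; the YARN master, project path, and column name are assumptions for illustration.

```python
# Inside a user's notebook: a minimal PySpark session against the shared cluster.
# The YARN master, project path, and column name are assumptions for illustration.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("yarn")                            # assumption: Spark on YARN next to HDFS
    .appName("credit-risk-notebook")
    .config("spark.executor.memory", "4g")
    .getOrCreate()
)

# Read from the project's transformed-input directory and analyse as usual.
df = spark.read.parquet("/projects/credit-risk/transformed_input/trades")
df.groupBy("counterparty").count().show()
```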

HDFS/Spark clusters Microservices:
As part of the MVP this should already exist, and migrating it to the cloud should be a drop-in replacement. Deploying HDFS/Spark on Kubernetes is not straightforward and should be avoided if possible. Also, the data does not have to reside on HDFS; it can live anywhere else, such as S3.

Output reporting:
If so desired, each output dataset can be stored in a project-specific relational database (RDBMS) against which Tableau can run.
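
A simple way to land an output dataset in that RDBMS is pandas plus SQLAlchemy, as sketched below; the connection string, table, and file names are illustrative (any SQLAlchemy-supported database works, and the appropriate DB driver must be installed).

```python
# Landing an output dataset in the project's reporting database for Tableau,
# assuming pandas + SQLAlchemy; connection string, table, and file names are illustrative.
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://reporting:password@reporting-db:5432/credit_risk")

# Output produced by the notebook, pulled down from the project's HDFS output directory.
output = pd.read_parquet("exposure_summary.parquet")

# One table per output dataset; Tableau connects to the same database.
output.to_sql("exposure_summary", con=engine, if_exists="replace", index=False)
```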

Migration to Kubernetes:
Up to this point we have broken our platform into several domain-specific services. Each of these services can be containerized into a Docker image and deployed on a Kubernetes cluster.

The target architecture will look like this.
