Metaflow: Netflix’s Open-Source Data Science Framework

July 23, 2020

Metaflow, a Python framework for data science developed by Netflix, was released as an open-source project in December 2019.

The developer team at Netflix has said that it had been using Metaflow internally for more than two years before the release. The framework is used for a wide variety of real-life data science use cases, ranging from natural language processing (NLP) to operations research.

Netflix developed the framework to boost data scientists' productivity across a wide range of projects, from classical statistics to deep learning. By providing a unified API on top of the infrastructure stack, the framework is useful at every stage of a data science project, from prototype to production. This is the main reason Metaflow is often described as being designed for data scientists rather than for machines.

Typical features of Metaflow

The framework addresses some of the major challenges data scientists face, such as version control and scalability. A processing pipeline is written as a sequence of steps in a graph. The framework makes it easier to move from running a pipeline on a local machine to running it on cloud resources (currently, only AWS is supported).
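The idea of a pipeline as steps in a graph can be illustrated with a minimal sketch in plain Python. This is not Metaflow's API; the names run_pipeline, STEPS, and DEPENDS_ON are hypothetical and chosen only for illustration. It assumes Python 3.9 or later for the standard-library graphlib module.

```python
from graphlib import TopologicalSorter

# Hypothetical steps: each one reads from and writes to a shared state dict.
def load(state):
    state["data"] = [1.0, 2.0, 3.0, 4.0]

def fit(state):
    state["mean"] = sum(state["data"]) / len(state["data"])

def report(state):
    state["summary"] = f"mean={state['mean']:.2f}"

STEPS = {"load": load, "fit": fit, "report": report}

# Maps each step to the set of steps it depends on (the edges of the graph).
DEPENDS_ON = {"load": set(), "fit": {"load"}, "report": {"fit"}}

def run_pipeline(steps, depends_on):
    """Run every step after its dependencies; return the order and final state."""
    state = {}
    order = []
    for name in TopologicalSorter(depends_on).static_order():
        steps[name](state)
        order.append(name)
    return order, state

order, state = run_pipeline(STEPS, DEPENDS_ON)
print(order)
print(state["summary"])
```

Because the steps and their dependencies are declared separately from the code that runs them, the same pipeline definition could in principle be handed to a different executor, which is the property that makes moving from a laptop to the cloud easier.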

Create models using top tools

Metaflow works with data science libraries such as scikit-learn, PyTorch, and TensorFlow, so models can be written in ordinary, idiomatic Python with little additional learning curve.

Developing with Metaflow

The framework speeds up the design of workflows and makes them easier to scale. It also versions workflows automatically and keeps track of all ongoing experiments.

Powers up AWS cloud infrastructure

To make scaling easier, Metaflow has built-in integration with AWS cloud services for storage, computing, and machine learning. The best part is that moving a workflow to the cloud does not require any changes to the code.

Developed to help data scientists

With Metaflow, data scientists can focus on extracting value from data for their particular use case rather than spending their time on engineering work.

Netflix: intertwining Metaflow with the data science infrastructure stack

In practice, models are only a small part of a data science project. Most of a project depends on the infrastructure stack. With Metaflow's help, developers and data scientists can cover every layer of that stack.

Data is accessed from a data warehouse, which could be anything from a local folder to a petabyte-scale data lake. The modeling code that processes the data runs in a computing environment, and a job scheduler orchestrates its execution.

Next, the team structures the code so that it can be executed, typically as an object hierarchy, Python modules, or Python packages. The code, the input data, and the resulting models then need to be versioned. Once a model is deployed to production, further questions arise: how can the code be kept running reliably, and how can its performance be monitored?

Metaflow was developed to address such concerns. The framework offers a comprehensive approach to managing the stack. Developers can use Metaflow with any of the available machine learning libraries, such as PyTorch, TensorFlow, and scikit-learn.

Metaflow stands out for its built-in features for building and deploying data science workflows. These features include:

Managing external dependencies
Managing computing resources
Versioning, replaying, and resuming workflow runs
Carrying out containerized runs
Moving back and forth between local machines and cloud servers based on execution demand
Using the client API to inspect past runs

The framework also automatically captures snapshots of the data, the code, and the dependencies in a data store. The data store is typically backed by S3, although the local filesystem is also supported.

This makes it possible to resume a workflow, reproduce past results, and inspect a workflow from a notebook, which in turn boosts the productivity of data scientists and developers.
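To show why snapshots make resuming possible, here is a minimal sketch that stores the state after each step on the local filesystem and skips steps that already have a snapshot. It is an illustration only, not Metaflow's actual data store or API; run_with_snapshots, the run ID, and the one-pickle-file-per-step layout are assumptions made for this example.

```python
import pickle
import tempfile
from pathlib import Path

def run_with_snapshots(steps, store, run_id):
    """Run (name, func) steps in order, pickling the state after each one.

    A step whose snapshot already exists for run_id is skipped and its
    saved state is loaded instead, so a failed run can be resumed.
    """
    run_dir = store / run_id
    run_dir.mkdir(parents=True, exist_ok=True)
    state = {}
    executed = []
    for name, func in steps:
        snapshot = run_dir / f"{name}.pkl"
        if snapshot.exists():
            state = pickle.loads(snapshot.read_bytes())
            continue
        func(state)
        snapshot.write_bytes(pickle.dumps(state))
        executed.append(name)
    return state, executed

attempts = {"count": 0}

def load(state):
    state["data"] = [3, 1, 2]

def flaky_sort(state):
    # Fails on the first call to simulate a crash in the middle of a run.
    attempts["count"] += 1
    if attempts["count"] == 1:
        raise RuntimeError("simulated failure")
    state["sorted"] = sorted(state["data"])

STEPS = [("load", load), ("sort", flaky_sort)]
store = Path(tempfile.mkdtemp())

try:
    run_with_snapshots(STEPS, store, "run-1")
except RuntimeError:
    pass  # "sort" failed, but the snapshot written after "load" is kept

# The second attempt loads the "load" snapshot and only executes "sort".
state, executed = run_with_snapshots(STEPS, store, "run-1")
print(executed)
print(state["sorted"])
```

The same saved snapshots could also be opened later from a notebook to examine what a past run produced, which is the role the article attributes to the client API.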

Metaflow is now available as an open-source data science project for anyone looking to explore and utilize it to the fullest.
