Six ways to improve data science in the cloud

The data science market is flourishing, with a growing number of companies across sectors placing data at the forefront of their digital transformation strategies. The rise of data analytics has seen demand for data scientists and data engineers more than triple over the past five years, rising by as much as 231%. Yet while businesses rush to hire the talent they need to make their plans a reality, many are still some way from realizing the full value their data can offer.

About the author

Julien Alteirac is the Regional Vice President of UK&I at Snowflake.

Organizations that previously used legacy architecture often face challenges when attempting to retrofit their systems to the cloud. As a result, adapting can be difficult, and habits and biases carried over from the on-premises world can limit understanding of what’s possible in the cloud. Data scientists, data engineers and developers are all having to acclimate to their new cloud environments and a rapidly evolving ecosystem of tools and frameworks. With many learning on the job, businesses risk not maximizing the potential of their cloud architecture.

If harnessed correctly, the cloud can revolutionize data science and create an exciting frontier for companies to better understand customers, monetize data in new ways, and make predictions about the future. Data teams now have access to a vast pool of elastic computing power, as well as numerous sources of internal and external data. Managed cloud services are also available to reduce the complexity of building, training and deploying machine learning and deep learning models at scale. Here are six tangible strategies that businesses can follow to make the most of data science in the cloud.

1. Don’t compromise on data governance

It’s critical for businesses to enable iteration and investigation into data without compromising governance and security. Before they start working on a dataset, many data scientists instinctively want to make a copy of the original. All too often, those copies are made and then forgotten, creating problems for compliance, security and privacy. A modern data platform should enable data teams to work on snapshots, or virtual copies, without needing to duplicate entire datasets, all while maintaining fine-grained controls to ensure that only the right users and applications have access to the data at hand. Businesses must create processes that minimize duplication to ensure internal and external data governance policies are being met.
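
As an illustration, a snapshot-based workflow might look something like the sketch below, assuming a Snowflake-style platform that supports zero-copy cloning; the connection details, database objects and role names are placeholders rather than a prescribed setup.

```python
# A minimal sketch of working on a virtual copy instead of duplicating a dataset,
# assuming a Snowflake-style platform with zero-copy cloning. All credentials,
# database, schema and role names below are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",        # placeholder credentials
    user="my_user",
    password="my_password",
    warehouse="analytics_wh",
    database="analytics_db",
)
cur = conn.cursor()

# Zero-copy clone: a metadata-only snapshot, no second physical copy to track
cur.execute("CREATE TABLE sandbox.customers_snapshot CLONE prod.customers")

# Fine-grained access: only the data science role can read the snapshot
cur.execute("GRANT SELECT ON TABLE sandbox.customers_snapshot TO ROLE data_scientist")

cur.close()
conn.close()
```

Because the clone shares storage with the original until rows change, data teams can iterate freely without leaving untracked copies behind.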

2. Start with what you want to achieve

Pre-existing biases from operating on-premises can hold businesses back and stop them focusing on what they wish to achieve with their data. For example, a common misconception surfaces when data scientists say: “I’d love to retrain my model several times a day, but it’s too slow and will delay other processes.” That simply isn’t an issue in a world of elastic infrastructure. When migrating to the cloud, it’s therefore important for companies to recognize the full breadth of new capabilities on offer and dispel any biases that are no longer relevant in a cloud model.

Removing these preconceptions will empower businesses to make the most of their data and be ambitious. Once in this position, data teams must start with what they want to achieve, not what they think is possible, and move forward from there. That’s the only way they can push boundaries and take full advantage of the cloud.

3. Create a single source of truth

Closely tied to data governance is the problem of silos, which occur when datasets sit in separate systems, meaning no single person or team has a holistic view of all the data in the organization’s possession. The proliferation of tools, platforms and vendors is great for innovation, but it can also lead to redundant, inconsistent data being stored in multiple locations. Another cause of fragmentation is when structured data is stored in one environment, such as a data warehouse, while semi-structured data ends up in a data lake. Besides compromising governance and security, this fragmentation gets in the way of better predictions and classifications.

To tackle data silos head on, organizations should work with a cloud data platform that provides a global, consolidated view of their data. That means a platform that can accommodate structured, semi-structured and unstructured data side by side. It also means a platform that can provide a single instance of this data across multiple cloud providers and tools — not six versions of data replicated across different platforms and environments.
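
As a rough illustration of what “side by side” can look like in practice, the sketch below lands a structured extract and a semi-structured JSON feed in one consolidated view; the file names and columns are hypothetical.

```python
# A minimal sketch of analyzing structured and semi-structured data together
# rather than in separate silos. File names and fields are placeholders.
import json
import pandas as pd

# Structured data, e.g. an export from the warehouse
orders = pd.read_csv("orders.csv")             # order_id, customer_id, amount

# Semi-structured data, e.g. clickstream events stored as JSON
with open("events.json") as f:
    events = pd.json_normalize(json.load(f))   # flattens nested JSON into columns

# One consolidated view instead of two disconnected datasets
combined = orders.merge(events, on="customer_id", how="left")
print(combined.head())
```

The same principle scales up: a platform that can query both kinds of data in place avoids the replication that fragments a single source of truth.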

4. Exploit new tools and technologies

One of the exciting things about data science is that frameworks and tools are evolving at an incredible pace, but it’s critical for businesses to not get locked into an approach that limits their options when technologies fall in and out of favour. To give one example: Spark ML used to be the answer to most large-scale training problems, but now TensorFlow and PyTorch are capturing the most attention. Businesses never know what will happen next year, or next week for that matter. They should choose a data platform that won’t tie them into one framework or one way of doing things, one that has an architecture that can grow as they do and accommodate new tools and technologies as they come along.
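
One practical way to stay flexible is to keep pipeline code behind a thin interface so the framework underneath can be swapped out as tools come and go. The sketch below is illustrative only; the class and function names are not any particular library’s API.

```python
# A minimal sketch of framework-agnostic training code. Pipeline logic depends
# only on a small interface, not on Spark ML, TensorFlow or PyTorch directly.
from typing import Any, Protocol


class Trainer(Protocol):
    def fit(self, features: Any, labels: Any) -> None: ...
    def predict(self, features: Any) -> Any: ...


def run_training(trainer: Trainer, features: Any, labels: Any) -> Trainer:
    """Orchestration code stays the same no matter which framework is plugged in."""
    trainer.fit(features, labels)
    return trainer


class SklearnTrainer:
    """One concrete backend; a PyTorch or TensorFlow wrapper could replace it
    without touching run_training above."""

    def __init__(self) -> None:
        from sklearn.linear_model import LogisticRegression
        self._model = LogisticRegression()

    def fit(self, features, labels) -> None:
        self._model.fit(features, labels)

    def predict(self, features):
        return self._model.predict(features)
```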

5. Embrace third-party data

The cloud makes it much easier to incorporate external data from partners and data-service providers into models. This was particularly important over the past year, as businesses sought to understand how COVID-19, fluctuations in the economy, and the resulting changes in consumer behavior would affect them. For example, organizations used data about local infection rates, foot traffic in stores and insights from social media to predict consumer buying patterns and forecast inventory needs. By doing so, they were able to adjust product inventory and staff numbers to meet customer demand.
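
A minimal sketch of that kind of blending is shown below: internal weekly sales joined with external signals such as infection rates and foot traffic. The file and column names are hypothetical placeholders for data obtained from a provider.

```python
# A minimal sketch of blending internal sales data with external signals.
# File and column names are hypothetical placeholders.
import pandas as pd

sales = pd.read_csv("weekly_sales.csv")          # week, region, units_sold
external = pd.read_csv("external_signals.csv")   # week, region, infection_rate, foot_traffic

blended = sales.merge(external, on=["week", "region"], how="inner")

# A quick look at how the external signals move with demand
print(blended[["units_sold", "infection_rate", "foot_traffic"]].corr())
```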

Capital One is just one example of a business successfully harnessing Snowflake’s Data Marketplace to access and share data with third parties quickly and securely. The bank uses data from Starschema, a third-party data provider, as part of its broader efforts to understand the impact of COVID-19, helping it forecast and plan response scenarios for its workforce and its customers. This is something all businesses should take note of: by exploring the numerous external data sources available to them, organizations can better address some of their most pressing business needs.

6. Don’t overcomplicate the process

It’s often said that when you have a hammer, everything looks like a nail, and this applies to AI technologies like machine learning and deep learning. They are immensely powerful and have a critical role to play for certain business needs, but they’re not the answer to every problem. Businesses should always start with the simplest option and add complexity only as needed. Try a simple linear regression, or look at averages and medians. How accurate are the predictions? Does the ROI of increasing the accuracy justify a more complex approach? Sometimes it does, but don’t jump to that option as your first instinct. Starting with a simple approach will also help data teams transition from legacy architecture with ease and make the most of the ensuing benefits, such as third-party data.
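
The sketch below shows what “start simple” can look like in code: a mean baseline compared against a plain linear regression on synthetic data, purely for illustration.

```python
# A minimal sketch of "start simple": compare a mean baseline against a plain
# linear regression before reaching for deep learning. The dataset is synthetic.
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, model in [("mean baseline", DummyRegressor(strategy="mean")),
                    ("linear regression", LinearRegression())]:
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_test, model.predict(X_test))
    print(f"{name}: MAE = {mae:.3f}")

# Only if the simpler models leave accuracy (and ROI) on the table is it worth
# escalating to gradient boosting or deep learning.
```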

Once companies leave behind their preconceptions about the cloud and exploit its full potential, they will create an environment that delivers enhanced and dynamic data analytics. This will empower data teams to better understand their customers and create new revenue streams by monetizing their data. As the data science market continues to grow, now is the time to address the challenges of migrating to the cloud and improve the way analytics is deployed and used.
