The stages of the data organization


As someone who has researched and helped shape data organizations at startups of different stages, sizes, and industries, I find this an interesting challenge, and one that most companies need to focus on early. Making the right decisions on this front will not only help companies reduce cost, duplicated work, and turnover, but also help remove dysfunctional company cultures.

With that in mind, I will share some of my views on how structures could be defined in different data organizations. These views come from my personal experience and opinions formed over my years in the data field.

Note: this post does not cover data analytics companies, in which data (or AI and data science in particular) is the main product that brings in most of the revenue.

The young startups

The data organization, or rather, the data team at this stage, is usually started by a technical co-founder who is interested in doing some business reporting, visualization, or simply exploration.

At this stage, any attempt to decentralize the data team will face many difficulties, mostly in terms of budget, alignment, and efficiency. In most of these companies, the data team is still fully a cost center, and that is unlikely to change any time soon.

Data team in young startup

It is very important to establish clear team direction and a centralized data infrastructure and platform as part of the data team’s first main project. Depending on the size of the current data team, some companies might either choose to build things mostly in-house, with a good number of data engineers (and data analysts), or adopt an all-in-one solution.

Reports and analyses can take a general approach that gives the business some sense of what’s going on at a high level. Data science and machine learning work is more or less ad hoc exploration and demonstration rather than focused on specific problems, because those problems are unlikely to remain critical as the business evolves.

The overall objective is to build a strong infrastructure as well as a shared knowledge base within the team, while staying relevant to the business.

Photo: honestbee circa 2016

The hyper-growth startups

Large investments start coming in, and the data team expands at the same rate as, or sometimes (unfortunately) more slowly than, the volume of incoming requests. Those requests are no longer general but more complex and specific to each business function. Each business function now requires dedicated data analytics capabilities, and the original centralized structure is standing in the way.

At this stage, a good part of the original Data Team should be moved into a Data Platform Team, whose main goal is to improve the data infrastructure (data lake, data warehouse, machine learning platform, ETL platform, and so on). This team’s objective is quite similar to the original team’s, with much more focus on infrastructure scalability, while its stakeholders are narrowed down to the Data Vertical itself.

As for business needs, data team members are embedded into each specific function (while still staying within the Data Vertical), forming a Business Function Data Team for each. Depending on the function and the opportunities inside it, the number of engineers, scientists, and analysts needed is to be determined and justified accordingly.

The Data Vertical is partly a community, where people share knowledge and level up their technical expertise, and partly a centralized authority, where leveling, hiring decisions, and performance reviews are fully or partly conducted.

Data vertical in a hyper-growth startup

This setup serves two main purposes: it enhances the autonomy of each business function with regard to data analytics capabilities, and it keeps the organization transparent and aligned, which reduces duplicated work, wrongly justified headcount, politics, and so on.

For reference, Spotify has discussed a very similar approach with an engineering focus.

The (tech) giants

The company has now expanded beyond anything imagined at the start; it is a multi-billion-dollar corporation. New businesses are being built and acquired within different functions. The company now operates in more countries and regions, with a large number of new hires and newly formed teams.

At this point, there are a couple of challenges that have never been addressed before:

  • Each Business Function Data Team is now as large as the original Data Vertical, and they naturally form communities within the community. Thus, the original community is no longer relevant.
  • Different internal businesses mean completely different structures. Enforcing a cross-business data organization is impossible.
  • Multiple time zones and locales make collaboration very difficult.
  • Finding local talent can be a complicated process that requires aligning global hiring standards, regional budgets, and local business requirements.

That said, this is an interesting problem that only a few companies in the world will face. There is no formal approach, only a list of suggestions that I find valuable to follow:

a) Knowledge base sharing remains a core part of the Global Data Organization. Airbnb is a good example of how this can be done.

b) Form the Global Data Platform team from the original Data Platform team, optionally with supporting members embedded in different regions and businesses. This team’s responsibility is to build scalable tools, frameworks, and infrastructure that support the whole organization. Their products should be high-level and general enough to be used across different businesses. Here are a few public examples of what such products look like:

  • Airflow: Airbnb’s workflow management platform
  • Michelangelo: Uber’s machine learning platform

c) Treat each region (or business) as a startup with its own data organization. Each is structured either as a centralized Data Team or as a cross-functional Data Vertical, depending on the size and maturity of the startup it now resembles.

Data teams and verticals in tech giants
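
The three stages described in this post can be summarized as a small recursive rule: a young startup gets a centralized Data Team, a hyper-growth startup gets a Data Vertical, and a giant treats each region or business as a startup in its own right. The sketch below is only an illustration of that reasoning. The names `Stage`, `Business`, and `recommend_structure` are hypothetical, and deciding which stage a company is actually in is left to the reader, since this post does not give numeric thresholds.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    """The three stages discussed in this post (labels are illustrative)."""
    YOUNG = "young startup"
    HYPER_GROWTH = "hyper-growth startup"
    GIANT = "tech giant"


@dataclass
class Business:
    """Hypothetical model of a company, region, or internal business."""
    name: str
    stage: Stage
    # Only used for giants: regions or businesses treated as startups.
    units: list[Business] = field(default_factory=list)


def recommend_structure(business: Business) -> dict:
    """Map a business to the data organization structure suggested above."""
    if business.stage is Stage.YOUNG:
        # One centralized team and one centralized platform.
        return {"name": business.name, "structure": "centralized Data Team"}
    if business.stage is Stage.HYPER_GROWTH:
        # A platform team plus data teams embedded in each business function.
        return {
            "name": business.name,
            "structure": "Data Vertical",
            "teams": ["Data Platform Team", "Business Function Data Teams"],
        }
    # Giants: shared knowledge base and global platform, then recurse so
    # that each region or business is treated as its own startup.
    return {
        "name": business.name,
        "structure": "Global Data Organization",
        "shared": ["knowledge base", "Global Data Platform team"],
        "units": [recommend_structure(unit) for unit in business.units],
    }
```

Calling `recommend_structure` on a giant whose `units` list mixes young and hyper-growth businesses returns a nested plan in which each unit carries its own recommended structure, which mirrors suggestion c) above.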


Transparency, alignment, and autonomy are the three important characteristics of any good team and organization.

As the company grows, scaling productivity and knowledge while preserving those three characteristics in the data organization becomes a challenge that is unique to each company.

While it is clear that creating a good structure involves a lot of forward thinking and commitment from every part of the business, it is also worth emphasizing that these structures need to constantly evolve along with the business itself.

PS: Do you want to discuss this topic further? Which stage is your data organization at, and what do you think about these structures and suggestions? Leave a comment below, send me a message on my social media, or drop me an invite for a coffee if you are in Singapore 🙂.

The stages of the data organization was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
