[2024 Annual Guide] Web3 Data Tools and Tips
Indexers, Explorers, Query engines, Data Transformations, and ZK reverse ETL - the key components you need to understand to navigate crypto data.
This year, I’ll give my raw thoughts on the latest state of the crypto data stack. Then I’ll give my usual overview of the data tooling landscape.
Disclosure: I (still) work full-time at Dune.
Thanks to Hildobby, 0xKofi, and Vishesh for their review and input to this article.
Want to join an onchain data community? I’ve created an NFT passport and quest system that any analyst can participate in.
🔍 Learn about the Bytexplorers, then mint your passport and join in
Setting the Stage (2021-2023)
By the end of 2021, we had indexers (like Alchemy), query providers (like Dune), and high touch data providers (like Messari). They mostly stayed in their own product lanes and tried to just keep up with the data and demand. You can learn about the basic layers of the landscape in my guide from two years ago.
By the end of 2022, everyone across the stack built support for at least 10 layer 1 blockchains, and had to deal with scaling issues. For indexers it meant standardizing and supporting client RPCs across chains, for query providers it meant spinning up open source abstractions, and for high touch data providers it meant aggregating dozens of different data sources across the stack for their research. These were all custom data engineering stacks that were 0-to-1 in design.
The crypto app data stack journey:
You’re a seed stage startup, and you start by indexing just your own contracts, thinking “oh this is easy and cheap”. It feels good to build your own pipelines; this is the web3 way.
Your first hurdle will be needing to access more complex calculations or multiple structs at once. The protocol team will cram as much as they can into a read function so that your frontend can work with just a few eth_call RPC requests. Even higher complexity needs can be handled by deploying new contracts specifically to handle the computation, some examples are Uniswap TWAP oracles and generic multicalls.
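As a minimal sketch of that pattern (the RPC URL is a placeholder; WETH and the totalSupply() selector are real, well-known values), a frontend can read contract state with a single eth_call:

```python
import requests

RPC_URL = "https://your-node-provider.example.com"  # placeholder endpoint
WETH = "0xC02aaA39b223FE8D0A0e5C4F27eAD9083C756Cc2"  # mainnet WETH contract

# totalSupply() has the 4-byte selector 0x18160ddd; eth_call runs the read
# function against current state without sending a transaction
payload = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "eth_call",
    "params": [{"to": WETH, "data": "0x18160ddd"}, "latest"],
}

result = requests.post(RPC_URL, json=payload).json()["result"]
print(int(result, 16) / 1e18, "WETH outstanding")
```

A multicall contract just batches many of these reads into one request, which is why teams can get surprisingly far on plain RPC alone.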
But at some point you realize you suddenly need A LOT more data:
“I need to index some of my competitors’ data”
“I need to index many more chains”
“I need more data on my users’ wallet activity for segmentation and sybil analysis”
You’ll run into all of these questions after a few reps of “discover → explore → dig → define”, which is a process I covered in last year’s guide.
You then send one or two poor engineers on a two-month slog to try and build this themselves using only the indexer, before someone says “hey we can use a subgraph for this” or “hey this Dune query has everything we need”. Perfect!
Fast forward a year, and you have a complex multichain data pipeline that is half built in-house and half built on public platforms. You’ve probably ended up contributing to one of the scattered data repos across the ecosystem, such as Zerion DeFi SDK, Dune Spellbook, and Defillama adapters.
I think that figuring out this data journey has become a rite of passage for all crypto startups. But obviously all the major data players have realized this painful process exists, so they aren’t just sitting on their hands. So what are they doing to make things easier?
What happened to the data ecosystem in 2023?
There were four main trends I saw in the data ecosystem over the last year:
Most major products tried entering new vertical markets (through M&A or development) and worked on establishing their SaaS revenue model.
I think everyone has implemented a tiered pricing system by now. You have products doing credits by second, by query, by model, by storage, etc.
Acquisitions of younger players: n.xyz acquired by Snowflake, Transpose acquired by Chainalysis, Satsuma acquired by Alchemy.
We’re reinventing the indexer (and no it’s not just because of reth).
We saw many new data explorers launch and old ones get beefed up.
A lot of new data is not yet indexable or hard to parse. I talked about these changes in my DuneCon talk this year.
With intents, ZK proof apps, account abstraction, flexible rollups, complex bridges, and data availability all getting more advanced, it’s not yet clear what the best way to index this data is (or who should be responsible for it).
I’ll dig into these trends by product in the data landscape section below.
2024 Landscape
I’ll explain each layer of the stack, but I won’t explain all the tools. You can read last year’s guide for a general product overview of the top tools. The companies shown are not always all-inclusive for a given category, but are the top ones in my opinion.
web2 counterparts
If you are new to web3 and this looks confusing to you, think of these web2 parallels:
indexing: amplitude, stripe, general logging services
explore: grafana, datadog
query: metabase, hex
define and store: dbt, snowflake, databricks
…and back onchain: this is just reverse ETL
Index: Read raw data from the blockchain
Blockchains run with nodes. Nodes run “client” code - which are repos that have implemented the EVM in some fashion. These clients have a set of implemented RPCs (API endpoints), some of which are standard and some of which are custom to support better/faster data querying. Data from RPC endpoints are direct reflections of the blockchain state - there are no external party changes or transformations here. Nodes-as-a-service providers run tons of nodes and offer them collectively through an API product, creating a stable data service.
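For a concrete feel of this layer, here is a minimal sketch of reading raw event logs straight from a node’s standard RPC interface (the endpoint is a placeholder; the topic hash is the real keccak hash of the ERC-20 Transfer event signature):

```python
import requests

RPC_URL = "https://your-node-provider.example.com"  # placeholder endpoint

# keccak256("Transfer(address,address,uint256)"), the standard ERC-20 topic
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"

payload = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "eth_getLogs",
    "params": [{
        "fromBlock": hex(18_500_000),
        "toBlock": hex(18_500_010),
        "topics": [TRANSFER_TOPIC],
    }],
}

logs = requests.post(RPC_URL, json=payload).json()["result"]
print(f"{len(logs)} Transfer logs in an 11-block window")
```

Everything downstream in this guide is, at bottom, transformation layered on top of calls like these.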
We’ve seen a ton of change in the indexing layer over the last year, with two new subcategories emerging:
Forks-as-a-Service:
Fork any contract, add events and calculations, and then pull this data from a new “forked” RPC/data service.
Some of the main providers for this are shadow.xyz, ghostlogs, and smlXL.
I gave my thoughts on the shortcomings and difficulties of this approach. I believe it will see adoption by some app teams who have the budget, but will be hard to integrate consistently across the data ecosystem.
Rollups-as-a-Service (RaaS):
The big theme of the year has been rollups, with Coinbase kicking it off by launching a rollup (Base) on the Optimism Stack (OP Stack) earlier this year.
Teams are building products specifically for running the nodes and sequencer(s) for your own rollup. We’ve already seen dozens of rollups launch.
New startups like Conduit, Caldera, and Astria are offering full stack rollup services. Quicknode announced a similar offering last month, and I expect Infura and Alchemy to do so as well in 2024.
Notable changes to existing products:
Alchemy and Quicknode expanded further into crypto native infra and data engineering infra. Alchemy launched account abstraction and subgraph services. Quicknode has been busy with alerts, data streaming, and rollup services.
Predictions for 2024:
We should see our first “intents” clients/services. Intents are a part of the modular stack, and are essentially transactions handled outside the mempool that have extra preferences attached. UniswapX and Cowswap both operate limit order intent pools, and should both release clients within the year. Account abstraction bundlers like stackup and biconomy should venture into intents as well. It’s unclear if data providers like Alchemy will index these “intents” clients, or if it will be like MEV where we have specialized providers like Blocknative and Bloxroute.
Another up-and-coming type of provider is the “all-in-one” service, which combines indexing, querying, and defining. There are a few products here such as indexing.co and spec.dev - I have not included them in the landscape since they are still nascent.
Explore: Quickly look into addresses, transactions, protocols, and chains
It’s common to jump between well crafted data dashboards and blockchain explorers iteratively, to help you identify a trend or build out a cohesive data pipeline.
Outside of etherscan, we now have a plethora of explorers to choose from:
More intuitive explorers like parsec, arkham, and onceupon, showcasing more metadata and charts for transactions and addresses
In-depth explorers like evm.storage, showcasing storage slots and memory stacks
Cross-chain explorers like Routescan (superscan), Dora, and Onceupon
Bridge explorers like socketscan and wormhole
MEV explorers like Eigenphi for transactions, mevboost.pics for bundles, and beaconcha.in for blocks
Nansen launched a 2.0 of their token and wallet tracking product, with cool new features like “smart segments”
The dashboard layer hasn’t changed much. It’s still the wild west here. If you spend a day on Twitter, you’ll see charts from dozens of different platforms covering similar data but all with slight differences or twists. I believe that verification and attribution are becoming bigger issues, especially now with the large growth in both teams and chains. I don’t expect this to be solved in 2024; there just isn’t enough incentive for platforms to do anything about it yet.
Predictions for 2024:
Marketing-specific address explorers are going to be on the rise. Teams like spindl and bello will lead the way here.
Cross-chain explorers (and pre-chain like MEV/intents) will see expansion in development.
Across platforms, I think wallets are still poorly labelled and tracked, and it’s getting worse with intents and account abstraction. I don’t mean static labels like “Coinbase” but more dynamic ones like “Experienced Contract Deployer”. I know some teams are trying to tackle this, such as walletlabels, onceupon context, and syve.ai. It will also improve naturally alongside web3 social, which is growing mainly on Farcaster. I do expect a lot of improvement here in 2024.
Query: Raw, decoded, and abstracted data that can be queried
Most SQL query engines are cloud-based, so you can use an in-browser IDE to query against raw and aggregated data (like nft.trades/dex.trades). These also allow for community definition of curated tables, such as NFT wash trading filters. All these products come with their own APIs to grab results from your queries.
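As one concrete example, Dune’s HTTP API follows an execute-then-poll flow, roughly like the sketch below (the query ID is hypothetical, and endpoint details can change, so check the current API docs):

```python
import os
import time
import requests

API = "https://api.dune.com/api/v1"
HEADERS = {"X-Dune-API-Key": os.environ["DUNE_API_KEY"]}
QUERY_ID = 1234567  # hypothetical saved-query ID

# kick off an execution of the saved query
execution = requests.post(f"{API}/query/{QUERY_ID}/execute", headers=HEADERS).json()

# poll until the execution finishes (real code should add a timeout)
while True:
    results = requests.get(
        f"{API}/execution/{execution['execution_id']}/results", headers=HEADERS
    ).json()
    if results.get("state") == "QUERY_STATE_COMPLETED":
        break
    time.sleep(5)

print(results["result"]["rows"][:5])
```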
GraphQL APIs here let you define your own schemas (in TypeScript or SQL), then generate a GraphQL endpoint by running the full blockchain history through your schema.
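Consuming one of these generated endpoints is a plain HTTP POST; here is a sketch against a hypothetical subgraph (the URL and entity/field names are illustrative, not a real deployment):

```python
import requests

# hypothetical subgraph endpoint; real URLs come from your indexing provider
SUBGRAPH_URL = "https://api.example.com/subgraphs/name/your-org/your-protocol"

query = """
{
  swaps(first: 5, orderBy: timestamp, orderDirection: desc) {
    id
    amountUSD
    timestamp
  }
}
"""

resp = requests.post(SUBGRAPH_URL, json={"query": query}).json()
for swap in resp["data"]["swaps"]:
    print(swap["id"], swap["amountUSD"])
```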
For predefined APIs (where you query prebuilt schemas), there are a ton of niche data providers that I have not included in my chart, covering domains like mempool, NFT, governance, orderbook, prices, and more. You can check out my friend Primo’s site to find those ones.
Holistically, every platform has gotten a lot more efficient with its infra (meaning your queries run faster). Most platforms explored advanced methods of getting data out, like ODBC connectors, data streams, S3 parquet sharing, BigQuery/Snowflake direct transfers, etc.
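The parquet path in particular is pleasant for analysts: once a provider shares an export, pulling it into a notebook is one line (the bucket path here is hypothetical, and reading s3:// paths with pandas requires the s3fs package):

```python
import pandas as pd

# hypothetical shared-export location; your provider gives you the real path
df = pd.read_parquet("s3://provider-exports/dex_trades/2023-12.parquet")

print(df.shape)
print(df.head())
```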
I think I speak for most of us when I say a lot of our time went into just trying to keep up with Solana and L2s this year.
Notable changes to existing products:
Query engines like Dune and Flipside have accepted there is more data than can possibly be ingested in custom data pipelines, and have launched products that allow the user to bring in that data instead. Flipside launched livequery (query an API in SQL) and Dune launched uploads/syncs (upload csv or api, or sync your database to Dune directly).
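The upload side is similarly just an HTTP call. Here is a sketch of Dune’s CSV upload as documented at the time of writing (the table name and rows are made up; verify the endpoint against the current docs):

```python
import os
import requests

# made-up offchain data to join against onchain tables
csv_data = "wallet,segment\n0xabc,power_user\n0xdef,new_user\n"

resp = requests.post(
    "https://api.dune.com/api/v1/table/upload/csv",
    headers={"X-Dune-API-Key": os.environ["DUNE_API_KEY"]},
    json={"table_name": "my_wallet_segments", "data": csv_data},
)
print(resp.json())
```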
Our favorite decentralized data child, the Graph, has had to really beef up their infra to not lose market share to centralized subgraph players like goldsky and satsuma (alchemy). They’ve partnered closely with StreamingFast, separating out the “reader” and “relayer” of data, and also introducing “substreams”, which allow you to write Rust-based subgraphs across chains.
Predictions for 2024:
I believe that no provider here is truly ready for the rollup world yet, either in terms of scaling ingestion or fixing cross-chain schemas/transformations. And by not ready, I mean not ready for the case of 500 rollups launching in a week. I expect that some platform will figure out how to scale for this by late 2024.
LLM query/dashboard products like Dune AI will start to gain stronger traction in certain domains, such as wallet analysis or token analysis. Label datasets will play a strong part in enabling this.
Define and Store: Create and store aggregations that rely on heavy data transformations
Raw data is great, but to get to better metrics, you need to be able to standardize and aggregate data across contracts of different protocols. Once you have aggregations, you can create new metrics and labels that enhance everyone's analysis. I have only included products that have active contribution from both inside and outside the platform’s team, and are publicly accessible.
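Conceptually, an abstraction like dex.trades boils down to the pattern below: map each protocol’s raw schema onto one shared schema, then union. (This toy pandas sketch uses made-up columns; in Spellbook the same idea is expressed as dbt SQL models.)

```python
import pandas as pd

# made-up decoded event tables for two hypothetical DEXes
dex_a = pd.DataFrame({"tx": ["0x1", "0x2"], "amountUSD": [100.0, 250.0]})
dex_b = pd.DataFrame({"tx_hash": ["0x3"], "usd_value": [75.0]})

# normalize each protocol's columns onto one shared schema
a = dex_a.rename(columns={"tx": "tx_hash", "amountUSD": "amount_usd"})
a["project"] = "dex_a"
b = dex_b.rename(columns={"usd_value": "amount_usd"})
b["project"] = "dex_b"

# the "spell": one unioned table anyone can aggregate over
dex_trades = pd.concat([a, b], ignore_index=True)
print(dex_trades.groupby("project")["amount_usd"].sum())
```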
The collaborative layer of data definition has not really evolved over the last year. Product teams and engineering are barely keeping up as is. To give a sense of growth rate here, in the month of December 2023:
Defillama adapters saw 230 pull requests from 150 contributors across 559 files out of about 3700 total files.
Dune spellbook saw 127 pull requests from 42 contributors across 779 files out of about 3700 total files.
You can probably consider every 2-3 files to be a new protocol/table. So 200+ tables shifting around every month, and that number is only increasing with new chains and startups. If you’ve never worked in GitHub or data tables before, let me tell you that’s a ton of work to manage.
On the data warehouse side, in-platform warehousing is something Dune is actively expanding into. I like to think of Dune now as “fit your ETL pipeline into one query”: upload your data, create materialized views, schedule queries/dashboards, and set up alerts, all in one place. If that sounds interesting to you, go reach out to Niles, Mats, or me on Twitter about it.
Predictions for 2024:
Shouldn’t be anything surprising here, just each platform building up the features that already exist on older data platforms like Snowflake and Databricks.
Back Onchain: Putting the data back into contracts using Zero Knowledge (ZK) tech
This layer is completely new, and still very nascent. The idea here is that you can prove some data or computation was correctly collected using a “ZK circuit”, and then post that proof onchain (sometimes with outputs, depending on the application). You can get a sense of how this works by trying to create your own simple identity proofs yourself. This presentation from the summer also gives a good summary of these “ZK coprocessors”.
For historic data access and compute, the main players for now are herodotus, axiom, succinct, lagrange, and spaceandtime. They each have slightly different approaches to their prover and ZK stack, which does impact the type of data and type of calculations that you could verify and post using each tool.
ZK machine learning is the idea that you can prove that a certain inference came from a certain model that was trained on certain data. EZKL is the backbone of most ZK machine learning stacks right now, other than modulus, who is building their own custom ML model prover system. Ritual, Giza, and Spectral all use EZKL for the model-proof portion of their stack for now.
ZK machine learning is not as simple as deploy-and-run (you can’t just host it on Kubeflow), because you now need provers as part of the stack. Products like gensyn, blockless, and other AVS providers are working on forming a prover/compute marketplace.
Predictions for 2024:
I believe that ZK is poised to really take the spotlight in apps in 2024. I know that’s kind of like saying “web3 gaming is coming”, as people have been saying it for years. But I really do believe it is coming this year!
To get a good sense of where ZK fits into the future and some of the surrounding technologies enabling it, read these three articles:
An Aside on Solana Data
Solana will need to be its own article next year, but all you need to know for now is:
Explore:
Use solscan.io or solana.fm for transactions and programs
Use step.finance for wallets
Most of the same dashboard services in EVM also include Solana data
Indexers: Use Helius or Quicknode (see the raw RPC sketch after this list).
Helius has enhanced API endpoints for stuff like token balances. The company is not to be confused with Helium, the decentralized telecom network.
You might have heard of firedancer, but that isn’t likely to come out until late this year.
Query: Use Dune or Flipside for queries and dashboards
Dune has fully decoded Solana tables.
If you need graph data science support, go ask the Allium team for help
Here’s a guide I wrote to getting started with Solana, for both beginners and experts.
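If you want to poke at the raw layer underneath providers like Helius, Solana’s standard JSON-RPC works much like the EVM examples earlier (the endpoint is a placeholder; the address is the real system program, used here only as a demo):

```python
import requests

RPC_URL = "https://your-solana-rpc.example.com"  # placeholder provider endpoint

payload = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "getBalance",
    "params": ["11111111111111111111111111111111"],  # system program address
}

lamports = requests.post(RPC_URL, json=payload).json()["result"]["value"]
print(lamports / 1e9, "SOL")
```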
Predictions for 2024:
Helius changes its name
Compressed data becomes more and more prevalent
Solana data gets magnitudes easier to index, decode, and analyze
Cheers to another year!
Thanks for reading this far, I hope you’ve learned something new and found a few new tools to try out. I look forward to continuing to build alongside you all.
My DMs are always open if you are building a new tool in the data space or are getting lost while building your own data pipeline/analysis.
Want to join an onchain data community? I’ve created an NFT passport and quest system that any analyst can participate in.
🔍 Learn about the Bytexplorers, then mint your passport and join in