How To Analyze Bitcoin Data With SQL
Learn how Bitcoin works, and how to analyze block statistics, transaction values, active addresses, and address balances using Dune!
Want to join an onchain data community? I’ve created an NFT passport and quest system that any analyst can participate in.
🔍 Learn about the Bytexplorers, then mint your passport and join in
Are you here because you’re interested in Ordinals? Check out this guide!
Starting today, you can play with Bitcoin data on Dune. I will explain the basic concepts you need to know and teach you how to leverage our tables to analyze block statistics, transaction values, active addresses, and address balances. We’ll also cover some more technical bits on the evolution of address types, scripts, and signatures.
All the charts in this article will be in this dashboard. 📊
There are four tables: blocks, transactions, inputs, and outputs. Each table and column is defined in the Dune docs. If you aren’t sure exactly what a column I use means, refer to the docs. I’m not going to cover every column in this article, just the ones you need to know how to work with.
I’m assuming you already have basic SQL knowledge; if you don’t, you can reference this article.
Blocks
A block is a set of transactions that have been confirmed. They’re supposed to happen every 10 minutes, and the miner who confirms the block gets a reward in BTC (and all the fees paid by transactions).
We’re going to use only the bitcoin.blocks
table for this section. Below is the query we will use for every chart in this section. We take a subquery first to get the time between blocks block_pace
and to set a constant line to signify the block size limit before and after the SegWit upgrade (explained soon). We then take the per-block averages/sums for a given time granularity (week, month, year) to see the stats over time.
Let’s start by explaining mint rewards and transaction fees
The first transaction of every block mints rewards to the miner (mint_reward
), and is known as the coinbase transaction.
There is a halving of mint rewards every 210,000 blocks (4 years). We’re currently at block 771,515, so rewards have decreased to 6.25 (original 50 BTC reward divided by two, three times now). The next decrease (to 3.125) will occur around March 2024.
Weekly BTC rewards from both fee and mint are currently around 6.5k (or $125 million at today’s price of $19k/BTC).
We calculated cumulative supply and fees using some window functions, using the last query as a CTE (i.e., querying the results of that subquery).
From here, we can visualize how close the total supply is to the 21 million BTC max, and how the price has trended as the mint rate has decreased.
As well as the total fee revenue for miners (coming out to only 258k BTC as of Jan 2023).
Mining pace and difficulty go hand in hand.
For mining, the hash rate (compute power) required to solve a block changes dynamically by following a difficulty
parameter. The difficulty
is adjusted based on if the blocks are mined over or under the 10-minute block target.
You can see a large dip during the China mining ban in June 2021. Hash rate is also calculated in the query using the formula “hash rate (hash/s) = difficulty * 2^32 / 600”, though I don’t show it since it’s the same curve as difficulty.
Block sizes: There are limits to the amount of data that fits in a block, which then limits the number of transactions.
Before SegWit (2017), each block was limited to 1MB of data (1 million bytes), equating to roughly 1650 transactions in a full block. SegWit separated signature data from transaction data so that we now have stripped_size
in bytes. Alongside this, we were introduced to block weight, which replaced block size as the limiting factor for a block. Basically, instead of size
being limited to 1MB, it’s now:
stripped_size
* 3 + size
= weight
Where the weight must be less than 4 million. This size weighting is applied on the transaction level as virtual_size
, you can read more about this here.
Note that we’re using stripped_size
here for the raw bytes size measurement (shaded blue), since that’s the more canonical measure of byte size of blocks now.
You can see that the upgrade happened right as Bitcoin started to hit the 1MB limit and that even after the upgrade, there have been periods of time where the 4MB weight limit was hit.
Transactions
Here we’ll walk through bitcoin.transactions, bitcoin.inputs, and bitcoin.outputs.
Understanding the Unspent Transaction Outputs (UTXO) Logic
First off, there’s no notion of a “sender” at the transaction level. Everything happens in each transaction's inputs (with signatures) and outputs. You’ll see there are varying counts of input and output arrays for each transaction - most of which are tied to an address.
Bitcoin wallets consist of a public/private key to generate an address, which can then hold ownership over signatures that hold ownership of tokens. Each token has a full lineage of transfer history (so they aren’t technically fungible) - this makes things a little harder to understand. To understand a transaction, you first need to understand how these token values flow - specifically, the concept of unspent transaction outputs (UTXOs).
So each set of inputs has signatures attached, proving ownership of older outputs. This lineage goes all the way back to each mint (coinbase) transaction! It might still feel confusing, so let’s look at the data.
We’ll look at this transaction. Let’s filter for this transaction from the bitcoin.transactions table, and unnest(input)
to get all inputs (you could also join on bitcoin.inputs where transactions.id
= inputs.tx_id
).
If you look at the results, you’ll see there are two input values of 0.0104 and 0.165 we pull in with the join to get the referenced output tx_id’s value
. These sum up to the total input_value
from the transactions table because we already did the summation for you. The same logic applies to output
column. If you unnested that or joined bitcoin.outputs where transactions.id
= outputs.tx_id
, you’d find the individual value’s sum to the output_value
in the transactions table.
Each transaction’s fees can be calculated by subtracting the total output value from the total input value. We have also already calculated the fee in the fee
column.
A quick note,
sequence
is used for increasing fees to speed up transactions. The default is 4294967295. You can read on this further if you’re interested in relative versus absolutelock_time
to control when your transaction can be included in a block.
Probably the most confusing for you will be which id
column to reference. To connect an input to the output it’s referencing, use the spent_tx_id
column. If a tx_id
in the bitcoin.outputs table is not found in the bitcoin.inputs table under the spent_tx_id
column, then that means it hasn’t been used yet. Each output (UTXO) can only be referenced (spent) once. The tx_id
column references the transaction they originated from for both the inputs and outputs tables.
So hopefully, you now understand how transactions, inputs, and outputs are tied together - let’s focus on transactions and outputs and calculate the total volume of outputs and coin days destroyed.
Output Volumes Over Time: Technically this is an inflated value. This is because if I spend from an input with 100 BTC, and send 2 BTC to someone else, I get two outputs where 98 BTC goes back to me and 2 BTC goes to someone else. You’d need to check address balance differences per block to get more exact on volumes, I’ll do that on a later date.
Coin Days Destroyed (CDD): Since we can track age of UTXOs, we can give it a weighted age when it is spent. For example, A UTXO for 2 BTC dormant for 100 days has accumulated 200 coin days. The CDD metric captures both coin lifespan and coin volume - coins held in cold storage as a long-term store of value are considered economically important when spent as it signals a notable change in long-term holder behavior.
You’ll see in this second chart that it’s mostly smoother than the first - but the volatile peaks where old UTXOs were spent are much more pronounced.
Understanding Bitcoin Scripts and Signatures
Let’s talk about the more confusing technical concepts related to bitcoin.inputs. This will cover the script
, script_signature
, and witness_data
columns (on block explorers you might see pkscript
, sigscript
, and witness
respectively). More generally, these columns are where “smart contracts” are built.
There’s script
columns on inputs and outputs, and these are where Bitcoin assembly takes place. There are many different kinds of script op_codes, and you can read about how they’re ordered together typically here. I’ll cover this more in the addresses section since you need more context on address types first.
You may have noticed that script_signature
is null on a lot of new transaction inputs, and instead witness is populated. This witness_data
(signatures) are now stored separately from transaction data such that the size is smaller.
SegWit is supposed to have empty script_signature
data so that the transaction isn’t malleable after signing (even if it hasn’t been confirmed on chain yet), which is why you can’t send tokens from a SegWit wallet to an older wallet. This upgrade was required for lightning networks to operate safely.
It’s taken about 5 years, but 75% of transactions are now SegWit transactions.
Signatures have continued to evolve, and the Schnorr, Taproot, and Tapscript upgrades (BIP 340, 341, and 342) were activated and enforced at block 709,632 on November 12th, 2021. The TLDR is that Schnorr signature combines multiple signatures into one, and taproot allows for Merkle trees of scripts so that you can batch many transactions into one (P2TR).
This ultimately means more complex scripts and lower fees, since they’re all batched together into one transaction. All signatures can be decoded by version as well. Script and signature decoding is fairly advanced and something I haven’t played with enough yet. If you do dive into it and figure things out, please share your learnings with me 🙂
For a much better explanation of signatures than I can provide in writing, check out this youtube video.
There are ongoing arguments on how to use these signature innovations for privacy preservation (different from zk-proofs). Other efforts include CoinJoin (think Tornado Cash).
Addresses and Balances
Active Addresses
To get active addresses, we take approx_distinct(address)
on the inputs table.
Here granularity is set to week, so weekly active addresses have fluctuated around 4m before dropping in the last month or so.
Calculating wallet balances
For this, you need to sum up all the UXTOs that haven’t been spent based on address
. This means we check for all transaction outputs whose tx_id
are not in the transaction inputs table, including filtering on the index of the output id.
In this query, we’re checking for outputs that haven’t been spent in some future input. If it hasn’t been spent, then that address still has ownership of the tokens. We also get each wallet's value-weighted age of UTXOs in days (same idea as CDD).
There are some nuances that cause addresses to be null - we’ll get to that next.
Understanding all the address types
You’ll notice that there’s a type
column on inputs and outputs tables. There are many different types of addresses on Bitcoin, and they have evolved to make the protocol more extensible and cheaper. Here’s a query showing the unique addresses by type over time:
The breakdown of addresses by outputs (UTXOs received) is 50% SegWit (witness prefixed), 29% scripthash, and 21% legacy pubkeyhash.
I’ll describe each address type below, and how to identify them:
Let’s look start with the basic types:
pubkeyhash: These are the first implementation of addresses, and they start with 1 (and are base58 encoded).
scripthash: These are p2sh addresses starting with 3, and are how script-based multisigs and lightning networks (2/2 multisigs) are implemented.
some script wrappers (P2SH-P2WPKH and P2SH-P2WSH) allow older wallets to pay SegWit wallets.
The following are bech32 addresses (all start with bc1)
witness_v0_keyhash: P2WPKH - this is SegWit native.
witness_v0_scripthash: P2WSH - this is SegWit native.
witness_v1_taproot: taproot is the newest implementation that is fully backward compatible (payments-wise).
You’ll notice some address types are null:
pubkey: early on, they used the full raw public key, so there wasn’t the concept of addresses we have today. You could use this to extract something, but it’s technically not “real” due to a missing version byte and checksum.
nonstandard: these are custom scripts that must be decoded manually
multisigs: Depending on the script, getting the address out of the
script
requires a base58 encoding function that isn’t supported in DuneSQL currently. This differs from a scripthash multisig in that a set of private keys is actually associated.nulldata: this is a transaction that reverted with 0x6A (OP_RETURN)
Going deeper on the
script
columns; these are often used to support different payment types (those weird acronyms you just saw). Here are some examples that are identifiable by their respective first bytes, in case you want to dive deeper:Pay-to-Public-Key-Hash (P2PKH): 0x76 (OP_DUP) 0xA9 (OP_HASH160)
Pay-to-Script-Hash (P2SH): 0xA9 (OP_HASH160)
Pay-to-Public-Key (P2PK): 0x41 (OP_DATA_65)
Multi-Sig (P2MS): 0x52 (OP_2) for 2 signers, 0x53 (OP_3) for 3 signers, or more (designated as
n
). The second byte gives you the number of total members.Pay-to-Witness-Public-Key-Hash (P2WPKH): 0x00 (OP_0)
Pay-to-Witness-Script-Hash (P2WSH): 0x00 (OP_0)
Pay-to-Taproot (P2TR): 0xA5 (OP_CHECKTEMPLATEVERIFY). This is still being experimented with; you can check some examples here.
Null Data (OP_RETURN): 0x6A (OP_RETURN)
Congrats on learning how to work with Bitcoin data!
There you have it! You should now be familiar with the basics of bitcoin. If you want to continue your analysis, try creating metrics like the ones in the indicators section.
Once you’ve done so, you can request an addition to this dashboard by DMing me on Twitter - you’ll get a shoutout (at the very least)! I’m really looking forward to your submissions. Even if you get stuck, feel free to reach out. Let’s build the best Bitcoin metrics together 💪
Want to try some more advanced address/script analysis? Try getting Ordinal mint data using this guide.
If you’ve found an error with anything in this article, please email or DM me so I can fix it. 🙏
The "Bitcoin Balances" query only checks WHERE i.spent_tx_id = o.tx_id to discard spent outputs.
Shouldn't it consider i.spent_tx_id = o.tx_id AND o.index = i.spent_output_number ?
The input could be spending another output from the same transaction!