Learning to Work With Ethereum Storage and State Data
Breaking down the traces RPC and storage layouts, with queries!
Most explanations of storage and state on Ethereum are diagram heavy and don’t get into the meat and bones of how to access or work with that data. I’m going to avoid the diagrams and theoretical explanations, but you should still have some understanding of data tries so skim this and this if you know absolutely nothing about how data gets into blocks.
Essentially, what you need to know is that every time you make a transaction with a contract involved, there will be state and storage changes saved to the block. State changes involve things like an addresses’ Ether balance or Nonce, storage changes involve storage slots on contracts such as your balance of ERC20 tokens in a contract.
To access these changes from an Ethereum node, you’ll need to use a replay VM - meaning you have to replay a transaction or block to get the raw diffs in state and storage. This is easily done using Infura or Alchemy, without running your own node. You can go to Alchemy’s API composer to play with the trace_replayTransaction
endpoint (here). If you want to read more about how traces works, read the standards here - but I’ll run through one example.
Let’s use this ERC20 transfer, whose contract code includes the following:
function transfer(address to, uint256 value) external returns (bool) {
_transfer(msg.sender, to, value);
return true;
}
function _transfer(
address from,
address to,
uint256 value
) private {
balanceOf[from] = balanceOf[from] - value;
balanceOf[to] = balanceOf[to] + value;
emit Transfer(from, to, value);
}
You can see that there are two storage changes, the balance of the from
address and the balance of the to
address. Here’s your JSON rpc request:
{
"jsonrpc":"2.0"
"id":0
"method":"trace_replayTransaction"
"params":[
0:"0x8e5cd4d97881b558b7c4a120c41193b944db0612bc45807ae056d5488b82700e"
1:[
0:"trace"
1:"stateDiff"
]
]
}
And here’s the response you’d get from Alchemy, I’ll explain it briefly here:
state diffs for
0x320d83769eb64096ea74b686eb586e197997f930
are balance changes in the mapping ofbalanceOf
state diffs for
0x52bc44d5378309ee2abf1539bf71de1b7d7be3b5
are addition to balance of gas fees being paid to minerstate diffs for
0xdad3fd6c9fb0c2b56228e58ae191b62bfb1bec83
are subtraction from balance of gas fees being paid by the sender for this transferthe first trace action is the actual
transfer()
callthe second trace action is a
delegatecall
because a proxy pattern is used for this contract, and the logic is stored somewhere else (like forwarding an email).
{
"jsonrpc": "2.0",
"id": 0,
"result": {
"output": "0x0000000000000000000000000000000000000000000000000000000000000001",
"stateDiff": {
"0x320d83769eb64096ea74b686eb586e197997f930": {
"balance": "=",
"code": "=",
"nonce": "=",
"storage": {
"0x108f03599a018b41696aad07e2100369d10122c70bf27d8e9fe3e425885a246b": {
"*": {
"from": "0x0000000000000000000000000000000000000000000000357bdcd5b903c454ce",
"to": "0x0000000000000000000000000000000000000000000000000000000000000000"
}
},
"0xdc7fafdc41998a74ecacb8f8bd877011aba1f1d03a3a0d37a2e7879a393b1d6a": {
"*": {
"from": "0x000000000000000000000000000000000000000000000215d691e79c0df07a6f",
"to": "0x00000000000000000000000000000000000000000000024b526ebd5511b4cf3d"
}
}
}
},
"0x52bc44d5378309ee2abf1539bf71de1b7d7be3b5": {
"balance": {
"*": {
"from": "0xfdc72883cded567d93",
"to": "0xfdc728a1a08db20db3"
}
},
"code": "=",
"nonce": "=",
"storage": {}
},
"0xdad3fd6c9fb0c2b56228e58ae191b62bfb1bec83": {
"balance": {
"*": {
"from": "0x59e417e4aa86a74",
"to": "0x599f683b630a274"
}
},
"code": "=",
"nonce": {
"*": {
"from": "0x159",
"to": "0x15a"
}
},
"storage": {}
}
},
"trace": [
{
"action": {
"from": "0xdad3fd6c9fb0c2b56228e58ae191b62bfb1bec83",
"callType": "call",
"gas": "0x938c",
"input": "0xa9059cbb000000000000000000000000000000000000000000000000000000000000dead0000000000000000000000000000000000000000000000357bdcd5b903c454ce",
"to": "0x320d83769eb64096ea74b686eb586e197997f930",
"value": "0x0"
},
"result": {
"gasUsed": "0x45d0",
"output": "0x0000000000000000000000000000000000000000000000000000000000000001"
},
"subtraces": 1,
"traceAddress": [],
"type": "call"
},
{
"action": {
"from": "0x320d83769eb64096ea74b686eb586e197997f930",
"callType": "delegatecall",
"gas": "0x7e24",
"input": "0xa9059cbb000000000000000000000000000000000000000000000000000000000000dead0000000000000000000000000000000000000000000000357bdcd5b903c454ce",
"to": "0x5b5746f6f5e2db8bf5e260829ca7a004c876b167",
"value": "0x0"
},
"result": {
"gasUsed": "0x323b",
"output": "0x0000000000000000000000000000000000000000000000000000000000000001"
},
"subtraces": 0,
"traceAddress": [
0
],
"type": "call"
}
],
"vmTrace": null
}
}
vmTrace is a deeper trace, which tells you the order in which opcodes were executed by the EVM. I won’t cover that in this article (mainly since I am not familiar enough with those yet lol)
You can check the balanceOf
changes using hex to num here, remember to div by 1e18. If you’re curious about how subtraces work, they are nested function calls:
subtraces = number of sub-calls by function
type =
call
(normal call),staticcall
(doesn’t mod state),delegatecall
(proxies calling logic),create
(contract deploys)traceAddress = order of call execution, see example below from open Ethereum:
A
CALLs B
CALLs G
CALLs C
CALLs G
then it should look something like:
[ {A: []}, {B: [0]}, {G: [0, 0]}, {C: [1]}, {G: [1, 0]} ]
A good tool to use is the Versatile block explorer, where you can check the “Invocation Flow” for a more detailed view of traces than Etherscan provides.
Don’t worry too much if that doesn’t all make sense, we’ll dive into the data here using a data warehouse instead of RPC calls from here on out.
Accessing State and Storage Data using Ethereum Data Warehouse (EDW)
EDW is events, calls, transactions, state, and storage data in a Snowflake shared database. You can get setup here with a custom app or just start querying on Flipside by changing the schema to tokenflow_eth
. There are some example queries here. I’m assuming you’ve already used something like Dune Analytics or Flipside, so I’ll only cover the new tables.
state_diffs: use this for tracking ETH balances. here are the main state fields and reasons they change:
I like to use this table for wallet balances analysis - much faster than trying to aggregate on ETH calls from traces.
storage_diffs: tells you current and last value for different storage slots, with mappings nicely nested already in dict formats.
Contract storage layout comes from solc compiler. you can use a hardhat plugin to do this to get the following on each compile (or the raw JSON output):
This table is useful for tracking variables on contracts, most commonly would be stuff like historical balances.
storage_reads: this tells you when a certain slot of storage was read during a transaction.
useful for assessing contract interoperability/dependencies, when joined on
call_id
andtx_hash
with traces (calls)exploring what variables are read when a certain function is called, when joined on transactions with a selected funcsig in
call_data
.sometimes it's easier to use calls table (traces) when the value you want is in a function output, instead of the storage variables.
Example Queries: Looking at On-chain Price Oracles
I used the data outputted from these queries in further analysis, you can read and play with the code in the repo here.
Let’s run through a harder example than just balances and transfers: looking at price oracles on-chain. If you have no familiarity with Oracles, check out this explainer and this repo of older oracles.
tldr; Chainlink pushes prices from the real world on-chain so smart contracts can access it, Uniswap V2 and V3 have methods for calculating the token price purely on-chain. There are likely differences in prices at any given moment for a variety of reasons.
All the queries I’m going through below are public here, you can run them once you have EDW set up - just remember to change EDW_SNOWFLAKE_SECURE_SHARE_1645732350028
to whatever your secure share instance name is! I also set up udfs on a new database in Snowflake, since you can’t create udfs in a shared db.
Chainlink Query
Let’s start with Chainlink data feeds! The contract is set up such that the proxy itself doesn’t store historical prices - it calls aggregator contracts instead. There have been five such aggregators used for the ETH/USDC price feed, which I queried for here:
SELECT * --use curr_value for aggregator addresses to use in the next query
FROM EDW_SNOWFLAKE_SECURE_SHARE_1645732350028.raw.storage_diffs
WHERE contract = lower('0x5f4eC3Df9cbd43714FE2740f5E3616155c5b8419') --ETH/USDC proxy aggregator, ref https://data.chain.link/popular
AND location LIKE '4[%].0'
ORDER BY timestamp ASC;
I queried for any address in slot 4 mapping. Mappings in EDW are stored as slot[value].keys where keys apply to structs in mappings. Here there is no struct, so we take .0
instead. The variable this tracks is:
"4": [
{
"from": 0,
"name": "phaseAggregators",
"to": 31,
"type": "mapping(uint16, address)"
}
],
Then, I plugged the five addresses manually into a query over the calls
table, so I could get the values from each time the proxy contract called an aggregator.
SELECT
distinct
timestamp,
block,
udf.public.hextoint(return_value)/1e8 as eth_price
FROM EDW_SNOWFLAKE_SECURE_SHARE_1645732350028.raw.calls
WHERE from_address= lower('0x5f4eC3Df9cbd43714FE2740f5E3616155c5b8419') --ETH/USDC proxy aggregator
AND to_address IN ('0x37bc7498f4ff12c19678ee8fe19d713b87f6a9e6', '0xd3fcd40153e56110e6eeae13e12530e26c9cb4fd', '0xb103ede8acd6f0c106b7a5772e9d24e34f5ebc2c', '0xf79d6afbb6da890132f9d7c355e3015f15f3406f', '0xc7a37b03690fb9f41b5c5af8131735c7275446') --historical phaseaggregators for ETH/USDC as of 3/1/2022
AND substring(call_data,1,10) = '0x50d25bcd' --funcsig for getting latestRoundData
AND block >= 12381343
AND status = 1
ORDER BY 1 ASC;
Chainlink oracle usage queries are just an aggregation change, so I’m not including it here.
Uniswap V2 Query
Uniswap V2 pairs track token prices separately, in variable price1CumulativeLast
and price0CumulativeLast
for token1 and token0 respectively (and in storage slots 10 and 9 respectively)
For this pair, I queried storage at slot 10 and parsed curr_value
into an integer:
SELECT
block,
timestamp,
-- location,
-- tx_hash,
udf.public.hextoint(curr_value) as cumulative_last
FROM EDW_SNOWFLAKE_SECURE_SHARE_1645732350028.raw.storage_diffs
WHERE contract = lower('0xB4e16d0168e52d35CaCD2c6185b44281Ec28C9Dc')
AND location = '10' -- token1 price slot, which is WETH
AND block >= 12381343
AND status = 1
ORDER BY block, timestamp, order_index;
The rest of the data processing to get price was done in Python (see repo from top of section). V2 Oracle usage queries are basically the same as in v3, so I won’t go over them here.
Uniswap V3 Query
Uniswap V3 is much more complex, because of two reasons:
prices are tracked in ticks now
historical prices are stored in the contract now to make accessing them easier, but this means dealing with a struct array. This is the
observations[]
array, created using the Oracle.sol library and implemented on the pair contract.
Because the array is the last slot initialized in the contract (the 9th slot), we know the index of the array is slot - 8
. Structs are stored in reverse and tightly packed within each slot, such that for the struct:
struct Observation {
uint32 blockTimestamp;
int56 tickCumulative;
uint160 secondsPerLiquidityCumulativeX128;
bool initialized;
}
The slot can be decoded into: first two bytes have initialized
then 40 for secondsPerLiquidityCumulativeX128
, 14 for tickCumulative
and 8 for blockTimestamp
. The query I used was:
SELECT timestamp,
location::integer - 8 as index,
-- substr(curr_value, 3, 2) = '0x01' as initialized,
-- hextoint(substr(curr_value, 5, 40)) as secondsPerLiquidityCumulativeX128,
hextoint(substr(curr_value, 45, 14)) as tickCumulative
-- hextoint(substr(curr_value, 59, 8)) as blockTimestamp
FROM EDW_SNOWFLAKE_SECURE_SHARE_1645732350028.raw.storage_diffs
WHERE contract = lower('0x8ad599c3A0ff1De082011EFDDc58f1908eb6e6D8')
and location not like '_[%' and location::integer > 8
AND block >= 12381343
AND status = 1
ORDER BY block asc, order_index asc;
The rest of the data price processing was done in Python.
I also made a query to get daily reads/unique contracts reliant on the Uniswap V3 oracle, found here. First, I had to filter out transactions where there was a call to mint, burn, swap, or flash in the pool. Then, I joined the storage_reads
table with the calls
table on call_id
and tx_hash
to figure out which contracts were reading the tickCumulative
data values, presuming they did so to get ETH prices.
WITH
--ignore calls that affect observation state as that doesn't count as oracle usage
ignore_interactions as (
SELECT
*
--distinct tx_hash
FROM EDW_SNOWFLAKE_SECURE_SHARE_1645732350028.raw.calls
WHERE to_address = lower('0xB4e16d0168e52d35CaCD2c6185b44281Ec28C9Dc')
AND substring(call_data,1,10) IN ('0x6a627842','0x89afcb44','0x022c0d9f') --mint, burn, swap
)
-- query top contracts using calls instead of txs, we count distinct tx_hash since otherwise reads might be cheating with multiple reads per oracle use.
SELECT
call.from_address ,
COUNT(distinct call.tx_hash) as total_times_queried
FROM EDW_SNOWFLAKE_SECURE_SHARE_1645732350028.raw.storage_reads read
LEFT JOIN EDW_SNOWFLAKE_SECURE_SHARE_1645732350028.raw.calls call
ON call.tx_hash = read.tx_hash AND call.call_id = read.call_id
WHERE contract = lower('0xB4e16d0168e52d35CaCD2c6185b44281Ec28C9Dc')
AND location = '10'
AND read.status = 1
AND read.block >= 12381343
AND NOT EXISTS (SELECT 1 FROM ignore_interactions ii --ignore non-oracle interactions
WHERE ii.tx_hash = read.tx_hash)
GROUP BY 1
ORDER BY total_times_queried DESC
LIMIT 100;
*I highly encourage you to check out the longer deep dive into oracle prices here to get a sense of how this data is then applied in real analysis.*
Concluding Thoughts
I believe working with storage and state data is harder than events/function call data at the start because you need more familiarity with Solidity. Many variables are encoding in weird ways to make manipulation easy between smart contracts - so they really aren’t human interpretable at all. There are also a bunch of different storage patterns that people use, which take time to get used to.
However, this data does provide two primary benefits:
allow you to avoid a ton of expensive event/call aggregations, which will become increasingly important for wallet analysis in the future.
it captures cross-contract interoperability and dependencies. I think this enables us to do much deeper ecosystem analysis and risk analysis, as well as more accurate simulations.
I can’t wait for the production release of this product, huge kudos to the TokenFlow insights team (and thanks a bunch for answering all my questions! 🙂 )