From Reading and Indexing to Analysis: A Brief Overview of the Web3 Data Indexing Track

9/27/2024, 3:41:22 PM
Intermediate
Technology, AI
This article explores the evolution of blockchain data accessibility, comparing the characteristics of three data service protocols—The Graph, Chainbase, and Space and Time—in terms of architecture and AI technology applications. It points out that blockchain data services are evolving towards increased intelligence and security, and will continue to play a crucial role as foundational infrastructure in the industry in the future.

1. Introduction

Starting from the first wave of dApps in 2017, including Etheroll, ETHLend, and CryptoKitties, we now see a flourishing variety of financial, gaming, and social dApps based on different blockchains. When discussing decentralized on-chain applications, have we ever considered the sources of the various data these dApps utilize in their interactions?

In 2024, the focus is on AI and Web3. In the world of artificial intelligence, data is the lifeblood of growth and evolution. Just as plants rely on sunlight and water to thrive, AI systems depend on vast amounts of data to continually “learn” and “think.” Without data, even the most sophisticated AI algorithms are castles in the air, unable to deliver their intended intelligence and efficacy.

This article analyzes the evolution of blockchain data indexing from the perspective of data accessibility, comparing the established data indexing protocol The Graph with emerging blockchain data service protocols Chainbase and Space and Time. It particularly explores the similarities and differences in data services and product architecture between these two new protocols that incorporate AI technology.

2. The Complexity and Simplicity of Data Indexing: From Blockchain Nodes to Full-Chain Databases

2.1 Data Sources: Blockchain Nodes

From the moment we start to understand “what is blockchain,” we often come across the phrase: blockchain is a decentralized ledger. Blockchain nodes are the foundation of the entire blockchain network, responsible for recording, storing, and disseminating all on-chain transaction data. Each full node holds a complete copy of the blockchain data, which underpins the decentralization of the network. However, for ordinary users, building and maintaining a blockchain node is not an easy task. Doing so not only requires specialized technical skills but also incurs high hardware and bandwidth costs. Additionally, the query capabilities of ordinary nodes are limited, making it difficult to retrieve data in the format developers require. Therefore, while anyone can theoretically run their own node, in practice users tend to rely on third-party services.

To address this issue, RPC (Remote Procedure Call) node providers have emerged. These providers handle the costs and management of nodes and offer data through RPC endpoints, allowing users to access blockchain data without building their own nodes. Public RPC endpoints are free but come with rate limits, which may negatively impact the user experience of dApps. Private RPC endpoints offer better performance by reducing congestion, but even simple data retrieval requires substantial back-and-forth communication. This makes them request-heavy and inefficient for complex data queries. Moreover, private RPC endpoints often face scalability challenges and lack compatibility across different networks. However, the standardized API interfaces provided by node providers lower the barriers for users to access on-chain data, laying the groundwork for subsequent data parsing and applications.
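To make “request-heavy” concrete, the sketch below counts the JSON-RPC calls a client would need in order to collect every transaction receipt in a small block range. `eth_getBlockByNumber` and `eth_getTransactionReceipt` are standard Ethereum JSON-RPC methods; the fixed `txs_per_block` figure and the placeholder transaction hashes are assumptions made purely for illustration, and nothing is sent over the network.

```python
import json
from itertools import count

_ids = count(1)

def rpc_payload(method, params):
    """Build one JSON-RPC 2.0 request body (nothing is sent over the network)."""
    return {"jsonrpc": "2.0", "id": next(_ids), "method": method, "params": params}

def requests_for_receipts(first_block, last_block, txs_per_block):
    """List the requests needed to fetch every receipt in a block range.

    txs_per_block is a hypothetical, fixed transaction count used only to
    illustrate the fan-out; real blocks vary.
    """
    payloads = []
    for number in range(first_block, last_block + 1):
        # One call per block to learn which transactions it contains.
        payloads.append(rpc_payload("eth_getBlockByNumber", [hex(number), False]))
        # Then one call per transaction to obtain its receipt and logs.
        for i in range(txs_per_block):
            placeholder_hash = f"<tx {i} of block {number}>"
            payloads.append(rpc_payload("eth_getTransactionReceipt", [placeholder_hash]))
    return payloads

if __name__ == "__main__":
    batch = requests_for_receipts(1_000_000, 1_000_009, txs_per_block=150)
    print(len(batch), "requests for 10 blocks")
    print(json.dumps(batch[0]))
```

Under these assumptions, ten blocks already translate into more than a thousand round trips, which is the kind of overhead that motivates the indexing layers discussed in the following sections.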

2.2 Data Parsing: From Raw Data to Usable Data

The data obtained from blockchain nodes is often raw data that has been hashed and encoded (for example, as hex strings and ABI-encoded fields) rather than presented in a human-readable form. While this data retains the integrity and security of the blockchain, its complexity increases the difficulty of data parsing. For ordinary users or developers, directly handling this raw data requires substantial technical knowledge and computational resources.

In this context, the data parsing process becomes particularly important. By parsing complex raw data and transforming it into more understandable and operable formats, users can intuitively comprehend and utilize this data. The success of data parsing directly affects the efficiency and effectiveness of blockchain data applications, making it a critical step in the entire data indexing process.
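As a small illustration of what parsing involves, the sketch below decodes a raw ERC-20 `Transfer` event log into a readable record. The topic constant is the widely published Keccak-256 hash of `Transfer(address,address,uint256)`; the sample log, its addresses, and its amount are hand-written placeholders rather than real chain data.

```python
# Keccak-256 hash of the event signature Transfer(address,address,uint256).
TRANSFER_TOPIC = "0xddf252ad1be2c89b69c2b068fc378daa952ba7f163c4a11628f55a4df523b3ef"

def topic_to_address(topic):
    """An indexed address is left-padded to 32 bytes; keep the last 20 bytes."""
    return "0x" + topic[-40:]

def decode_transfer(log):
    """Turn a raw ERC-20 Transfer log into a readable record, or None for other events."""
    topics = log["topics"]
    if len(topics) != 3 or topics[0].lower() != TRANSFER_TOPIC:
        return None
    return {
        "token": log["address"],
        "from": topic_to_address(topics[1]),
        "to": topic_to_address(topics[2]),
        "value": int(log["data"], 16),  # un-indexed uint256 amount
    }

# Hand-written sample log (hypothetical addresses and amount).
raw_log = {
    "address": "0x" + "aa" * 20,
    "topics": [
        TRANSFER_TOPIC,
        "0x" + "00" * 12 + "11" * 20,
        "0x" + "00" * 12 + "22" * 20,
    ],
    "data": "0x" + format(1_500_000, "064x"),
}

print(decode_transfer(raw_log))
```

The amount is still expressed in the token's smallest unit; a production parser would also apply the token's decimals and handle the many other event types a contract can emit.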

2.3 Evolution of Data Indexers

As the volume of blockchain data increases, the demand for data indexers has also grown. Indexers play a crucial role in organizing on-chain data and sending it to databases for easy querying. The working principle of an indexer is to index blockchain data and make it readily available through a structured query interface (such as SQL or GraphQL APIs). By providing a unified interface for querying data, indexers allow developers to quickly and accurately retrieve the information they need using standardized query languages, significantly simplifying the process.
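The principle can be sketched with Python's built-in `sqlite3` module: decoded records are written into a table once, and later questions become single queries instead of repeated scans over raw blocks. The table layout, the column names, and the sample rows are assumptions chosen for illustration, not the schema of any particular indexer.

```python
import sqlite3

def build_index(transfers):
    """Load decoded transfer records into an in-memory SQLite table with an index."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE transfers (block INTEGER, sender TEXT, recipient TEXT, amount INTEGER)"
    )
    # The secondary index is what makes per-recipient lookups cheap.
    db.execute("CREATE INDEX idx_recipient ON transfers (recipient)")
    db.executemany("INSERT INTO transfers VALUES (?, ?, ?, ?)", transfers)
    db.commit()
    return db

def total_received(db, address):
    """Answer 'how much did this address receive?' with one query."""
    row = db.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM transfers WHERE recipient = ?", (address,)
    ).fetchone()
    return row[0]

# Hypothetical, already-decoded records: (block, sender, recipient, amount).
sample = [
    (100, "alice", "bob", 40),
    (101, "carol", "bob", 2),
    (102, "bob", "alice", 5),
]
db = build_index(sample)
print(total_received(db, "bob"))
```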

Different types of indexers optimize data retrieval in various ways:

· Full Node Indexers: These indexers run full blockchain nodes and directly extract data from them, ensuring data completeness and accuracy but requiring substantial storage and processing power.

· Lightweight Indexers: These indexers rely on full nodes to fetch specific data as needed, reducing storage requirements but potentially increasing query time.

· Specialized Indexers: These indexers focus on specific types of data or particular blockchains, optimizing retrieval for specific use cases, such as NFT data or DeFi transactions.

· Aggregated Indexers: These indexers extract data from multiple blockchains and sources, including off-chain information, providing a unified query interface, which is especially useful for multi-chain dApps.

Currently, an Ethereum archive node running the Geth client occupies about 13.5 TB of storage space, while an archive node running the Erigon client requires around 3 TB. As the blockchain continues to grow, the data storage requirements for archive nodes will also increase. In the face of such vast amounts of data, mainstream indexing protocols not only support multi-chain indexing but also provide customizable data parsing frameworks tailored to the data needs of different applications. The Graph’s “subgraph” framework is a typical example.

The emergence of indexers significantly enhances the efficiency of data indexing and querying. Compared to traditional RPC endpoints, indexers can efficiently index large amounts of data and support high-speed queries. These indexers allow users to perform complex queries, easily filter data, and analyze it post-extraction. Additionally, some indexers support aggregating data sources from multiple blockchains, avoiding the need to deploy multiple APIs in multi-chain dApps. By running distributed across multiple nodes, indexers provide stronger security and performance while reducing the risks of interruptions and downtimes associated with centralized RPC providers.

In contrast to raw RPC access, indexers enable users to obtain the information they need directly using predefined query languages without having to deal with the underlying complex data. This mechanism significantly improves the efficiency and reliability of data retrieval, representing an important innovation in blockchain data access.

2.4 Full-Chain Databases: Aligning Towards Streaming First

Using indexer nodes to query data usually means that APIs become the sole gateway for consuming on-chain data. However, when a project enters the scaling phase, it often requires more flexible data sources than standardized APIs can provide. As application demands become more complex, first-generation data indexers with their standardized indexing formats increasingly struggle to meet diverse querying needs, such as search, cross-chain access, or off-chain data mapping.

In modern data pipeline architecture, a “stream-first” approach has become a solution to the limitations of traditional batch processing, enabling real-time data ingestion, processing, and analysis. This paradigm shift allows organizations to respond immediately to incoming data, yielding insights and making decisions almost instantaneously. Similarly, the development of blockchain data service providers is progressing toward constructing blockchain data streams. Traditional indexing service providers have successively launched products that obtain real-time blockchain data through data streams, such as The Graph’s Substreams and Goldsky’s Mirror, as well as real-time data lakes like Chainbase and SubSquid that generate data streams based on blockchains.

These services aim to meet the demand for real-time parsing of blockchain transactions and to provide more comprehensive querying capabilities. Just as the “stream-first” architecture revolutionizes data processing and consumption in traditional data pipelines by reducing latency and enhancing responsiveness, these blockchain data stream providers also seek to support the development of more applications and assist in on-chain data analysis through more advanced and mature data sources.
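The difference from batch processing can be illustrated with a short generator-based sketch: each block is processed the moment it arrives, and an updated aggregate is available immediately rather than after a periodic job. The block structure and token names are hypothetical, and the list-backed `block_stream` merely stands in for a live feed.

```python
from collections import defaultdict

def block_stream(blocks):
    """Stand-in for a live feed: yields one block at a time as it 'arrives'."""
    for block in blocks:
        yield block

def running_volume(stream):
    """Update per-token volume after every block and yield a snapshot immediately."""
    totals = defaultdict(int)
    for block in stream:
        for token, amount in block["transfers"]:
            totals[token] += amount
        yield block["number"], dict(totals)

# Hypothetical blocks with already-decoded transfers.
blocks = [
    {"number": 1, "transfers": [("USDC", 100), ("DAI", 50)]},
    {"number": 2, "transfers": [("USDC", 25)]},
    {"number": 3, "transfers": []},
]

for number, snapshot in running_volume(block_stream(blocks)):
    print(number, snapshot)
```

A consumer of `running_volume` never waits for the whole data set; it sees a fresh snapshot per block, which is the property stream-first pipelines aim for.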

By redefining the challenges of on-chain data from the perspective of modern data pipelines, we can view the management, storage, and provision of on-chain data from a new angle, realizing its full potential. When we start to see subgraphs and Ethereum ETL indexing services as data streams within the data pipeline rather than final outputs, we can envision a possible world where high-performance datasets are tailored for any business use case.

3. AI + Database? In-Depth Comparison of The Graph, Chainbase, and Space and Time

3.1 The Graph

The Graph network provides multi-chain data indexing and query services through a decentralized network of nodes, enabling developers to conveniently index blockchain data and build decentralized applications. Its primary product models are the data query execution market and the data indexing cache market, both of which serve users’ query needs. In the data query execution market, consumers pay suitable Indexers for the data they require. In the data indexing cache market, Indexers allocate resources based on factors such as the historical indexing popularity of subgraphs, the query fees collected, and the demand from on-chain curators for subgraph outputs.

Subgraphs are the fundamental data structures within The Graph network. They define how to extract and transform data from the blockchain into a queryable format (e.g., GraphQL schema). Anyone can create a subgraph, and multiple applications can reuse these subgraphs, enhancing data reusability and operational efficiency.
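From the consumer's side, querying a subgraph comes down to posting a GraphQL document to its query endpoint. The sketch below only builds that request body. The `transfers` entity and its fields are hypothetical, since each subgraph defines its own schema; `first`, `orderBy`, `orderDirection`, and `where` are the filtering arguments The Graph generates for entity collections. No endpoint URL is assumed and nothing is sent.

```python
import json

# Hypothetical entity and field names; a real subgraph defines its own schema.
QUERY = """
query RecentTransfers($to: Bytes!, $n: Int!) {
  transfers(first: $n, orderBy: blockNumber, orderDirection: desc, where: { to: $to }) {
    id
    from
    to
    value
  }
}
"""

def graphql_body(recipient, limit=5):
    """Serialize the JSON body a client would POST to a subgraph's query endpoint."""
    return json.dumps({"query": QUERY, "variables": {"to": recipient, "n": limit}})

body = graphql_body("0x" + "22" * 20, limit=3)
print(body[:60])
```

A single request of this shape returns filtered, sorted, already-decoded entities, which is what makes a published subgraph reusable across applications.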

The Graph network consists of four key roles: Indexers, Delegators, Curators, and Developers, all of which work together to provide data support for Web3 applications. Their respective responsibilities are as follows:

· Indexers: Indexers are node operators within The Graph network who participate by staking GRT (The Graph’s native token). They provide indexing and query processing services.

· Delegators: Delegators are users who stake GRT tokens to support the operation of Indexers. They earn a portion of the rewards of the Indexers they delegate to.

· Curators: Curators are responsible for signaling which subgraphs should be indexed by the network. They help ensure that valuable subgraphs are prioritized for processing.

· Developers: Unlike the previous three roles, Developers are the demand side and are the primary users of The Graph. They create and submit subgraphs to The Graph network, waiting for the network to fulfill their data needs.

The Graph has now fully transitioned to a decentralized subgraph hosting service, with economic incentives flowing between different participants to ensure the system’s operation:

· Indexer Rewards: Indexers earn income through consumer query fees and a portion of GRT token block rewards.

· Delegator Rewards: Delegators receive a share of rewards from the indexers they support.

· Curator Rewards: If curators signal valuable subgraphs, they can earn a portion of the query fees.

In fact, The Graph’s products are rapidly evolving in the AI wave. As one of the core development teams in The Graph ecosystem, Semiotic Labs has been focused on leveraging AI technology to optimize indexing pricing and user query experience. Currently, the tools developed by Semiotic Labs, such as AutoAgora, Allocation Optimizer, and AgentC, enhance various aspects of the ecosystem’s performance.

· AutoAgora introduces a dynamic pricing mechanism that adjusts prices in real time based on query volume and resource usage, optimizing pricing strategies to ensure indexer competitiveness and maximize revenue.

· Allocation Optimizer addresses the complex issues of subgraph resource allocation, helping indexers achieve optimal resource configuration to enhance revenue and performance.

· AgentC is an experimental tool that allows users to access The Graph’s blockchain data using natural language, thereby improving the user experience.

The application of these tools has allowed The Graph to further enhance system intelligence and user-friendliness with AI assistance.

3.2 Chainbase

Chainbase is a comprehensive data network that integrates all blockchain data into a single platform, making it easier for developers to build and maintain applications. Its unique features include:

· Real-time Data Lake: Chainbase provides a real-time data lake specifically for blockchain data streams, allowing instant access to data as it is generated.

· Dual-chain Architecture: Chainbase is built on an EigenLayer AVS, creating an execution layer that runs in parallel with CometBFT’s consensus algorithm. This design enhances cross-chain data programmability and composability, supporting high throughput, low latency, and finality, while improving network security through a dual-staking model.

· Innovative Data Format Standard: Chainbase introduces a new data format standard called “manuscripts,” optimizing the structuring and utilization of data in the crypto industry.

· Cryptoworld Model: With its extensive blockchain data resources, Chainbase combines AI model technology to create AI models that effectively understand, predict, and interact with blockchain transactions. The basic model, Theia, is now available for public use.

These features set Chainbase apart in blockchain indexing protocols, focusing on real-time data accessibility, innovative data formats, and the creation of smarter models through the integration of on-chain and off-chain data to enhance insights.

Chainbase’s AI model, Theia, is a key highlight that differentiates it from other data service protocols. Based on NVIDIA’s DORA model, Theia learns and analyzes crypto patterns by integrating on-chain and off-chain data along with spatiotemporal activities. Through causal reasoning, it explores the potential value and patterns of on-chain data in greater depth, providing users with more intelligent data services.

AI-enabled data services have transformed Chainbase from merely a blockchain data service platform into a more competitive intelligent data service provider. With robust data resources and proactive AI analysis, Chainbase can offer broader data insights and optimize users’ data processing workflows.

3.3 Space and Time

Space and Time (SxT) aims to create a verifiable computation layer that extends zero-knowledge proofs to a decentralized data warehouse, providing trustworthy data processing for smart contracts, large language models, and enterprises. Space and Time recently secured $20 million in a Series A funding round led by Framework Ventures, Lightspeed Faction, Arrington Capital, and Hivemind Capital.

In the field of data indexing and verification, Space and Time introduces a new technical approach: Proof of SQL. This is a zero-knowledge proof (ZKP) technology developed by Space and Time that ensures SQL queries executed on the decentralized data warehouse are tamper-proof and verifiable. When a query is run, Proof of SQL generates a cryptographic proof that verifies the integrity and accuracy of the query results. This proof is appended to the query results, allowing any verifier (such as a smart contract) to independently confirm that the data has not been tampered with during processing.

Traditional blockchain networks usually rely on consensus mechanisms to verify data authenticity, whereas Space and Time’s Proof of SQL implements a more efficient data verification method. Specifically, in Space and Time’s system, one node is responsible for data acquisition while other nodes use zero-knowledge technology to verify the authenticity of that data. This avoids the resource cost of multiple nodes redundantly indexing the same data to reach consensus, thereby improving overall system performance. As this technology matures, it can serve as a cornerstone for traditional industries that prioritize data reliability to build products on blockchain data.

At the same time, SxT has been closely collaborating with Microsoft’s AI joint innovation lab to accelerate the development of generative AI tools, allowing users to easily process blockchain data through natural language. Currently, in Space and Time Studio, users can input natural language queries, and the AI will automatically convert them into SQL and execute the query on behalf of the user to present the final results needed.

3.4 Comparison of Differences

4. Conclusion and Outlook

In summary, blockchain data indexing technology has evolved step by step: from raw node data sources, through data parsing and indexers, to AI-enabled full-chain data services. This continuous evolution not only improves the efficiency and accuracy of data access but also gives users a more intelligent experience.

Looking ahead, with the ongoing development of new technologies such as AI and zero-knowledge proofs, blockchain data services will become even more intelligent and secure. We have reason to believe that blockchain data services will continue to play a vital role as infrastructure, providing strong support for progress and innovation in the industry.

Disclaimer:

  1. This article is reproduced from [Trustless Labs], and the copyright belongs to the original author [Trustless Labs]. If you have any objections to the reprint, please contact the Gate Learn team, and the team will handle it as soon as possible according to relevant procedures.

  2. The views and opinions expressed in this article represent only the author’s personal views and do not constitute any investment advice.

  3. Other language versions of the article are translated by the Gate Learn team. Unless Gate.io is credited, the translated article may not be reproduced, distributed, or plagiarized.
