Data Flow

We use Apache Kafka to transport messages through our pipeline. Kafka is inherently distributed, scalable, and fault-tolerant, which is essential for our high-throughput, low-latency, and critical data. The L3 Atom infrastructure has been built from the ground up for horizontal scalability, leveraging Kafka's parallelism and distributed architecture to ensure we can always keep up with intense throughput.

Once collected, all CEX data is produced into a raw topic as semi-structured JSON, partitioned by exchange-symbol tuples. Blockchain data is already structured, so it is produced to separate topics in Apache Avro format. Avro is a JSON-like data format that preserves schema information and minimizes message size.
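
As a rough illustration, a collector might publish raw CEX messages like the minimal sketch below, using the confluent-kafka Python client. The topic name, key format, and payload fields are assumptions for the example, not the production configuration.

```python
import json
from confluent_kafka import Producer

producer = Producer({"bootstrap.servers": "localhost:9092"})

def produce_raw(exchange: str, symbol: str, payload: dict) -> None:
    # Keying by the exchange-symbol tuple keeps every event for the same
    # market on the same partition, preserving per-market ordering.
    key = f"{exchange}-{symbol}".encode("utf-8")
    value = json.dumps(payload).encode("utf-8")    # semi-structured JSON
    producer.produce("raw", key=key, value=value)  # "raw" topic name is assumed
    producer.poll(0)                               # serve delivery callbacks

produce_raw("coinbase", "BTC-USD", {"type": "trade", "price": 43000.5, "size": 0.25})
producer.flush()
```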

At this point, a network of stream processors consumes the data. This is a large consumer group that ingests all raw market data and processes it. Thanks to partitioning, these processes run in parallel, yet events for a given exchange and symbol are guaranteed to be processed in order. All raw data is also archived in object storage as compressed JSON. Blockchain data is processed as well, in particular smart contract events, which often embed important data points that must be decoded before they can be used.
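
A single worker in that consumer group could look roughly like the sketch below. The group and topic names are assumptions; the key point is that every worker sharing the same group.id is assigned a disjoint set of partitions, which is what gives parallelism while keeping per-partition ordering.

```python
import json
from confluent_kafka import Consumer

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "stream-processors",   # assumed group name
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["raw"])            # assumed raw-topic name

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    event = json.loads(msg.value())
    # ...normalise `event` into the standardized schema, re-produce it,
    # and archive the original payload to object storage...
```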

Once processed, the data is sorted and produced to a collection of standardized topics, each with a consistent schema for its messages. At this point all of the data is structured, so it is encoded as Avro.
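
For a sense of what a standardized, Avro-encoded message might look like, here is an illustrative record schema and encoding step using fastavro. The field names are hypothetical and do not reproduce the published schema reference.

```python
import io
from fastavro import parse_schema, schemaless_writer

# Hypothetical schema for a standardized trade topic.
trade_schema = parse_schema({
    "type": "record",
    "name": "Trade",
    "namespace": "l3atom.standardized",
    "fields": [
        {"name": "exchange", "type": "string"},
        {"name": "symbol", "type": "string"},
        {"name": "price", "type": "double"},
        {"name": "size", "type": "double"},
        {"name": "side", "type": "string"},
        {"name": "event_timestamp", "type": "long"},
    ],
})

buf = io.BytesIO()
schemaless_writer(buf, trade_schema, {
    "exchange": "coinbase", "symbol": "BTC-USD",
    "price": 43000.5, "size": 0.25, "side": "buy",
    "event_timestamp": 1700000000000,
})
avro_bytes = buf.getvalue()   # compact, schema-consistent binary payload
```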

Processed data is archived into object storage by a process that converts the Avro data into Apache Parquet, a column-based format with very small file sizes, well suited to structured data destined for object storage. This process periodically collects chunks of data, combines each chunk into a single file, and stores it under a datetime partition. The file structure organises events by year, month, day, and hour for efficient indexing and retrieval.
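The archival step could be sketched as follows using pyarrow. The directory layout mirrors the year/month/day/hour partitioning described above, while the paths and field names are assumptions.

```python
import os
from datetime import datetime, timezone
import pyarrow as pa
import pyarrow.parquet as pq

def archive_chunk(records: list, topic: str, base_path: str = "archive") -> str:
    """Write one chunk of decoded events as a single Parquet file."""
    table = pa.Table.from_pylist(records)      # columnar, compact on disk
    now = datetime.now(timezone.utc)
    directory = (f"{base_path}/{topic}/year={now.year}/month={now.month:02d}/"
                 f"day={now.day:02d}/hour={now.hour:02d}")
    os.makedirs(directory, exist_ok=True)
    path = f"{directory}/chunk-{int(now.timestamp())}.parquet"
    pq.write_table(table, path, compression="snappy")
    return path
```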

Another process places the data into a Postgres database, where it is indexed and queryable. PowerQuery connects to this database directly, letting users write arbitrary SQL and perform advanced analytics. The loader inserts up to 1000 rows at a time, placing each event topic into its own table.
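
A minimal sketch of that loading step, assuming psycopg2 and hypothetical table and column names:

```python
import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=l3atom user=loader")   # assumed DSN

def load_batch(topic: str, rows: list) -> None:
    # One table per standardized topic; rows are written up to 1000 at a time.
    with conn, conn.cursor() as cur:
        execute_values(
            cur,
            f"INSERT INTO {topic} (exchange, symbol, price, size, event_timestamp) VALUES %s",
            rows,
            page_size=1000,
        )

load_batch("trades", [("coinbase", "BTC-USD", 43000.5, 0.25, 1700000000000)])
```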

Finally, the data is consumed by our websocket broadcaster consumer group, which streams market events out to our users. These are typical Kafka consumers that micro-batch messages to increase throughput. The consumer group is highly scalable, and rebalances occur automatically as the group scales up and down to meet throughput demands and handle dropped consumers.
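
As a rough sketch of one broadcaster worker (the group name, topic, and port are assumptions, and Avro decoding is elided): it joins the broadcaster consumer group, pulls a micro-batch of messages from Kafka, and fans the batch out to every connected websocket client.

```python
import asyncio
import websockets
from confluent_kafka import Consumer

clients = set()

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "ws-broadcasters",       # assumed group name
    "auto.offset.reset": "latest",
})
consumer.subscribe(["trades"])           # assumed standardized topic

async def handler(websocket):
    clients.add(websocket)
    try:
        await websocket.wait_closed()
    finally:
        clients.discard(websocket)

async def broadcast_loop():
    while True:
        # Micro-batch: pull up to 500 messages per cycle off a worker thread.
        batch = await asyncio.to_thread(consumer.consume, 500, 0.1)
        values = [m.value() for m in batch if m is not None and not m.error()]
        if values and clients:
            # In production each Avro payload would be decoded first; for
            # brevity this sketch assumes the values are JSON strings already.
            payload = "[" + ",".join(v.decode("utf-8") for v in values) + "]"
            await asyncio.gather(*(ws.send(payload) for ws in clients),
                                 return_exceptions=True)

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):   # assumed port
        await broadcast_loop()

asyncio.run(main())
```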