# Ingestion

> Upload files directly to S3 or connect 300+ sources via Airbyte — with automatic schema discovery and catalog registration.

## Overview

Brighthive supports two primary ingestion paths: direct S3 upload with automatic schema discovery, and Airbyte connectors for external data sources. Both paths automatically register metadata in Neo4j, making ingested data immediately discoverable by BrightAgent and the webapp.

## Direct S3 Upload

Direct S3 upload is the primary ingestion path for most organizations. Upload files to your dedicated S3 data lake, and the platform handles schema discovery, cataloging, and metadata registration:

```mermaid theme={null}
graph LR
    A[Upload to S3] --> B[EventBridge Trigger]
    B --> C[Step Functions]
    C --> D[Glue Crawler]
    D --> E[Schema Discovery]
    E --> F[Neo4j Metadata Sync]
    F --> G[Available for Querying]
```

1. **Upload** — Files land in the organization's S3 data lake (`brighthive-raw/` bucket).
2. **EventBridge detects the upload** — S3 `ObjectCreated` events trigger an EventBridge rule.
3. **Step Functions orchestrate ingestion** — The `DataIngestionStateMachine` coordinates the pipeline.
4. **Glue crawlers discover the schema** — Crawlers automatically infer column names, data types, partitions, and file format.
5. **Metadata syncs to Neo4j** — A Lambda function updates the platform's knowledge graph with schema, row counts, and location.
6. **Data is queryable** — Immediately available via Redshift Spectrum external tables and BrightAgent.
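
To make step 2 more concrete, the following is a minimal sketch of what an EventBridge event pattern for S3 uploads might look like, written as a Python dictionary. S3 publishes these events to EventBridge with the detail type `Object Created`. The bucket name is taken from the `brighthive-raw/` reference above, and the `matches` helper is a simplified, hypothetical stand-in for EventBridge's own matching logic; the rule Brighthive actually deploys may differ.

```python
import json

# Assumed bucket name, based on the `brighthive-raw/` reference above.
RAW_BUCKET = "brighthive-raw"

# Event pattern for S3 "Object Created" notifications delivered via EventBridge.
EVENT_PATTERN = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {"bucket": {"name": [RAW_BUCKET]}},
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matcher for illustration only.

    Every key in the pattern must exist in the event. A list in the pattern
    means "the event value is one of these"; a dict means "recurse".
    """
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# A trimmed-down example of the event S3 emits when a file lands in the bucket.
sample_event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {
        "bucket": {"name": RAW_BUCKET},
        "object": {"key": "sales/2024/orders.csv", "size": 1024},
    },
}

print(json.dumps(EVENT_PATTERN, indent=2))
print(matches(EVENT_PATTERN, sample_event))
```

Scoping the pattern to a single bucket keeps uploads to unrelated buckets in the same account from starting the ingestion pipeline.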

### Supported File Formats

* CSV, Parquet, JSON, Avro, ORC, Excel
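
A client-side check before uploading can catch unsupported files early. The sketch below assumes a mapping from file extensions to the formats listed above; the exact extensions the platform accepts are not documented here, so treat `SUPPORTED_EXTENSIONS` as illustrative. The `upload` helper is hypothetical and uses the standard `boto3` S3 `upload_file` call.

```python
from pathlib import Path

# Assumed extension-to-format mapping; the platform's accepted
# extensions may differ from this list.
SUPPORTED_EXTENSIONS = {
    ".csv": "CSV",
    ".parquet": "Parquet",
    ".json": "JSON",
    ".avro": "Avro",
    ".orc": "ORC",
    ".xlsx": "Excel",
}


def detect_format(filename: str) -> str:
    """Return the format name for a file, or raise if it is not supported."""
    ext = Path(filename).suffix.lower()
    if ext not in SUPPORTED_EXTENSIONS:
        raise ValueError(f"Unsupported file extension: {ext or '(none)'}")
    return SUPPORTED_EXTENSIONS[ext]


def upload(path: str, bucket: str, key: str) -> None:
    """Validate the file format, then upload the file to S3."""
    detect_format(path)
    # Imported lazily so the format check works without AWS dependencies.
    import boto3

    boto3.client("s3").upload_file(path, bucket, key)


print(detect_format("exports/orders.CSV"))
```

Calling `upload("orders.csv", "brighthive-raw", "sales/orders.csv")` would then place the file in the data lake, assuming valid AWS credentials and that bucket name.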

## Airbyte (Optional)

For organizations that need to ingest data from external systems, Brighthive offers a self-hosted Airbyte instance:

<CardGroup cols={2}>
  <Card title="300+ Connectors" icon="plug">
    Connect to Shopify, HubSpot, PostgreSQL, MySQL, Salesforce, Google Analytics, and hundreds more sources.
  </Card>

  <Card title="Automated Setup" icon="wand-magic-sparkles">
    Sources and connections are created programmatically through the GraphQL API — no manual Airbyte configuration needed.
  </Card>

  <Card title="Sync Scheduling" icon="clock">
    Configure sync schedules per source — hourly, daily, or on-demand.
  </Card>

  <Card title="Self-Hosted" icon="server">
    Runs on EC2 within the organization's dedicated AWS account for data isolation and security.
  </Card>
</CardGroup>

## Glue Data Catalog

AWS Glue is central to the ingestion pipeline:

* **Crawlers** automatically run when new data arrives, detecting schema changes and new partitions.
* **Data Catalog** stores table schemas, column metadata, and S3 locations.
* **Cross-account access** — Workspace Redshift reads the organization's Glue catalog via the `OrgDataCatalogRole` IAM role.
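
As an illustration of the cross-account setup, the sketch below builds a Redshift `CREATE EXTERNAL SCHEMA` statement that points at a Glue database in another AWS account. It assumes the workspace cluster is allowed to use `OrgDataCatalogRole` directly; the schema name, database name, and account ID are placeholders, and the statement Brighthive actually provisions may use a different role arrangement.

```python
import re


def external_schema_sql(
    schema: str,
    glue_database: str,
    role_arn: str,
    catalog_account_id: str,
) -> str:
    """Build a Redshift Spectrum external schema statement (illustrative only)."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", schema):
        raise ValueError(f"Invalid schema name: {schema!r}")
    if not re.fullmatch(r"\d{12}", catalog_account_id):
        raise ValueError("AWS account IDs are 12 digits")
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM DATA CATALOG\n"
        f"DATABASE '{glue_database}'\n"
        f"IAM_ROLE '{role_arn}'\n"
        # CATALOG_ID names the account that owns the Glue Data Catalog.
        f"CATALOG_ID '{catalog_account_id}';"
    )


# Placeholder values: the account ID and names below are hypothetical.
sql = external_schema_sql(
    schema="org_lake",
    glue_database="raw_data",
    role_arn="arn:aws:iam::111122223333:role/OrgDataCatalogRole",
    catalog_account_id="111122223333",
)
print(sql)
```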

## What Happens After Ingestion

Once data is ingested and cataloged:

* **Neo4j** has full metadata — schema, location, row count, last updated timestamp.
* **Redshift** can query the data via Spectrum external tables.
* **BrightAgent** can discover and analyze the data through natural language.
* **Snowflake sync** is triggered if the organization uses Snowflake (via the `SnowflakeIngestionStateMachine`).
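
The metadata fields listed for Neo4j above could be modeled roughly as follows. This is a sketch under stated assumptions: the `Dataset` node label, the property names, and the `DatasetMetadata` class are hypothetical and do not describe the platform's actual graph model.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class DatasetMetadata:
    location: str            # S3 location of the table data
    columns: dict[str, str]  # column name -> data type
    row_count: int
    last_updated: datetime


# Hypothetical Cypher: the `Dataset` label and property names are assumptions.
# MERGE keeps the write idempotent, so re-running a sync updates the same node.
UPSERT_QUERY = """
MERGE (d:Dataset {location: $location})
SET d.columns = $columns,
    d.row_count = $row_count,
    d.last_updated = $last_updated
"""


def to_params(meta: DatasetMetadata) -> dict:
    """Flatten the metadata into query parameters.

    Neo4j properties cannot hold nested maps, so columns are stored
    as a list of "name:type" strings.
    """
    return {
        "location": meta.location,
        "columns": [f"{name}:{dtype}" for name, dtype in meta.columns.items()],
        "row_count": meta.row_count,
        "last_updated": meta.last_updated.isoformat(),
    }


meta = DatasetMetadata(
    location="s3://brighthive-raw/sales/orders/",
    columns={"order_id": "bigint", "total": "double"},
    row_count=1250,
    last_updated=datetime(2024, 1, 15, tzinfo=timezone.utc),
)
params = to_params(meta)
print(params)
```

The query and parameters could then be passed to a Neo4j session's `run` method by whatever process performs the sync.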
