Overview
Brighthive supports two primary ingestion paths: direct S3 upload with automatic schema discovery, and Airbyte connectors for external data sources. Both paths automatically register metadata in Neo4j, making ingested data immediately discoverable by BrightAgent and the webapp.

Direct S3 Upload
This is the primary ingestion path for most organizations. Upload files to your dedicated S3 data lake and the platform handles everything else:

- Upload — Files land in the organization's S3 data lake (the `brighthive-raw/` bucket).
- EventBridge detects the upload — S3 `ObjectCreated` events trigger an EventBridge rule.
- Step Functions orchestrate ingestion — The `DataIngestionStateMachine` coordinates the pipeline.
- Glue crawlers discover the schema — Crawlers automatically infer column names, data types, partitions, and file format.
- Metadata syncs to Neo4j — A Lambda function updates the platform's knowledge graph with schema, row counts, and location.
- Data is queryable — The data is immediately available via Redshift Spectrum external tables and BrightAgent.
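As a rough illustration of the schema-discovery step, the sketch below infers column names and types from a CSV sample. It is far simpler than what a Glue crawler actually does; the `infer_csv_schema` helper, sample data, and type names are illustrative, not part of the platform.

```python
import csv
import io

def infer_type(values):
    """Pick the narrowest type (bigint -> double -> string) that fits every value."""
    for cast, name in ((int, "bigint"), (float, "double")):
        try:
            for v in values:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"

def infer_csv_schema(text):
    """Return [(column, type), ...] inferred from a CSV sample with a header row."""
    rows = list(csv.reader(io.StringIO(text)))
    header, data = rows[0], rows[1:]
    columns = list(zip(*data))  # transpose rows into per-column value lists
    return [(name, infer_type(col)) for name, col in zip(header, columns)]

sample = "id,amount,region\n1,9.99,us-east\n2,12.50,eu-west\n"
print(infer_csv_schema(sample))
# → [('id', 'bigint'), ('amount', 'double'), ('region', 'string')]
```

A real crawler additionally detects partition keys from the S3 path layout and records the result in the Glue Data Catalog.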
Supported File Formats
- CSV, Parquet, JSON, Avro, ORC, Excel
Airbyte (Optional)
For organizations that need to ingest data from external systems, Brighthive offers a self-hosted Airbyte instance:

300+ Connectors
Connect to Shopify, HubSpot, PostgreSQL, MySQL, Salesforce, Google Analytics, and hundreds more sources.
Automated Setup
Sources and connections are created programmatically through the GraphQL API — no manual Airbyte configuration needed.
Sync Scheduling
Configure sync schedules per source — hourly, daily, or on-demand.
Self-Hosted
Runs on EC2 within the organization’s dedicated AWS account for data isolation and security.
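Because sources are created programmatically, registering a connector amounts to sending one GraphQL mutation. The sketch below only assembles the request body; the mutation name, fields, and schedule values are hypothetical stand-ins, not Brighthive's actual GraphQL schema.

```python
import json

# Hypothetical mutation; the real Brighthive GraphQL schema may differ.
CREATE_SOURCE_MUTATION = """
mutation CreateAirbyteSource($name: String!, $connector: String!, $schedule: String!) {
  createAirbyteSource(name: $name, connector: $connector, schedule: $schedule) {
    id
    status
  }
}
"""

def build_create_source_payload(name, connector, schedule="daily"):
    """Assemble the JSON body for a GraphQL request that registers an Airbyte source."""
    return {
        "query": CREATE_SOURCE_MUTATION,
        "variables": {"name": name, "connector": connector, "schedule": schedule},
    }

payload = build_create_source_payload("shop-orders", "shopify", schedule="hourly")
print(json.dumps(payload["variables"]))
```

POSTing this body to the platform's GraphQL endpoint would create the source and its sync schedule in one call, with no manual Airbyte UI configuration.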
Glue Data Catalog
AWS Glue is central to the ingestion pipeline:

- Crawlers automatically run when new data arrives, detecting schema changes and new partitions.
- Data Catalog stores table schemas, column metadata, and S3 locations.
- Cross-account access — Workspace Redshift reads the organization's Glue catalog via the `OrgDataCatalogRole` IAM role.
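The cross-account read works by mounting the Glue database as a Redshift Spectrum external schema through that IAM role. The helper below sketches the DDL involved; the schema name, Glue database name, account ID, and region are placeholders (only the `OrgDataCatalogRole` name comes from this page).

```python
def external_schema_ddl(schema, glue_database, account_id, region="us-east-1"):
    """Build the CREATE EXTERNAL SCHEMA statement that lets Redshift Spectrum
    read a Glue Data Catalog database through the OrgDataCatalogRole."""
    role_arn = f"arn:aws:iam::{account_id}:role/OrgDataCatalogRole"
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM DATA CATALOG DATABASE '{glue_database}'\n"
        f"IAM_ROLE '{role_arn}'\n"
        f"REGION '{region}';"
    )

print(external_schema_ddl("org_raw", "org_glue_db", "123456789012"))
```

Once the external schema exists, every table the crawlers register in the Glue database is immediately queryable from the workspace without copying data into Redshift.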
What Happens After Ingestion
Once data is ingested and cataloged:

- Neo4j has full metadata — schema, location, row count, last-updated timestamp.
- Redshift can query the data via Spectrum external tables.
- BrightAgent can discover and analyze the data through natural language.
- Snowflake sync is triggered if the organization uses Snowflake (via the `SnowflakeIngestionStateMachine`).

