Overview

Brighthive supports two primary ingestion paths: direct S3 upload with automatic schema discovery, and Airbyte connectors for external data sources. Both paths automatically register metadata in Neo4j, making ingested data immediately discoverable by BrightAgent and the webapp.

Direct S3 Upload

This is the primary ingestion path for most organizations: upload files to your dedicated S3 data lake and the platform handles everything else:
  1. Upload — Files land in the organization’s S3 data lake (brighthive-raw/ bucket).
  2. EventBridge detects the upload — S3 ObjectCreated events trigger an EventBridge rule.
  3. Step Functions orchestrate ingestion — The DataIngestionStateMachine coordinates the pipeline.
  4. Glue crawlers discover the schema — Crawlers automatically infer column names, data types, partitions, and file format.
  5. Metadata syncs to Neo4j — A Lambda function updates the platform’s knowledge graph with schema, row counts, and location.
  6. Data is queryable — Immediately available via Redshift Spectrum external tables and BrightAgent.
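
The routing in steps 2–3 can be sketched as an EventBridge-style pattern match. The bucket name comes from the list above, but the rule shape and the matcher below are an illustrative simplification of EventBridge semantics, not the platform's actual rule:

```python
# Sketch of an EventBridge rule that routes S3 "Object Created" events
# from the raw bucket toward the ingestion state machine.
S3_UPLOAD_RULE = {
    "source": ["aws.s3"],
    "detail-type": ["Object Created"],
    "detail": {"bucket": {"name": ["brighthive-raw"]}},
}

def matches(rule: dict, event: dict) -> bool:
    """Minimal pattern matcher: every rule key must exist in the event,
    nested dicts recurse, and leaf values must be in the rule's allow-list."""
    for key, expected in rule.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True

# An event in the shape S3 delivers via EventBridge (object key is illustrative).
upload_event = {
    "source": "aws.s3",
    "detail-type": "Object Created",
    "detail": {
        "bucket": {"name": "brighthive-raw"},
        "object": {"key": "sales/2024/q1.parquet"},
    },
}
```

A matching event would start the DataIngestionStateMachine; events from other buckets fall through the rule.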

Supported File Formats

  • CSV, Parquet, JSON, Avro, ORC, Excel
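
As a rough sketch, the formats above can be recognized from file extensions; note the real pipeline relies on Glue crawler classifiers (which inspect file contents), so this extension map is illustrative only:

```python
from pathlib import Path
from typing import Optional

# Extension → format mapping for the formats listed above (assumed extensions).
FORMATS = {
    ".csv": "CSV", ".parquet": "Parquet", ".json": "JSON",
    ".avro": "Avro", ".orc": "ORC", ".xlsx": "Excel", ".xls": "Excel",
}

def guess_format(object_key: str) -> Optional[str]:
    """Best-effort format guess from an S3 object key's extension."""
    return FORMATS.get(Path(object_key).suffix.lower())
```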

Airbyte (Optional)

For organizations that need to ingest data from external systems, Brighthive offers a self-hosted Airbyte instance:

300+ Connectors

Connect to Shopify, HubSpot, PostgreSQL, MySQL, Salesforce, Google Analytics, and hundreds more sources.

Automated Setup

Sources and connections are created programmatically through the GraphQL API — no manual Airbyte configuration needed.
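
A call that registers a source programmatically might look like the sketch below. The mutation name, fields, and endpoint are assumptions for illustration, not Brighthive's actual GraphQL schema:

```python
import json

# Hypothetical mutation for creating an Airbyte source via the GraphQL API.
CREATE_SOURCE = """
mutation CreateAirbyteSource($name: String!, $connectorType: String!) {
  createAirbyteSource(name: $name, connectorType: $connectorType) { id }
}
"""

def build_request_body(name: str, connector_type: str) -> str:
    """Serialize a standard GraphQL HTTP request body (query + variables)."""
    return json.dumps({
        "query": CREATE_SOURCE,
        "variables": {"name": name, "connectorType": connector_type},
    })
```

The body would be POSTed to the platform's GraphQL endpoint; no manual work in the Airbyte UI is involved.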

Sync Scheduling

Configure sync schedules per source — hourly, daily, or on-demand.
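
The three cadences can be modeled as presets; the cron expressions below are illustrative defaults, not the platform's exact values, and "on-demand" simply attaches no schedule:

```python
from typing import Optional

# Assumed cadence → cron mapping (cron times are placeholders).
SCHEDULES = {
    "hourly": "0 * * * *",
    "daily": "0 2 * * *",   # e.g. 02:00 UTC
    "on-demand": None,
}

def schedule_for(source_config: dict) -> Optional[str]:
    """Resolve a source's sync cadence to a cron expression (or None)."""
    cadence = source_config.get("sync_cadence", "on-demand")
    if cadence not in SCHEDULES:
        raise ValueError(f"unknown cadence: {cadence}")
    return SCHEDULES[cadence]
```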

Self-Hosted

Runs on EC2 within the organization’s dedicated AWS account for data isolation and security.

Glue Data Catalog

AWS Glue is central to the ingestion pipeline:
  • Crawlers automatically run when new data arrives, detecting schema changes and new partitions.
  • Data Catalog stores table schemas, column metadata, and S3 locations.
  • Cross-account access — Workspace Redshift reads the organization’s Glue catalog via the OrgDataCatalogRole IAM role.
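
The cross-account wiring in the last bullet corresponds to Redshift Spectrum's `CREATE EXTERNAL SCHEMA ... FROM DATA CATALOG` DDL. The schema and database names below are placeholders; only the OrgDataCatalogRole name comes from the text above:

```python
def external_schema_ddl(schema: str, glue_db: str, account_id: str) -> str:
    """Build the DDL that attaches an org's Glue database to workspace Redshift,
    reading through the OrgDataCatalogRole cross-account IAM role."""
    role_arn = f"arn:aws:iam::{account_id}:role/OrgDataCatalogRole"
    return (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema}\n"
        f"FROM DATA CATALOG DATABASE '{glue_db}'\n"
        f"IAM_ROLE '{role_arn}'"
    )
```

Once the external schema exists, tables cataloged by the crawlers are queryable from Redshift without loading data.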

What Happens After Ingestion

Once data is ingested and cataloged:
  • Neo4j has full metadata — schema, location, row count, last updated timestamp.
  • Redshift can query the data via Spectrum external tables.
  • BrightAgent can discover and analyze the data through natural language.
  • Snowflake sync is triggered if the organization uses Snowflake (via the SnowflakeIngestionStateMachine).
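
The metadata in the first bullet might be written to the knowledge graph with a parameterized Cypher MERGE. The node label, property names, and sample values below are assumptions for illustration, not Brighthive's actual graph model:

```python
# Illustrative shape of the metadata the sync Lambda records per table.
table_meta = {
    "name": "sales_q1",
    "location": "s3://brighthive-raw/sales/2024/q1.parquet",
    "row_count": 120_000,
    "last_updated": "2024-04-01T00:00:00Z",
}

# Idempotent upsert: MERGE matches or creates the node, SET refreshes properties.
CYPHER = (
    "MERGE (t:Table {name: $name}) "
    "SET t.location = $location, "
    "t.rowCount = $row_count, "
    "t.lastUpdated = $last_updated"
)
```

Because MERGE is idempotent, re-running ingestion for the same table updates its metadata rather than creating a duplicate node.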