# Datafold

---

# Source: https://docs.datafold.com/integrations/databases/adls.md

# Azure Data Lake Storage (ADLS)

This integration supports both Azure Data Lake Storage and Azure Blob Storage.

**Steps to complete:**

1. [Create an app and service principal in Microsoft Entra](#create-an-app-and-service-principal-in-microsoft-entra)
2. [Configure your data connection in Datafold](#configure-your-data-connection-in-datafold)
3. [Create your first file diff](#create-your-first-file-diff)

## Create an app and service principal in Microsoft Entra

Create an app and service principal in Entra using a client secret (not a certificate). Check out [Microsoft's documentation](https://learn.microsoft.com/en-us/entra/architecture/service-accounts-principal) on this topic if you need help.

**NOTE** Use a client secret.

## Configure your data connection in Datafold

ADLS Data Connection

| Field Name | Description |
| --------------- | ---------------------------------------------------------------------------- |
| Connection name | The name you'd like to give to this connection in Datafold |
| Account Name | This is in the URL of any filepath in ADLS, e.g. `.dfs.core.windows.net//` |
| Client ID | The client ID of the app you created in Microsoft Entra |
| Client Secret | The client secret of the app you created in Microsoft Entra |
| Tenant ID | The tenant ID of the app you created in Microsoft Entra |

## Create your first file diff

For general guidance on how file diffs work in Datafold, check out our [file diffing docs](/data-diff/file-diffing).

When creating a diff, note that the file path you provide may differ depending on whether you're using ADLS or Blob Storage. For example:

* ADLS: `abfss:////.`
* Blob Storage: `az:////.`

---

# Source: https://docs.datafold.com/integrations/databases/amazon-s3.md

# Amazon S3

**Steps to complete:**

1. [Create a user with access to S3](#create-a-user-with-access-to-s3)
2. [Assign the user to the S3 bucket](#assign-the-user-to-the-s3-bucket)
3. [Create an access key for the user](#create-an-access-key-for-the-user)
4. [Configure your data connection in Datafold](#configure-in-datafold)

## Create a user with access to S3

To connect your Amazon S3 bucket, you will need to create a user for Datafold to use.

* Navigate to the [AWS Console](https://console.aws.amazon.com/).
* Click on the search bar in the top header, then find the **IAM** service and click on it.
* Click on the **Users** item in the Access Management section.
* Click on the **Create user** button.
* Create a user named `Datafold`.
* Attach the `AmazonS3FullAccess` policy to the user.
* When done, keep the user's ARN handy, as you'll need it in the next step.

## Assign the user to the S3 bucket

* Go to the S3 panel and select the bucket.
* Click on the **Permissions** tab.
* Click on **Edit** next to the **Bucket Policy**.
* Add the following policy:

```json theme={null}
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam:::user/Datafold" // Replace with your user's ARN
      },
      "Action": [
        "s3:GetObject",
        "s3:PutObject" // Optional: Only needed if you're planning to use this data connection as a destination for materialized diff results.
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket-name/*", // Replace with your bucket's ARN
        "arn:aws:s3:::your-bucket-name" // Replace with your bucket's ARN
      ]
    }
  ]
}
```

The Datafold user requires the following roles and permissions:

* **s3:GetObject** for read access.
* **s3:PutObject** for write access if you're planning to use this data connection as a destination for materialized diff results.

## Create an access key for the user

Next, go back to the **IAM** page to generate a key for Datafold.

* Click on the **Users** page.
* Click on the **Datafold** user.
* Click on the **Security Credentials** tab.
* Click on **Create access key** and select **Create new access key**.
* Select **JSON** and click **Create**.

## Configure in Datafold

| Field Name | Description |
| ---------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------- |
| Connection name | A name given to the data connection within Datafold |
| Bucket Name | The name of the bucket you want to connect to. |
| Bucket region | The region of the bucket you want to connect to. |
| Key ID | The access key ID generated in the [Create an access key for the user](#create-an-access-key-for-the-user) step |
| Secret Access Key | The secret access key generated in the [Create an access key for the user](#create-an-access-key-for-the-user) step |
| Directory for writing diff results | Optional. The directory in the bucket where diff results will be written. The user should have write access to this directory. |
| Default maximum number of rows to include in diff results | Optional. The maximum number of rows that a file with materialized results will contain. |

Click **Create**. Your data connection is ready!

---

# Source: https://docs.datafold.com/deployment-testing/getting-started/universal/api.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.datafold.com/llms.txt
> Use this file to discover all available pages before exploring further.

# API

> Learn how to set up and configure Datafold's API for CI/CD testing.

## 1. Create a repository integration

Integrate your code repository using the appropriate [integration](/integrations/code-repositories).

## 2. Create an API integration

In the Datafold app, create an API integration.

## 3. Set up the API integration

Complete the configuration by specifying the following fields:

### Basic settings

| Field Name | Description |
| ------------------ | --------------------------------------------------------- |
| Configuration name | Choose a name for your Datafold API integration. |
| Repository | Select the repository you configured in step 1. |
| Data Source | Select the data source your repository writes to. |

### Advanced settings: Configuration

| Field Name | Description |
| ------------------------------ | --------------------------------------------------------------------------------------------------------- |
| Diff Hightouch Models | Run data diffs for Hightouch models affected by your PR. |
| CI fails on primary key issues | If null or duplicate primary keys exist, CI will fail. |
| Pull Request Label | When this is selected, the Datafold CI process will only run when the 'datafold' label has been applied.
| | CI Diff Threshold | Data Diffs will only be run automatically for given CI Run if the number of diffs doesn't exceed this threshold. | | Custom base branch | If defined, the Datafold CI process will only run on pull requests with the specified base branch. | | Files to ignore | Datafold CI diffs all changed models in the PR if at least one modified file doesn’t match the ignore pattern. Datafold CI doesn’t run in the PR if all modified files should be ignored. ([Additional details.](/deployment-testing/configuration/datafold-ci/on-demand)) | ### Advanced settings: Sampling | Field Name | Description | | ------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | Enable sampling | Enable sampling for data diffs to optimize analyzing large datasets. | | Sampling tolerance | The tolerance to apply in sampling for all data diffs. | | Sampling confidence | The confidence to apply when sampling. | | Sampling threshold | Sampling will be disabled automatically if tables are smaller than specified threshold. If unspecified, default values will be used depending on the Data Source type. | ## 4. Obtain a Datafold API Key and CI config ID Generate a new Datafold API Key and obtain the CI config ID from the CI API integration settings page: You will need these values later on when setting up the CI Jobs. ## 5. Install Datafold SDK into your Python environment ```Bash theme={null} pip install datafold-sdk ``` ## 6. Configure your CI script(s) with the Datafold SDK Using the Datafold SDK, configure your CI script(s) to use the Datafold SDK `ci submit` command. The example below should be adapted to match your specific use-case. ```Bash theme={null} datafold ci submit --ci-config-id --pr-num --diffs ./diffs.json ``` Since Datafold cannot infer which tables have changed, you'll need to manually provide this information in a specific `json` file format. Datafold can then determine which models to diff in a CI run based on the `diffs.json` you pass in to the Datafold SDK `ci submit` command. ```Bash theme={null} [ { "prod": "MY.PROD.TABLE", // Production table to compare PR changes against "pr": "MY.PR.TABLE", // Changed table containing data modifications in the PR "pk": ["MY", "PK", "LIST"], // Primary key; can be an empty array // These fields are not required and can be omitted from the JSON file: "include_columns": ["COLUMNS", "TO", "INCLUDE"], "exclude_columns": ["COLUMNS", "TO", "EXCLUDE"] } ] ``` Note: The `JSON` file is optional and you can also achieve the same effect by using standard input (stdin) as shown here. However, for brevity, we'll use the `JSON` file approach in this example: ```Bash theme={null} datafold ci submit \ --ci-config-id \ --pr-num <<- EOF [{ "prod": "MY.PROD.TABLE", "pr": "MY.PR.TABLE", "pk": ["MY", "PK", "LIST"] }] ``` Implementation details will vary depending on [which CI tool](#ci-implementation-tools) you use. Please review the following instructions and examples for your organization's CI tool. **NOTE** Populating the `diffs.json` file is specific to your use case and therefore out of scope for this guide. The only requirement is to adhere to the `JSON` schema structure explained above. ## CI Implementation Tools We've created guides and templates for three popular CI tools. **HAVING TROUBLE SETTING UP DATAFOLD IN CI?** We're here to help! 
Please [reach out and chat with a Datafold Solutions Engineer](https://www.datafold.com/booktime). To add Datafold to your CI tool, add `datafold ci submit` step in your PR CI job. ```Bash theme={null} name: Datafold PR Job # Run this job when a commit is pushed to any branch except main on: pull_request: push: branches: - '!main' jobs: run: runs-on: ubuntu-20.04 # your image will vary steps: - name: Install Datafold SDK run: pip install -q datafold-sdk # ... - name: Upload what to diff to Datafold run: datafold ci submit --ci-config-id --pr-num ${PR_NUM} --diffs env: # env variables used by Datafold SDK internally DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }} DATAFOLD_HOST: ${DATAFOLD_HOST} # For Dedicated Cloud/private deployments of Datafold, # Set the "https://custom.url.datafold.com" variable as the base URL as an environment variable, either as a string or a project variable # There are multiple ways to get the PR_NUM, this is just a simple example PR_NUM: ${{ github.event.number }} ``` Be sure to replace `` with the [CI config ID](#4-obtain-a-datafold-api-key-and-ci-config-id) value. **NOTE** It is beyond the scope of this guide to provide guidance on generating the ``, as it heavily depends on your specific use case. However, ensure that the generated file adheres to the required schema outlined above. Finally, store [your Datafold API Key](#4-obtain-a-datafold-api-key-and-ci-config-id) as a secret named `DATAFOLD_API_KEY` [in your GitHub repository settings](https://docs.github.com/en/actions/security-guides/encrypted-secrets#creating-encrypted-secrets-for-a-repository). Once you've completed these steps, Datafold will run data diffs between production and development data on the next GitHub Actions CI run. ```Bash theme={null} version: 2.1 jobs: artifacts-job: filters: branches: only: main # or master, or the name of your default branch docker: - image: cimg/python:3.9 # your image will vary env: # env variables used by Datafold SDK internally DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }} DATAFOLD_HOST: ${DATAFOLD_HOST} # For Dedicated Cloud/private deployments of Datafold, # Set the "https://custom.url.datafold.com" variable as the base URL as an environment variable, either as a string or a project variable, per https://circleci.com/docs/set-environment-variable/ # There are multiple ways to get the PR_NUM, this is just a simple example PR_NUM: ${{ github.event.number }} steps: - checkout - run: name: "Install Datafold SDK" command: pip install -q datafold-sdk - run: name: "Upload what to diff to Datafold" command: datafold ci submit --ci-config-id --pr-num ${CIRCLE_PULL_REQUEST} --diffs ``` Be sure to replace `` with the [CI config ID](#4-obtain-a-datafold-api-key-and-ci-config-id) value. **NOTE** It is beyond the scope of this guide to provide guidance on generating the ``, as it heavily depends on your specific use case. However, ensure that the generated file adheres to the required schema outlined above. Then, enable [**Only build pull requests**](https://circleci.com/docs/oss#only-build-pull-requests) in CircleCI. This ensures that CI runs on pull requests and production, but not on pushes to other branches. Finally, store [your Datafold API Key](#4-obtain-a-datafold-api-key-and-ci-config-id) as a secret named `DATAFOLD_API_KEY` [your CircleCI project settings.](https://docs.github.com/en/actions/security-guides/encrypted-secrets#creating-encrypted-secrets-for-a-repository). 
Once you've completed these steps, Datafold will run data diffs between production and development data on the next CircleCI run.

```Bash theme={null}
image:
  name: ghcr.io/dbt-labs/dbt-core:1.x # your name will vary
  entrypoint: [ "" ]

variables:
  # env variables used by Datafold SDK internally
  DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }}
  DATAFOLD_HOST: ${DATAFOLD_HOST}
  # For Dedicated Cloud/private deployments of Datafold,
  # set the "https://custom.url.datafold.com" variable as the base URL as an environment variable, either as a string or a project variable
  # There are multiple ways to get the PR_NUM, this is just a simple example
  PR_NUM: ${{ github.event.number }}

run_pipeline:
  stage: test
  before_script:
    - pip install -q datafold-sdk
  script:
    # Upload what to diff to Datafold
    - datafold ci submit --ci-config-id --pr-num $CI_MERGE_REQUEST_ID --diffs
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event"'
```

Be sure to replace `` with the [CI config ID](#4-obtain-a-datafold-api-key-and-ci-config-id) value.

**NOTE** It is beyond the scope of this guide to provide guidance on generating the ``, as it heavily depends on your specific use case. However, ensure that the generated file adheres to the required schema outlined above.

Finally, store [your Datafold API Key](#4-obtain-a-datafold-api-key-and-ci-config-id) as a secret named `DATAFOLD_API_KEY` [in your GitLab project's settings](https://docs.gitlab.com/ee/ci/variables/).

Once you've completed these steps, Datafold will run data diffs between production and development data on the next GitLab CI run.

## Optional CI Configurations and Strategies

### Skip Datafold in CI

To skip the Datafold step in CI, include the string `datafold-skip-ci` in the last commit message.

---

# Source: https://docs.datafold.com/integrations/databases/athena.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.datafold.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Athena

**Steps to complete:**

1. [Create an S3 bucket](/integrations/databases/athena#create-s3-bucket)
2. [Run SQL Script for permissions](/integrations/databases/athena#run-sql-script)
3. [Configure your data connection in Datafold](/integrations/databases/athena#configure-in-datafold)

### Create an S3 bucket

If you don't already have an S3 bucket for your cluster, you'll need to create one. Datafold uses this bucket to create temporary tables and store data in it. You can learn how to create an S3 bucket in AWS by referring to the [AWS documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html).

### Run SQL Script and Create Schema for Datafold

To connect to AWS Athena, you must generate an `AWS Access Key ID` and an `AWS Secret Access Key`. These keys provide read-only access to all tables in all schemas and write access to the Datafold-specific schema for temporary tables. If you don't have these keys yet, follow the steps outlined in the [AWS documentation](https://docs.aws.amazon.com/IAM/latest/UserGuide/id%5Fcredentials%5Faccess-keys.html).

Datafold utilizes a temporary dataset to materialize scratch work and keep data processing within your warehouse.

```
/* Datafold utilizes a temporary dataset to materialize scratch work and keep data processing within your data warehouse.
*/
CREATE SCHEMA IF NOT EXISTS awsdatacatalog.datafold_tmp;
```

### Configure in Datafold

| Field Name | Description |
| --------------------------- | ------------------------------------------------------------------------------ |
| AWS Access Key ID | Your AWS Access Key, which can be found in your AWS Account. |
| AWS Secret Access Key | The AWS Secret Key (generate it in your AWS account if you don't have it yet). |
| S3 Staging Directory | The S3 bucket where table data is stored. |
| AWS Region | The region of your Athena cluster. |
| Catalog | The catalog, which is typically awsdatacatalog by default. |
| Database | The database or schema with tables; this is typically `default`. |
| Schema for Temporary Tables | The schema (`datafold_tmp`) created in our SQL script. |

Click **Create** to complete the setup of your data connection in Datafold.

---

# Source: https://docs.datafold.com/datafold-deployment/dedicated-cloud/aws.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.datafold.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Datafold VPC Deployment on AWS

> Learn how to deploy Datafold in a Virtual Private Cloud (VPC) on AWS.

**INFO** VPC deployments are an Enterprise feature. Please email [sales@datafold.com](mailto:sales@datafold.com) to enable your account.

## Create a Domain Name (optional)

You can either choose to use your domain (for example, `datafold.domain.tld`) or to use a Datafold managed domain (for example, `yourcompany.dedicated.datafold.com`).

### Customer Managed Domain Name

Create a DNS A-record for the domain where Datafold will be hosted. For the DNS record, there are two options:

* **Public-facing:** When the domain is publicly available, we will provide an SSL certificate for the endpoint.
* **Internal:** It is also possible to have Datafold disconnected from the internet. This would require an internal DNS (for example, AWS Route 53) record that points to the Datafold instance. It is possible to provide your own certificate for setting up the SSL connection.

Once the deployment is complete, you will point that A-record to the IP address of the Datafold service.

## Give Datafold Access to AWS

To set up Datafold, you need to create a separate account within your organization where we can deploy Datafold. We're following the [best practices of AWS to allow third-party access](https://docs.aws.amazon.com/IAM/latest/UserGuide/id%5Froles%5Fcommon-scenarios%5Fthird-party.html).

### Create a separate AWS account for Datafold

First, create a new account for Datafold. Go to **My Organization** to add an account:

Click **Add an AWS Account**:

You can name this account anything that helps identify it clearly. In our examples, we name it **Datafold**. Make sure that the email address of the owner isn't used by another account.

When you click the **Create AWS Account** button, you'll be returned to the organization screen and see a notification that the new account is being created. After refreshing a few minutes later, the account should appear in the organization's list of accounts.

### Grant Third-Party access to Datafold

To make sure that the deployment runs as expected, your Datafold Support Engineer may need access to the Datafold-specific AWS account that you created. The access can be revoked after the deployment if needed.

To grant access, log into the account created in the previous step.
You can switch to the newly created account using the [Switch Role page](https://signin.aws.amazon.com/switchrole):

By default, the role name is **OrganizationAccountAccessRole**. Click **Switch Role** to log in to the Datafold account.

## Grant Access to Datafold

Next, we need to allow Datafold to access the account. We do this by allowing the Datafold AWS account to access your AWS workspace.

Go to the [IAM page](https://console.aws.amazon.com/iam/home) or type **IAM** in the search bar:

Go to the Roles page, and click the **Create Role** button:

Select **Another AWS Account**, and use account ID `710753145501`, which is Datafold's account ID. Select **Require MFA** and click **Next: Permissions**.

On the Permissions page, attach the **AdministratorAccess** policy so that Datafold has control over the resources within the account, or see [Minimal IAM Permissions](#minimal-iam-permissions).

Next, you can set **Tags**; however, they are not a requirement.

Finally, give the role a name of your choice. Be careful not to duplicate the account name. If you named the account in an earlier step `Datafold`, you may want to name the role `Datafold-role`. Click **Create Role** to complete this step.

Now that the role is created, you should be routed back to a list of roles in your organization. Click on your newly created role to get a shareable link for the account and store this in your password manager. When setting up your deployment with a support engineer, Datafold will use this link to gain access to the account.

After validating the deployment with your support engineer, and making sure that everything works as it should, we will let you know when it's clear to revoke the credentials.

### Minimal IAM Permissions

Because we work in an account dedicated to Datafold, there is no direct access to your resources unless explicitly configured (e.g., VPC Peering). The following IAM policy is required to update and maintain the infrastructure.
```JSON theme={null} { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "acm:AddTagsToCertificate", "acm:DeleteCertificate", "acm:DescribeCertificate", "acm:GetCertificate", "acm:ListCertificates", "acm:ListTagsForCertificate", "acm:RemoveTagsFromCertificate", "acm:RequestCertificate", "acm:UpdateCertificateOptions", "apigateway:DELETE", "apigateway:GET", "apigateway:PATCH", "apigateway:POST", "apigateway:PUT", "apigateway:UpdateRestApiPolicy", "autoscaling:*", "ec2:*", "eks:*", "elasticloadbalancing:*", "iam:GetPolicy", "iam:GetPolicyVersion", "iam:GetOpenIDConnectProvider", "iam:GetRole", "iam:GetRolePolicy", "iam:GetUserPolicy", "iam:GetUser", "iam:ListAccessKeys", "iam:ListAttachedRolePolicies", "iam:ListGroupsForUser", "iam:ListInstanceProfilesForRole", "iam:ListPolicies", "iam:ListPolicyVersions", "iam:ListRolePolicies", "iam:PassRole", "iam:TagOpenIDConnectProvider", "iam:TagPolicy", "iam:TagRole", "iam:TagUser", "kms:CreateAlias", "kms:CreateGrant", "kms:CreateKey", "kms:Decrypt", "kms:DeleteAlias", "kms:DescribeKey", "kms:DisableKey", "kms:EnableKeyRotation", "kms:GenerateDataKey", "kms:GetKeyPolicy", "kms:GetKeyRotationStatus", "kms:ListAliases", "kms:ListResourceTags", "kms:PutKeyPolicy", "kms:RevokeGrant", "kms:ScheduleKeyDeletion", "kms:TagResource", "logs:CreateLogGroup", "logs:DeleteLogGroup", "logs:DescribeLogGroups", "logs:ListTagsLogGroup", "logs:ListTagsForResource", "logs:PutRetentionPolicy", "logs:TagResource", "rds:*", "ssm:GetParameter", "secretsmanager:CreateSecret", "secretsmanager:DeleteSecret", "secretsmanager:DescribeSecret", "secretsmanager:GetResourcePolicy", "secretsmanager:PutSecretValue", "secretsmanager:TagResource", "s3:*" ], "Resource": "*" } ] } ``` Some policies we need from time to time. For example, when we do the first deployment. Since those are IAM-related, we will ask for temporary permissions when required. ```JSON theme={null} { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "iam:AttachRolePolicy", "iam:CreateAccessKey", "iam:CreateOpenIDConnectProvider", "iam:CreatePolicy", "iam:CreateRole", "iam:CreateUser", "iam:DeleteAccessKey", "iam:DeleteOpenIDConnectProvider", "iam:DeletePolicy", "iam:DeleteRole", "iam:DeleteRolePolicy", "iam:DeleteUser", "iam:DeleteUserPolicy", "iam:DetachRolePolicy", "iam:PutRolePolicy", "iam:PutUserPolicy" ], "Resource": "*" } ] } ``` It is easier to allow `PowerUserAccess` and then selectively add iam permissions given above. PowerUserAccess has explicit denies for `account:*`, `organization:*` and `iam:*.` # Datafold AWS infrastructure details This document provides detailed information about the AWS infrastructure components deployed by the Datafold Terraform module, explaining the architectural decisions and operational considerations for each component. ## EBS volumes The Datafold application requires 3 volumes for persistent storage, each deployed as encrypted Elastic Block Store (EBS) volumes in the primary availability zone. This also means that pods cannot be deployed outside the availability zone of these volumes, because the nodes wouldn't be able to attach them. **ClickHouse data volume** serves as the analytical database storage for Datafold. ClickHouse is a columnar database that excels at analytical queries. The default 40GB allocation usually provides sufficient space for typical deployments, but it can be scaled up based on data volume requirements. The GP3 volume type with 3000 IOPS ensures consistent performance for analytical workloads. 
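For reference, the volume settings described above correspond to a standard encrypted GP3 EBS volume. The Terraform module provisions these volumes for you; the command below is only an illustrative sketch of the equivalent configuration, where the availability zone and `Name` tag are placeholders rather than values used by the module.

```Bash theme={null}
# Illustrative only: the Datafold Terraform module creates this volume automatically.
# The availability zone and Name tag below are placeholders.
aws ec2 create-volume \
  --availability-zone us-east-1a \
  --volume-type gp3 \
  --size 40 \
  --iops 3000 \
  --encrypted \
  --tag-specifications 'ResourceType=volume,Tags=[{Key=Name,Value=clickhouse-data}]'
```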
**ClickHouse Logs Volume** stores ClickHouse's internal logs and temporary data. The separate logs volume prevents log data from consuming IOPS and I/O performance from actual data storage. **Redis Data Volume** provides persistent storage for Redis, which handles task distribution and distributed locks in the Datafold application. Redis is memory-first but benefits from persistence for data durability across restarts. The 50GB default size accommodates typical caching needs while remaining cost-effective. All EBS volumes are encrypted using AWS KMS, managed by AWS, ensuring data security at rest. The volumes are deployed in the first availability zone to minimize latency and simplify backup strategies. ## Load balancer The load balancer serves as the primary entry point for all external traffic to the Datafold application. The module offers 2 deployment strategies, each with different operational characteristics and trade-offs. **External Load Balancer Deployment** (the default approach) creates an AWS Application Load Balancer through Terraform. This approach provides centralized control over load balancer configuration and integrates well with existing AWS infrastructure. The load balancer automatically handles SSL termination, health checks, and traffic distribution across Kubernetes pods. This method is ideal for organizations that prefer infrastructure-as-code management and want consistent load balancer configurations across environments. **Kubernetes-Managed Load Balancer** deployment sets `deploy_lb = false` and relies on the AWS Load Balancer Controller running within the EKS cluster. This approach leverages Kubernetes-native load balancer management, allowing for dynamic scaling and easier integration with Kubernetes ingress resources. The controller automatically provisions and manages load balancers based on Kubernetes service definitions, which can be more flexible for applications that need to scale load balancer resources dynamically. Both load balancers apply the currently recommended and strictest ELB security policies: `ELBSecurityPolicy-TLS13-1-2-Res-2021-06` and security settings. The choice between these approaches often depends on operational preferences and existing infrastructure patterns. External deployment provides more predictable resource management, while Kubernetes-managed deployment offers greater flexibility for dynamic workloads. **Security** A security group shared between the load balancer and the EKS nodes allows traffic to reach only the EKS nodes and nothing else. The load balancer allows traffic to land directly into the EKS private subnet. **Certificate** The certificate can be pre-created by the customer and then attached, or a cloud-managed certificate can be created on the fly. The application will not function without HTTPS, so a certificate is mandatory. After the certificate is created either manually or through this repository, it must be validated by the DNS administrator by adding a CNAME record. This puts the certificate in "Issued" state. The certificate cannot be found when it's still provisioning. ## EKS cluster The Elastic Kubernetes Service (EKS) cluster forms the compute foundation for the Datafold application, providing a managed Kubernetes environment optimized for AWS infrastructure. **Network Architecture** The entire cluster is deployed into private subnets. This means the data plane is not reachable from the Internet except through the load balancer. 
A NAT gateway allows the cluster to reach the internet (egress traffic) for downloading pod images, optionally sending Datadog logs and metrics, and retrieving the version to apply to the cluster from our portal. The control plane is accessible via a private endpoint using a PrivateLink setup from, for example, a VPN VPC elsewhere. This is a private+public endpoint, so the control plane can also be made accessible through the Internet, but then the appropriate CIDR restrictions should be put in place. For a typical dedicated cloud deployment of Datafold, only around 100 IPs are needed. This assumes 3 r7a.2xlarge instances where one node runs ClickHouse+Redis, another node runs the application, and a third node may be put in place when version rollovers occur. This means a subnet of size /24 (253 IPs) should be sufficient to run this application. By default, the repository creates a VPC and subnets, but by specifying the VPC ID of an already existing VPC, the cluster and load balancer get deployed into existing network infrastructure. This is important for some customers where they deploy a different architecture without NAT gateways, firewall options that check egress, and other DLP controls. **Add-ons** The cluster includes essential add-ons like CoreDNS for service discovery, the VPC CNI for networking, and the EBS CSI driver for persistent volume management. These components are automatically updated and maintained by AWS, reducing operational overhead. The AWS load balancer controller and metrics-server are deployed separately via Helm charts in the application deployment, not through this Terraform infrastructure. The Load Balancer Controller manages at least the AWS target group that enables ingress for the Datafold application. Optionally, it may also manage the entire external load balancer. **Node Management** supports up to three managed node groups, allowing for workload-specific resource allocation. Each node group can be configured with different instance types, enabling cost optimization and performance tuning for different application components. The cluster autoscaler automatically adjusts node count based on resource demands, ensuring efficient resource utilization while maintaining application availability. One typical way to deploy is to let the application pods go on a wider range of nodes, and set up tolerations and labels on the second node group, which are then selected by both Redis and ClickHouse. This is because Redis and ClickHouse have restrictions on the zone they must be present in because of their volumes, and ClickHouse is a bit more CPU intensive. This method optimizes CPU performance for the Datafold application. **Security Features** include IAM Roles for Service Accounts (IRSA), which provide fine-grained IAM permissions to Kubernetes pods without requiring AWS credentials in container images. This approach enhances security by following the principle of least privilege and integrates seamlessly with AWS security services. ## IAM Roles and Permissions The IAM architecture follows the principle of least privilege, providing specific permissions only where needed. Service accounts in Kubernetes are mapped to IAM roles using IRSA, enabling secure access to AWS services without embedding credentials in application code. **EBS CSI Controller Role** enables the Kubernetes cluster to manage EBS volumes dynamically. This role allows pods to request persistent storage that's automatically provisioned and attached to the appropriate nodes or attach static volumes. 
The permissions are scoped to only the EBS operations needed for volume lifecycle management. **Load Balancer Controller Role** provides the permissions necessary for Kubernetes to manage AWS load balancers. This includes creating target groups, registering and deregistering targets, and managing load balancer listeners. The controller can automatically provision load balancers based on Kubernetes service definitions, enabling seamless integration between Kubernetes and AWS networking. **Cluster Autoscaler Role** allows the cluster to automatically scale node groups based on resource demands. This role can describe and modify Auto Scaling groups, enabling the cluster to add or remove nodes as needed. The autoscaler considers pod resource requests and node capacity when making scaling decisions. **Datafold Roles** Datafold has roles per pod pre-defined which can have their permissions assigned when they need them. At the moment, we have two specific roles in use. One is for the ClickHouse pod to be able to make backups and store them on S3. The other is for the use of the Bedrock service for our AI offering. These roles are automatically created and configured when the cluster is deployed, ensuring that the necessary permissions are in place for the cluster to function properly. The use of IRSA means that these permissions are automatically rotated and managed by AWS, reducing security risks associated with long-lived credentials. ## RDS database The PostgreSQL Relational Database Service (RDS) instance serves as the primary relational database for the Datafold application, storing user data, configuration, and application state. **Storage Configuration** starts with a 20GB initial allocation that can automatically scale up to 100GB based on usage patterns. This auto-scaling feature prevents storage-related outages while avoiding over-provisioning. For typical deployments, storage usage remains under 200GB, though some high-volume deployments may approach 400GB. The GP3 storage type provides consistent performance with configurable IOPS and throughput. **High Availability** is intentionally disabled by default, meaning the database runs in a single availability zone. This configuration reduces costs and complexity while still providing excellent reliability. The database includes automated backups with 14-day retention, ensuring data can be recovered in case of failures. For organizations requiring higher availability, multi-AZ deployment can be enabled, though this significantly increases costs. **Security and Encryption** always encrypts data at rest using AWS KMS. A dedicated KMS key is created for the database, providing better security isolation and audit capabilities compared to using the default AWS RDS key. The database is deployed in private subnets with security groups that restrict access to only the EKS cluster, ensuring network-level security. The database configuration prioritizes operational simplicity and cost-effectiveness while maintaining the security and reliability required for production workloads. The combination of automated backups, encryption, and network isolation provides a robust foundation for the application's data storage needs. --- # Source: https://docs.datafold.com/integrations/code-repositories/azure-devops.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Azure DevOps ## 1. 
Issue an Access Token

To get your [repository access token](https://learn.microsoft.com/en-us/azure/devops/organizations/accounts/use-personal-access-tokens-to-authenticate?view=azure-devops\&tabs=Windows#create-a-pat), navigate to your Azure DevOps settings and create a new token. When configuring your token, enable the following permissions:

* **Code** -> **Read & write**
* **Identity** -> **Read**

We need write access to the repository to post reports with Data Diff results to pull requests, and read access to identities to be able to properly display Azure DevOps users in the Datafold UI.

## 2. Configure integration in Datafold

Navigate back to Datafold and fill in the configuration form.

* **Personal/project Access Token**: the token you created in step 1.
* **Organization**: your Azure DevOps organization name.
* **Project**: your Azure DevOps project name.
* **Repository**: your Azure DevOps repository name.

For example, if your Azure DevOps repository URL is `https://dev.azure.com/datafold/analytics/_git/dbt`:

* Your **Organization** is `datafold`
* Your **Project** is `analytics`
* Your **Repository** is `dbt`

---

# Source: https://docs.datafold.com/datafold-deployment/dedicated-cloud/azure.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.datafold.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Datafold VPC Deployment on Azure

> Learn how to deploy Datafold in a Virtual Private Cloud (VPC) on Azure.

**INFO** VPC deployments are an Enterprise feature. Please email [sales@datafold.com](mailto:sales@datafold.com) to enable your account.

## Create a Domain Name (optional)

You can either choose to use your domain (for example, `datafold.domain.tld`) or to use a Datafold managed domain (for example, `yourcompany.dedicated.datafold.com`).

### Customer Managed Domain Name

Create a DNS A-record for the domain where Datafold will be hosted. For the DNS record, there are two options:

* **Public-facing:** When the domain is publicly available, we will provide an SSL certificate for the endpoint.
* **Internal:** It is also possible to have Datafold disconnected from the internet. This would require an internal DNS (for example, Azure DNS) record that points to the Datafold instance. It is possible to provide your own certificate for setting up the SSL connection.

Once the deployment is complete, you will point that A-record to the IP address of the Datafold service.

## Create a New Subscription

For isolation reasons, it is best practice to [create a new subscription](https://learn.microsoft.com/en-us/azure/cost-management-billing/manage/create-subscription) within your Microsoft Entra directory/tenant. Please call it something like `yourcompany-datafold` to make it easy to identify.

## Set IAM Permissions

Go to **Microsoft Entra ID** and navigate to **Users**. Click **Add**, **User**, **Invite external user** and add the Datafold engineers.

* Navigate to the subscription you just created and go to the **Access control (IAM)** tab in the sidebar. Under **Add**, select **Add role assignment**.
* Under **Role**, navigate to **Privileged administrator roles** and select **Owner**.
* Under **Members**, click **Select members** and add the Datafold engineers.
* When you are done, select **Review + assign**.

The Owner role is only required temporarily while we configure and test the initial Datafold deployment.
We'll inform you when it is ok to revoke this permission. ### Required APIs The following Azure APIs need to be enabled to run Datafold: 1. [Microsoft.ContainerService](https://portal.azure.com/#view/Microsoft_Azure_Marketplace/GalleryFeaturedMenuItemBlade/selectedMenuItemId/home/searchQuery/Container%20Service) 2. [Microsoft.Network](https://portal.azure.com/#view/Microsoft_Azure_Marketplace/GalleryFeaturedMenuItemBlade/selectedMenuItemId/home/searchQuery/Network) 3. [Microsoft.Compute](https://portal.azure.com/#view/Microsoft_Azure_Marketplace/GalleryFeaturedMenuItemBlade/selectedMenuItemId/home/searchQuery/Compute) 4. [Microsoft.KeyVault](https://portal.azure.com/#view/Microsoft_Azure_Marketplace/GalleryFeaturedMenuItemBlade/selectedMenuItemId/home/searchQuery/Key%20Vault) 5. [Microsoft.Storage](https://portal.azure.com/#view/Microsoft_Azure_Marketplace/GalleryFeaturedMenuItemBlade/selectedMenuItemId/home/searchQuery/Storage) 6. [Microsoft.DBforPostgreSQL](https://portal.azure.com/#view/Microsoft_Azure_Marketplace/GalleryFeaturedMenuItemBlade/selectedMenuItemId/home/searchQuery/PostgreSQL) Once the access has been granted, make sure to notify Datafold so we can initiate the deployment. # Datafold Azure infrastructure details This document provides detailed information about the Azure infrastructure components deployed by the Datafold Terraform module, explaining the architectural decisions and operational considerations for each component. ## Managed disks The Datafold application requires 3 managed disks for persistent storage, each deployed as encrypted Azure managed disks in the primary availability zone. This also means that pods cannot be deployed outside the availability zone of these disks, because the nodes wouldn't be able to attach them. **ClickHouse data disk** serves as the analytical database storage for Datafold. ClickHouse is a columnar database that excels at analytical queries. The default 40GB allocation usually provides sufficient space for typical deployments, but it can be scaled up based on data volume requirements. The StandardSSD\_LRS disk type with configurable IOPS and throughput ensures consistent performance for analytical workloads. **ClickHouse logs disk** stores ClickHouse's internal logs and temporary data. The separate logs disk prevents log data from consuming IOPS and I/O performance from actual data storage. **Redis data disk** provides persistent storage for Redis, which handles task distribution and distributed locks in the Datafold application. Redis is memory-first but benefits from persistence for data durability across restarts. The 50GB default size accommodates typical caching needs while remaining cost-effective. All managed disks are encrypted by default using Azure-managed encryption keys, ensuring data security at rest. The disks are deployed in the first availability zone to minimize latency and simplify backup strategies. For Premium and Ultra SSD disk types, IOPS and throughput can be configured to optimize performance for specific workloads. ## Application Gateway The Application Gateway serves as the primary entry point for all external traffic to the Datafold application. The module offers 2 deployment strategies, each with different operational characteristics and trade-offs. **External Application Gateway Deployment** (the default approach) creates an Azure Application Gateway through Terraform. This approach provides centralized control over load balancer configuration and integrates well with existing Azure infrastructure. 
The Application Gateway automatically handles SSL termination, health checks, and traffic distribution across Kubernetes pods. This method is ideal for organizations that prefer infrastructure-as-code management and want consistent load balancer configurations across environments. **Kubernetes-Managed Application Gateway** deployment sets `deploy_lb = false` and relies on the Azure Application Gateway Ingress Controller (AGIC) running within the AKS cluster. This approach leverages Kubernetes-native load balancer management, allowing for dynamic scaling and easier integration with Kubernetes ingress resources. The controller automatically provisions and manages Application Gateways based on Kubernetes service definitions, which can be more flexible for applications that need to scale load balancer resources dynamically. Both Application Gateways apply the currently recommended and strictest SSL policies: `AppGwSslPolicy20220101S` and security settings. The choice between these approaches often depends on operational preferences and existing infrastructure patterns. External deployment provides more predictable resource management, while Kubernetes-managed deployment offers greater flexibility for dynamic workloads. **Security** A network security group shared between the Application Gateway and the AKS nodes allows traffic to reach only the AKS nodes and nothing else. The Application Gateway allows traffic to land directly into the AKS private subnet. **Certificate** The certificate can be pre-created by the customer and then attached, or a cloud-managed certificate can be created on the fly. The application will not function without HTTPS, so a certificate is mandatory. After the certificate is created either manually or through this repository, it must be validated by the DNS administrator by adding a CNAME record. This puts the certificate in "Issued" state. The certificate cannot be found when it's still provisioning. ## AKS cluster The Azure Kubernetes Service (AKS) cluster forms the compute foundation for the Datafold application, providing a managed Kubernetes environment optimized for Azure infrastructure. **Network Architecture** The entire cluster is deployed into private subnets. This means the data plane is not reachable from the Internet except through the Application Gateway. A NAT gateway allows the cluster to reach the internet (egress traffic) for downloading pod images, optionally sending Datadog logs and metrics, and retrieving the version to apply to the cluster from our portal. The control plane is accessible via a private endpoint using a Private Link setup from, for example, a VPN VNet elsewhere. This is a private+public endpoint, so the control plane can also be made accessible through the Internet, but then the appropriate CIDR restrictions should be put in place. For a typical dedicated cloud deployment of Datafold, only around 100 IPs are needed. This assumes 3 Standard\_DS2\_v2 instances where one node runs ClickHouse+Redis, another node runs the application, and a third node may be put in place when version rollovers occur. This means a subnet of size /24 (253 IPs) should be sufficient to run this application. By default, the repository creates a VNet and subnets, but by specifying the VNet ID of an already existing VNet, the cluster and Application Gateway get deployed into existing network infrastructure. 
This is important for some customers where they deploy a different architecture without NAT gateways, firewall options that check egress, and other DLP controls. **Add-ons** The cluster includes several essential add-ons configured through Terraform: **Workload Identity** is enabled to provide fine-grained IAM permissions to Kubernetes pods without requiring Azure credentials in container images. This is essential for ClickHouse to access Azure Storage for backups and other services. **Ingress Application Gateway** is integrated with the cluster to handle external traffic routing and SSL termination. The Application Gateway Ingress Controller (AGIC) manages the Application Gateway configuration based on Kubernetes ingress resources. **Storage Profile** includes the Azure Disk CSI driver for persistent volume management, file driver for Azure Files, and snapshot controller for volume snapshots. These components enable dynamic provisioning and management of Azure storage resources. **Node Management** supports up to three managed node pools, allowing for workload-specific resource allocation. Each node pool can be configured with different VM sizes, enabling cost optimization and performance tuning for different application components. The cluster autoscaler automatically adjusts node count based on resource demands, ensuring efficient resource utilization while maintaining application availability. One typical way to deploy is to let the application pods go on a wider range of nodes, and set up tolerations and labels on the second node pool, which are then selected by both Redis and ClickHouse. This is because Redis and ClickHouse have restrictions on the zone they must be present in because of their disks, and ClickHouse is a bit more CPU intensive. This method optimizes CPU performance for the Datafold application. **Security Features** include Azure Workload Identity, which provides fine-grained IAM permissions to Kubernetes pods without requiring Azure credentials in container images. This approach enhances security by following the principle of least privilege and integrates seamlessly with Azure security services. The cluster also supports private clusters with restricted control plane access and network policies for pod-to-pod communication control. ## IAM Roles and Permissions The IAM architecture follows the principle of least privilege, providing specific permissions only where needed. Service accounts in Kubernetes are mapped to IAM roles using Azure Workload Identity, enabling secure access to Azure services without embedding credentials in application code. **Azure Disk CSI Controller Role** enables the Kubernetes cluster to manage Azure managed disks dynamically. This role allows pods to request persistent storage that's automatically provisioned and attached to the appropriate nodes or attach static disks. The permissions are scoped to only the Azure Disk operations needed for disk lifecycle management. **Application Gateway Ingress Controller Role** provides the permissions necessary for Kubernetes to manage Azure Application Gateways. This includes creating backend address pools, registering and deregistering targets, and managing Application Gateway listeners. The controller can automatically provision Application Gateways based on Kubernetes service definitions, enabling seamless integration between Kubernetes and Azure networking. **Cluster Autoscaler Role** allows the cluster to automatically scale node pools based on resource demands. 
This role can describe and modify Virtual Machine Scale Sets, enabling the cluster to add or remove nodes as needed. The autoscaler considers pod resource requests and node capacity when making scaling decisions. **Datafold Roles** Datafold has roles per pod pre-defined which can have their permissions assigned when they need them. At the moment, we have two specific roles in use. One is for the ClickHouse pod to be able to make backups and store them on Azure Storage. The other is for the use of the Azure OpenAI service for our AI offering. These roles are automatically created and configured when the cluster is deployed, ensuring that the necessary permissions are in place for the cluster to function properly. The use of Azure Workload Identity means that these permissions are automatically rotated and managed by Azure, reducing security risks associated with long-lived credentials. ## Azure Database for PostgreSQL The Azure Database for PostgreSQL Flexible Server instance serves as the primary relational database for the Datafold application, storing user data, configuration, and application state. **Storage Configuration** starts with a 32GB initial allocation that can automatically scale up to 100GB based on usage patterns. This auto-scaling feature prevents storage-related outages while avoiding over-provisioning. For typical deployments, storage usage remains under 200GB, though some high-volume deployments may approach 400GB. The GP\_Standard storage type provides consistent performance with configurable IOPS and throughput. **High Availability** is intentionally disabled by default, meaning the database runs in a single availability zone. This configuration reduces costs and complexity while still providing excellent reliability. The database includes automated backups with 7-day retention, ensuring data can be recovered in case of failures. For organizations requiring higher availability, multi-zone deployment can be enabled, though this significantly increases costs. **Security and Encryption** always encrypts data at rest using Azure-managed encryption keys. The database is deployed in private subnets with network security groups that restrict access to only the AKS cluster, ensuring network-level security. The database supports Azure Private Link for secure, private connectivity from the VNet. The database configuration prioritizes operational simplicity and cost-effectiveness while maintaining the security and reliability required for production workloads. The combination of automated backups, encryption, and network isolation provides a robust foundation for the application's data storage needs. --- # Source: https://docs.datafold.com/data-diff/in-database-diffing/best-practices.md # Source: https://docs.datafold.com/data-diff/cross-database-diffing/best-practices.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Best Practices > When dealing with large datasets, it's crucial to approach diffing with specific optimization strategies in mind. We share best practices that will help you get the most accurate and efficient results from your data diffs. ## Enable sampling [Sampling](/data-diff/cross-database-diffing/creating-a-new-data-diff#row-sampling) can be helpful when diffing between extremely large datasets as it can result in a speedup of 2x to 20x or more. 
The extent of the speedup depends on various factors, including the scale of the data, instance sizes, and the number of data columns. The following table illustrates the speedup achieved with sampling in different databases, varying instance sizes, and different numbers of data columns: | Databases | vCPU | RAM, GB | Rows | Columns | Time full | Time sampled | Speedup | RDS type | Diff full | Diff sampled | Per-col noise | | :-----------------: | :--: | :-----: | :-------: | :-----: | :-------: | :----------: | :-----: | :-----------: | :-------: | :----------: | :-----------: | | Oracle vs Snowflake | 2 | 2 | 1,000,000 | 1 | 0:00:33 | 0:00:27 | 1.22 | db.t3.small | 5399 | 5400 | 0 | | Oracle vs Snowflake | 8 | 32 | 1,000,000 | 1 | 0:07:23 | 0:00:18 | 24.61 | db.m5.2xlarge | 5422 | 5423 | 0.005 | | MySQL vs Snowflake | 2 | 8 | 1,000,000 | 1 | 0:00:57 | 0:00:24 | 2.38 | db.m5.large | 5409 | 5413 | 0 | | MySQL vs Snowflake | 2 | 8 | 1,000,000 | 29 | 0:40:00 | 0:02:14 | 17.91 | db.m5.large | 5412 | 5411 | 0 | When sampling is enabled, Datafold compares a randomly chosen subset of the data. Sampling is the tradeoff between the diff detail and time/cost of the diffing process. For most use cases, sampling does not reduce the informational value of data diffs as it still provides the magnitude and specific examples of differences (e.g., if 10% of sampled data show discrepancies, it suggests a similar proportion of differences across the entire dataset). Although configuring sampling can seem overwhelming at first, a good rule of thumb is to select an initial value of 95% for the sampling confidence and adjust it as needed. Tweaking the parameters can be helpful to see how they impact the sample size and the tradeoff between performance and accuracy. ## Handling data type differences Datafold automatically manages data type differences during cross-database diffing. For example, when comparing decimals with different precisions (e.g., `DECIMAL(38,15)` in SQL Server and `DECIMAL(38,19)` in Snowflake), Datafold automatically casts values to a common precision before comparison, flagging any differences appropriately. Similarly, for timestamps with different precisions (e.g., milliseconds in SQL Server and nanoseconds in Snowflake), Datafold adjusts the precision as needed for accurate comparisons, simplifying the diffing process. ## Optimizing OLTP databases: indexing best practices When working with row-oriented transactional databases like PostgreSQL, optimizing the database structure is crucial for efficient data diffing, especially for large tables. Here are some best practices to consider: * **Create indexes on key columns**: * It's essential to create indexes on the columns that will be compared, particularly the primary key columns defined in the data diffs. * **Example**: If your data diff involves primary key columns `colA` and `colB`, ensure that indexes are created for these specific columns. * **Use separate indexes for primary key columns:** * Indexes for primary key columns should be distinct and start with these columns, not as subsets of other indexes. Having a dedicated primary key index is critical for efficient diffing. * **Example**: Consider a primary key consisting of `colA` and `colB`. Ensure that the index is structured in the same order, like (`colA`, `colB`), to align with the primary key. An index with an order of (`colB`, `colA`) is strongly discouraged due to the impact on performance. 
  * **Example**: If the index is defined as (`colA`, `colB`, `colC`) and the primary key is a combination of `colA` and `colB`, then when setting up the diff operation, ensure that the primary key is specified as `colA`, `colB`. If the order is reversed as `colB`, `colA`, the diffing process won’t be able to fully utilize indexing, potentially leading to slower performance.
* **Leverage compound indexes**:
  * Compound indexes, which involve multiple columns, can significantly improve query performance during data diffs as they efficiently handle complex queries and filtering.
  * **Example**: An index defined as (`colA`, `colB`, `colC`) can be beneficial for diffing operations involving these columns, as it aligns with the order of columns in the primary key.

## Handling high percentage of differences

Data diff is optimized to perform best when the percentage of different rows/values is relatively low, which supports common data validation scenarios like data replication and migration. While the tool strives to maximize the database's computational power and minimize data transfer, in extreme cases with very high difference percentages (up to 100%), diffing may end up transferring every row over the network, which is considerably slower. To avoid long-running diffs, we recommend the following:

* **Start with diffing [primary keys](/data-diff/cross-database-diffing/creating-a-new-data-diff#primary-key)** only to identify row-level completeness first, before diffing all or more columns.
* **Set an [egress](/data-diff/cross-database-diffing/creating-a-new-data-diff#primary-key) limit** to automatically stop the diffing process after a set number of rows are downloaded over the network.
* **Set a [per-column diff](/data-diff/cross-database-diffing/creating-a-new-data-diff#primary-key) limit** to stop finding differences for each column after a set number are found. This is especially useful in data reconciliation, where identifying a large number of discrepancies (e.g., a large percentage of missing/different rows) early on indicates that a detailed row-by-row diff may not be required, thereby saving time and computational resources.

In the screenshot below, we see that exactly 4 differences were found in `user_id`, but “at least 4,704 differences” were found in `total_runtime_seconds`. `user_id` has a number of differences below the per-column diff limit, so we state the exact number. On the other hand, `total_runtime_seconds` has a number of differences greater than the per-column diff limit, so we state “at least.” Note that due to our algorithm’s approach, we often find significantly more differences than the limit before diffing is halted; in that scenario, we report the value that was found, while stating that more differences may exist.

## Executing queries in parallel

Increase the number of concurrent connections to the database in Datafold. This enables queries to be executed in parallel, significantly accelerating the diff process. Navigate to the **Settings** option in the left sidebar menu of Datafold and adjust the **max connections** setting to increase the number of concurrent connections Datafold can establish with your data connection. Note that the maximum allowable value for concurrent connections is 64.

## Optimize column selection

The number of columns included in the diff directly impacts its speed: selecting fewer columns typically results in faster execution.
To optimize performance, refine your column selection based on your specific use case:

* **Comprehensive verification**: For in-depth analysis, include all columns in the diff. This method is the most thorough, suitable for exhaustive data reviews, albeit time-intensive for wide tables.
* **Minimal verification**: Consider verifying only the primary key and `updated_at` columns. This is efficient and sufficient if you need to validate that rows have not been added or removed and that updates are current between databases, but do not need to check for value-level differences between rows with common primary keys.
* **Presence verification**: If your main concern is just the presence of data (whether data exists or has been removed), such as identifying missing hard deletes, verifying only the primary key column can be sufficient.
* **Hybrid verification**: Focus on key columns that are most critical to your operations or data integrity, such as monetary values in an `amount` column, while omitting large serialized or less critical columns like `json_settings`.

## Managing primary key distribution

Significant gaps in the primary key column can decrease diff efficiency (e.g., tens of millions of consecutive rows missing). Datafold will execute queries for non-existent row ranges, which can slow down the data diff.

## Handling different primary key types

As a general rule, primary keys should be of the same (or similar) type in both datasets for diffing to work properly. Comparing primary keys of different types (e.g., `INT` vs `VARCHAR`) will result in a type mismatch error. You can still diff such datasets by casting the primary key column to the same type in both datasets explicitly.

Indexes on the primary key typically cannot be utilized when the primary key is cast to a different type, which may result in slower diffing performance. Consider creating a separate index, such as an [expression index in PostgreSQL](https://www.postgresql.org/docs/current/indexes-expressional.html), to improve performance.

---

# Source: https://docs.datafold.com/integrations/databases/bigquery.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.datafold.com/llms.txt
> Use this file to discover all available pages before exploring further.

# BigQuery

**Steps to complete:**

1. [Create a Service Account](/integrations/databases/bigquery#create-a-service-account)
2. [Give the Service Account BigQuery Data Viewer, BigQuery Job User, BigQuery Resource Viewer access](/integrations/databases/bigquery#service-account-access-and-permissions)
3. [Create a temporary dataset and give BigQuery Data Editor access to the service account](/integrations/databases/bigquery#create-a-temporary-dataset)
4. [Generate a Service Account JSON key](/integrations/databases/bigquery#generate-a-service-account-key)
5. [Configure your data connection in Datafold](/integrations/databases/bigquery#configure-in-datafold)

## Create a Service Account

To connect Datafold to your BigQuery project, you will need to create a *service account* for Datafold to use.

* Navigate to the [Google Developers Console](https://console.developers.google.com/), click on the drop-down to the left of the search bar, and select the project you want to connect to.
* *Note: If you do not see your project, you may need to switch accounts.*
* Click on the hamburger menu in the upper left, then select **IAM & Admin** followed by **Service Accounts**.
* Create a service account named `Datafold`.
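
If you prefer the command line over the console, the equivalent service account can also be created with the `gcloud` CLI. This is a sketch of the same steps, not a required part of the setup; `MY_PROJECT_ID` is a placeholder for your own project ID, and the roles granted are the ones listed in the next section.

```bash theme={null}
# Create the Datafold service account (MY_PROJECT_ID is a placeholder).
gcloud iam service-accounts create datafold \
  --display-name="Datafold" \
  --project=MY_PROJECT_ID

# Grant one of the roles described in the next section; repeat the command
# for roles/bigquery.jobUser and roles/bigquery.resourceViewer.
gcloud projects add-iam-policy-binding MY_PROJECT_ID \
  --member="serviceAccount:datafold@MY_PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/bigquery.dataViewer"
```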
## Service Account Access and Permissions

The Datafold service account requires the following roles and permissions:

* **BigQuery Data Viewer** for read access on all the datasets in the project.
* **BigQuery Job User** to run queries.
* **BigQuery Resource Viewer** to fetch the query logs for parsing lineage.

## Create a Temporary Dataset

Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in your warehouse.

**Caution**: Make sure that the dataset lives in the same region as the rest of your data; otherwise, the dataset will not be found.

Let's navigate to BigQuery in the console and create a new dataset.

* Give the dataset a name like `datafold_tmp` and grant the Datafold service account the **BigQuery Data Editor** role.

## Generate a Service Account Key

Next, go back to the **IAM & Admin** page to generate a key for Datafold. We recommend using the JSON-formatted key. After you create the key, it will be saved to your local machine.

## Configure in Datafold

| Field Name | Description |
| --------------------------- | ----------- |
| Name | A name given to the data connection within Datafold |
| Project ID | Your BigQuery project ID. It can be found in the URL of your Google Developers Console: [https://console.developers.google.com/apis/library?project=MY_PROJECT_ID](https://console.developers.google.com/apis/library?project=MY_PROJECT_ID) |
| JSON Key File | The key file generated in the [Generate a Service Account JSON key](/integrations/databases/bigquery#generate-a-service-account-key) step |
| Schema for temporary tables | The schema name that was created in [Create a temporary dataset](/integrations/databases/bigquery#create-a-temporary-dataset). It should be formatted as `<project_name>.datafold_tmp` |
| Processing Location | Which processing zone your project uses |

Click **Create**. Your data connection is ready!

---

# Source: https://docs.datafold.com/integrations/code-repositories/bitbucket.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.datafold.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Bitbucket

## 1. Issue an Access Token

### Bitbucket Cloud

To get the [repository access token](https://support.atlassian.com/bitbucket-cloud/docs/create-a-repository-access-token/), navigate to your Bitbucket repository settings and create a new token. When configuring your token, enable the following permissions:

* **Pull requests** -> **Write**, so that Datafold can post reports with Data Diff results to pull requests.
* **Webhooks** -> **Read and write**, so that Datafold can automatically configure all the webhooks it needs.

### Bitbucket Data Center / Server

To get a [repository access token](https://confluence.atlassian.com/bitbucketserver/http-access-tokens-939515499.html), navigate to your Bitbucket repository settings and create a new token. When configuring your token, enable **Repository admin** permissions. We need admin access to the repository to post reports with Data Diff results to pull requests and to configure all necessary webhooks automatically.

## 2. Configure integration in Datafold

Navigate back to Datafold and fill in the configuration form.
### Bitbucket Cloud * **Personal/project Access Token**: the token you created in step 1. * **Repository**: your Bitbucket repository name. For example, if your Bitbucket project URL is `https://bitbucket.org/datafold/dbt/`, your Project Name is `datafold/dbt`. ### Bitbucket Data Center / Server * **Personal/project Access Token**: the token you created in step 1. * **Repository**: the full URL of your Bitbucket repository. For example, `https://bitbucket.myorg.com/projects/datafold/repos/dbt`. --- # Source: https://docs.datafold.com/api-reference/dma_v2/check-status-of-a-dma-translation-job.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Check status of a DMA translation job > Get the current status and results of a DMA translation job. Poll this endpoint to monitor translation progress and retrieve results when complete. Translation jobs can run for several minutes to hours depending on project size. ## OpenAPI ````yaml openapi-public.json get /api/v1/dma/v2/projects/{project_id}/translate/jobs/{job_id} openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/dma/v2/projects/{project_id}/translate/jobs/{job_id}: get: tags: - DMA_V2 summary: Check status of a DMA translation job description: >- Get the current status and results of a DMA translation job. Poll this endpoint to monitor translation progress and retrieve results when complete. Translation jobs can run for several minutes to hours depending on project size. operationId: get_translation_status parameters: - in: path name: project_id required: true schema: title: Project Id type: integer - in: path name: job_id required: true schema: title: Job Id type: string responses: '200': content: application/json: schema: $ref: '#/components/schemas/ApiTranslateTask' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: ApiTranslateTask: description: Response for translation task. properties: status: $ref: '#/components/schemas/JobStatus' task_id: title: Task Id type: string translated_models: anyOf: - items: $ref: '#/components/schemas/ApiTranslatedModel' type: array - type: 'null' title: Translated Models required: - task_id - status title: ApiTranslateTask type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object JobStatus: enum: - needs_confirmation - needs_authentication - waiting - processing - done - failed - cancelled title: JobStatus type: string ApiTranslatedModel: description: Information about a translated model. 
properties: asset_id: title: Asset Id type: string asset_name: title: Asset Name type: string datadiff_id: anyOf: - type: integer - type: 'null' title: Datadiff Id failure_summary: anyOf: - $ref: '#/components/schemas/ApiFailureSummary' - type: 'null' source_filename: anyOf: - type: string - type: 'null' title: Source Filename source_sql: anyOf: - type: string - type: 'null' title: Source Sql target_sql: anyOf: - type: string - type: 'null' title: Target Sql translation_status: $ref: '#/components/schemas/ApiTranslationStatus' required: - asset_name - asset_id - translation_status title: ApiTranslatedModel type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object ApiFailureSummary: description: Structured failure summary with problem, error, and solution sections. properties: error_message: title: Error Message type: string location: anyOf: - type: string - type: 'null' title: Location problem: title: Problem type: string reason: $ref: '#/components/schemas/ApiFailureReason' solution: title: Solution type: string required: - problem - error_message - solution - reason title: ApiFailureSummary type: object ApiTranslationStatus: enum: - no_translation_attempts - validation_pending - invalid_translation - valid_translation title: ApiTranslationStatus type: string ApiFailureReason: description: Reasons why a translation agent failed to complete its task. enum: - max_iterations - tool_error - resignation title: ApiFailureReason type: string securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/faq/ci-cd-testing.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # CI/CD Testing You can use [SQL filters](/deployment-testing/configuration/model-specific-ci/sql-filters) to ensure that Datafold compares equivalent subsets of data between your staging/dev and production environments, allowing for accurate data quality checks despite the difference in data volume. Yes, you can use Datafold in development. It helps catch data quality issues early by comparing data changes in your development environment before they reach production. This proactive approach ensures that errors and inconsistencies are identified and resolved during the development process, enhancing overall data reliability and preventing potential issues in production. Data teams can leverage the Datafold SDK to run data diffs from the command line while developing and testing data models. Data drift in CI occurs when the two data transformation builds that are compared by Datafold in CI have differing data outputs due to the upstream data changing over time. We have a few recommended strategies for dealing with data drift [in our docs here](/deployment-testing/best-practices/handling-data-drift). Some teams want to show Data Diff results in their tickets *before* creating a pull request. This speeds up code reviews as developers can QA code changes before requesting a PR review. If you use dbt, we explain [how you can automate this workflow here](/faq/datafold-with-dbt#can-i-run-data-diffs-before-opening-a-pr). 
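
If you are not using dbt, the same kind of diff can also be launched directly against the Datafold REST API using the `POST /api/v1/datadiffs` endpoint documented later in this file. The example below is a minimal sketch: the data source ID, table paths, and primary key column are hypothetical placeholders, and it assumes your API key is stored in the `DATAFOLD_API_KEY` environment variable.

```bash theme={null}
# Launch a data diff between a dev build and the production version of a table.
# The data source ID, table paths, and primary key are placeholders.
curl -s -X POST "https://app.datafold.com/api/v1/datadiffs" \
  -H "Authorization: Key ${DATAFOLD_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{
    "data_source1_id": 1,
    "data_source2_id": 1,
    "table1": ["ANALYTICS", "DEV", "DIM_ORGS"],
    "table2": ["ANALYTICS", "PROD", "DIM_ORGS"],
    "pk_columns": ["ORG_ID"]
  }'
```

The response includes the diff `id`, which you can use to poll for status and retrieve results once the diff completes.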
---

# Source: https://docs.datafold.com/integrations/code-repositories.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.datafold.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Integrate with Code Repositories

> Connect your code repositories with Datafold.

**NOTE** To integrate with code repositories, first connect a [Data Connection](/integrations/databases).

Next, go to **Settings** → **Repositories** and click **Add New Integration**. Then, choose your code repository provider.

---

# Source: https://docs.datafold.com/deployment-testing/configuration/column-remapping.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.datafold.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Column Remapping

> Specify column renaming in your git commit message so Datafold can map renamed columns to their original counterparts in production for accurate comparison.

When your PR includes updates to column names, it's important to specify these updates in your git commit message using the syntax below. This allows Datafold to understand how renamed columns should be compared to the column in the production data with the original name.

## Example

By specifying column remapping in the commit message, Datafold will recognize that the column has been renamed, rather than interpreting the change as removing one column and adding another.

## Syntax for column remapping

You can use any of the following syntax styles as a single line in a commit message to instruct Datafold in CI to remap a column from `oldcol` to `newcol`.

```Bash theme={null}
# All models/tables in the PR:
datafold remap oldcol newcol
X-Datafold: rename oldcol newcol
/datafold renamed oldcol newcol
datafold: remapped oldcol newcol

# Filtered models/tables by shell-like glob:
datafold remap oldcol newcol model_NAME
X-Datafold: rename oldcol newcol TABLE
/datafold renamed oldcol newcol VIEW_*
```

## Chaining together column name updates

Commit messages can be chained together to reflect sequential changes. This means that a commit message does not lock you in to renaming a column. For example, if your commit history contains a chain of renames, Datafold will understand that the production column `name` has been renamed to `first_name` in the PR branch.

## Handling column renaming in git commits and PR comments

### Git commits

Git commits track changes on a change-by-change basis and linearize history assuming merged branches introduce new changes on top of the base/current branch (1st parent).

### PR comments

PR comments apply changes to the entire changeset.

### When to use git commits or PR comments?

When handling chained renames:

* **Git commits:** Sequential renames (`col1 > col2 > col3`) result in the final rename (`col1 > col3`).
* **PR comments:** It's best to specify the final result directly (`col1 > col3`). Sequential renames (`col1 > col2 > col3`) can also work, but specifying the final state simplifies understanding during review.

| Aspect | Git Commits | PR Comments |
| ------------------------- | ----------- | ----------- |
| **Tracking Changes** | Tracks changes on a change-by-change basis. | Applies changes to the entire changeset. |
| **History Linearization** | Linearizes history assuming merged branches introduce new changes on top of the base/current branch (1st parent). | N/A |
| **Chained Renames** | Sequential renames (col1 > col2 > col3) result in the final rename (col1 > col3). | It's best to specify the final result directly (col1 > col3). Sequential renames (col1 > col2 > col3) can also work, but specifying the final state simplifies understanding during review. |
| **Precedence** | Renames specified in git commits are applied in sequence unless overridden by subsequent commits. | PR comments take precedence over renames specified in git commits if applied during the review process. |

These guidelines ensure consistency and clarity when managing column renaming in collaborative development environments, leveraging Datafold's capabilities effectively.

---

# Source: https://docs.datafold.com/deployment-testing/configuration.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.datafold.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Configuration

> Explore configuration options for CI/CD testing in Datafold.

Learn how Datafold infers primary keys for accurate Data Diffs. Map renamed columns in PRs to their production counterparts. Configure when Datafold runs in CI, including on-demand triggers. Set model-specific filters and configurations for CI runs.

---

# Source: https://docs.datafold.com/data-diff/connection-budgets.md

# Connection Budgets

> How connection budgets are enforced across data diffs in Datafold

## Overview

Datafold now supports **shared connection budgeting** across:

* in-database diffs
* cross-database diffs
* in-memory diffs

This feature ensures consistent, predictable behavior for database usage across the system, which is particularly important in environments with limited database connection capacity.

***

## ✨ Shared Connection Budgeting

Datafold now enforces a **shared connection limit per database** across all supported diff runs. When a maximum number of connections is configured on a data source, this limit is respected **collectively** across all running diffs that target that source, regardless of the type of diff.

This ensures that no combination of diff runs will exceed the specified connection cap for the database, providing:

* ✅ More predictable resource usage
* ✅ Protection against overloading the database
* ✅ Simpler configuration and expectation management

Connection limits are enforced automatically once set; there is no need to configure them at the individual diff level.

***

## ✅ Scope of This Feature

| Jobs | Connection Budget Applied? |
| -------------------- | -------------------------- |
| in-database diffs | ✅ Yes |
| cross-database diffs | ✅ Yes |
| in-memory diffs | ✅ Yes |
| Schema Fetching | ❌ No |
| Lineage & Profiling | ❌ No |
| SQL History | ❌ No |
| Monitors | ❌ No |

***

## ⚙️ Configuration

Shared connection budgeting is controlled via your **data source configuration**. Once a `Max Connections` limit is set, it will be automatically enforced **across all supported diff runs** targeting that database.

## 📬 Feedback

Questions, suggestions, or unexpected behavior? Reach out to the Datafold team via your usual support or engineering channels.
*** --- # Source: https://docs.datafold.com/api-reference/monitors/create-a-data-diff-monitor.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Create a Data Diff Monitor ## OpenAPI ````yaml openapi-public.json post /api/v1/monitors/create/diff openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/monitors/create/diff: post: tags: - Monitors summary: Create a Data Diff Monitor operationId: create_monitor_diff_api_v1_monitors_create_diff_post requestBody: content: application/json: schema: $ref: '#/components/schemas/DataDiffMonitorSpecPublic' required: true responses: '200': content: application/json: schema: $ref: '#/components/schemas/ApiPublicCreateMonitorOut' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: DataDiffMonitorSpecPublic: properties: alert: anyOf: - $ref: >- #/components/schemas/datafold__monitors__schemas__DiffAlertCondition - type: 'null' description: Condition for triggering alerts based on the data diff. datadiff: description: Configuration for the data diff. discriminator: mapping: indb: '#/components/schemas/InDbDataDiffConfig' inmem: '#/components/schemas/InMemDataDiffConfig' propertyName: diff_type oneOf: - $ref: '#/components/schemas/InDbDataDiffConfig' - $ref: '#/components/schemas/InMemDataDiffConfig' description: anyOf: - type: string - type: 'null' description: The description of the monitor. title: Description enabled: default: true description: Indicates whether the monitor is enabled. title: Enabled type: boolean name: description: The name of the monitor. title: Name type: string notifications: description: Notification configuration for the monitor. items: discriminator: mapping: email: '#/components/schemas/EmailNotification' pagerduty: '#/components/schemas/PagerDutyNotification' slack: '#/components/schemas/SlackNotification' teams: '#/components/schemas/TeamsNotification' webhook: '#/components/schemas/WebhookNotification' propertyName: type oneOf: - $ref: '#/components/schemas/EmailNotification' - $ref: '#/components/schemas/PagerDutyNotification' - $ref: '#/components/schemas/WebhookNotification' - $ref: '#/components/schemas/SlackNotification' - $ref: '#/components/schemas/TeamsNotification' title: Notifications type: array schedule: anyOf: - $ref: '#/components/schemas/IntervalSchedule' - $ref: '#/components/schemas/CronSchedule' - $ref: '#/components/schemas/NoneSchedule' description: The schedule at which the monitor runs. tags: description: Tags associated with the monitor. 
items: type: string title: Tags type: array required: - schedule - name - datadiff title: DataDiffMonitorSpecPublic type: object ApiPublicCreateMonitorOut: properties: id: description: Unique identifier for the monitor. title: Id type: integer required: - id title: ApiPublicCreateMonitorOut type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object datafold__monitors__schemas__DiffAlertCondition: properties: different_rows_count: anyOf: - type: integer - type: 'null' description: >- Threshold for the number of different rows allowed between the datasets. title: Different Rows Count different_rows_percent: anyOf: - type: integer - type: 'null' description: >- Threshold for the percentage of different rows allowed between the datasets. title: Different Rows Percent title: Diff Conditions type: object InDbDataDiffConfig: properties: column_remapping: anyOf: - additionalProperties: type: string type: object - type: 'null' description: Mapping of columns from one dataset to another for comparison. title: Column Remapping columns_to_compare: anyOf: - items: type: string type: array - type: 'null' description: Optional list of columns to compare between the datasets. title: Columns To Compare dataset_a: anyOf: - $ref: '#/components/schemas/InDbTableDataset' - $ref: '#/components/schemas/InDbQueryDataset' description: The first dataset to compare. dataset_b: anyOf: - $ref: '#/components/schemas/InDbTableDataset' - $ref: '#/components/schemas/InDbQueryDataset' description: The second dataset to compare. ignore_string_case: default: false description: Indicates whether to ignore case differences in string comparisons. title: Ignore String Case type: boolean materialize_results: default: false description: Indicates whether to materialize the results of the comparison. title: Materialize Results type: boolean primary_key: description: List of columns that make up the primary key for the datasets. items: type: string title: Primary Key type: array sampling: anyOf: - $ref: '#/components/schemas/ToleranceBasedSampling' - $ref: '#/components/schemas/PercentageSampling' - $ref: '#/components/schemas/MaxRowsSampling' - type: 'null' description: Sampling configuration for the data comparison. timeseries_dimension_column: anyOf: - type: string - type: 'null' description: Column used for time series dimensioning in the comparison. title: Timeseries Dimension Column tolerance: anyOf: - $ref: '#/components/schemas/DataDiffToleranceConfig' - type: 'null' description: Configuration for tolerance. required: - primary_key - dataset_a - dataset_b title: In-Database type: object InMemDataDiffConfig: properties: column_remapping: anyOf: - additionalProperties: type: string type: object - type: 'null' description: Mapping of columns from one dataset to another for comparison. title: Column Remapping columns_to_compare: anyOf: - items: type: string type: array - type: 'null' description: Optional list of columns to compare between the datasets. title: Columns To Compare dataset_a: anyOf: - $ref: '#/components/schemas/XdbTableDataset' - $ref: '#/components/schemas/XdbQueryDataset' description: The first dataset to compare. title: Dataset A dataset_b: anyOf: - $ref: '#/components/schemas/XdbTableDataset' - $ref: '#/components/schemas/XdbQueryDataset' description: The second dataset to compare. 
title: Dataset B ignore_string_case: default: false description: Indicates whether to ignore case differences in string comparisons. title: Ignore String Case type: boolean materialize_results: default: false description: Indicates whether to materialize the results of the comparison. title: Materialize Results type: boolean materialize_results_to: anyOf: - type: integer - type: 'null' description: Identifier for the destination where results should be materialized. title: Materialize Results To primary_key: description: List of columns that make up the primary key for the datasets. items: type: string title: Primary Key type: array sampling: anyOf: - $ref: '#/components/schemas/ToleranceBasedSampling' - $ref: '#/components/schemas/PercentageSampling' - $ref: '#/components/schemas/MaxRowsSampling' - type: 'null' description: Sampling configuration for the data comparison. title: Sampling tolerance: anyOf: - $ref: '#/components/schemas/DataDiffToleranceConfig' - type: 'null' description: Configuration for tolerance. required: - primary_key - dataset_a - dataset_b title: In-Memory type: object EmailNotification: properties: features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features recipients: description: A list of email addresses to receive the notification. items: type: string title: Recipients type: array type: const: email default: email title: Type type: string required: - recipients title: Email type: object PagerDutyNotification: properties: features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features integration: description: The identifier for the integration. title: Integration type: integer type: const: pagerduty default: pagerduty title: Type type: string required: - integration title: PagerDuty type: object WebhookNotification: properties: features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features integration: description: The identifier for the integration. title: Integration type: integer type: const: webhook default: webhook title: Type type: string required: - integration title: Webhook type: object SlackNotification: properties: channel: description: The channel through which the notification will be sent. title: Channel type: string features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features integration: description: The identifier for the integration. title: Integration type: integer mentions: description: A list of mentions to include in the notification. items: type: string title: Mentions type: array type: const: slack default: slack title: Type type: string required: - integration - channel title: Slack type: object TeamsNotification: properties: channel: description: The channel through which the notification will be sent. title: Channel type: string features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features integration: description: The identifier for the integration. 
title: Integration type: integer mentions: description: A list of mentions names to include in the notification. items: type: string title: Mentions type: array type: const: teams default: teams title: Type type: string required: - integration - channel title: Teams type: object IntervalSchedule: properties: interval: anyOf: - $ref: '#/components/schemas/HourIntervalSchedule' - $ref: '#/components/schemas/DayIntervalSchedule' description: Specifies the scheduling interval. required: - interval title: Interval type: object CronSchedule: properties: cron: description: The cron expression that defines the schedule. title: Cron type: string type: const: crontab default: crontab title: Type type: string required: - cron title: Cron type: object NoneSchedule: properties: type: const: none default: none title: Type type: string title: None type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object InDbTableDataset: properties: connection_id: description: The identifier for the data source configuration. title: Connection Id type: integer filter: anyOf: - type: string - type: 'null' description: Filter condition for querying the dataset. title: Filter session_parameters: anyOf: - additionalProperties: true type: object - type: 'null' description: Session parameters for the database session. title: Session Parameters table: description: The table in the format 'db.schema.table'. title: Table type: string time_travel_point: anyOf: - type: string - type: integer - type: 'null' description: Point in time for querying historical data. title: Time Travel Point required: - connection_id - table title: Table type: object InDbQueryDataset: properties: connection_id: description: The identifier for the data source configuration. title: Connection Id type: integer query: description: The SQL query to be evaluated. title: Query type: string session_parameters: anyOf: - additionalProperties: true type: object - type: 'null' description: Parameters for the database session. title: Session Parameters time_travel_point: anyOf: - type: string - type: integer - type: 'null' description: Point in time for querying historical data. title: Time Travel Point required: - connection_id - query title: Query type: object ToleranceBasedSampling: properties: confidence: description: The confidence level for the sampling results. title: Confidence type: number threshold: anyOf: - type: integer - type: 'null' description: Threshold for triggering actions based on sampling. title: Threshold tolerance: description: The allowable margin of error for sampling. title: Tolerance type: number required: - tolerance - confidence title: Tolerance type: object PercentageSampling: properties: rate: description: The sampling rate as a percentage. title: Rate type: number threshold: anyOf: - type: integer - type: 'null' description: Threshold for triggering actions based on sampling. title: Threshold required: - rate title: Percentage type: object MaxRowsSampling: properties: max_rows: description: The maximum number of rows to sample. title: Max Rows type: integer threshold: anyOf: - type: integer - type: 'null' description: Threshold for triggering actions based on sampling. 
title: Threshold required: - max_rows title: MaxRows type: object DataDiffToleranceConfig: properties: float: anyOf: - $ref: '#/components/schemas/ColumnToleranceConfig' - type: 'null' description: Configuration for float columns tolerance. title: DataDiffToleranceConfig type: object XdbTableDataset: properties: connection_id: description: The identifier for the data source configuration. title: Connection Id type: integer filter: anyOf: - type: string - type: 'null' description: Filter condition for querying the dataset. title: Filter materialize: default: true description: Indicates whether to materialize the dataset. title: Materialize type: boolean session_parameters: anyOf: - additionalProperties: true type: object - type: 'null' description: Session parameters for the database session. title: Session Parameters table: description: The table in the format 'db.schema.table'. title: Table type: string required: - connection_id - table title: Table type: object XdbQueryDataset: properties: connection_id: description: The identifier for the data source configuration. title: Connection Id type: integer materialize: default: true description: Indicates whether to materialize the dataset. title: Materialize type: boolean query: description: The SQL query to be evaluated. title: Query type: string session_parameters: anyOf: - additionalProperties: true type: object - type: 'null' description: Parameters for the database session. title: Session Parameters required: - connection_id - query title: Query type: object DestinationFeatures: enum: - attach_csv - notify_first_triggered_only - disable_recovery_notifications - notify_every_run title: DestinationFeatures type: string HourIntervalSchedule: properties: every: const: hour title: Every type: string type: const: hourly default: hourly title: Type type: string required: - every title: Hour type: object DayIntervalSchedule: properties: every: const: day title: Every type: string hour: anyOf: - type: integer - type: 'null' description: The hour at which the monitor should trigger. (0 - 23) title: Hour type: const: daily default: daily title: Type type: string utc_at: anyOf: - format: time type: string - type: 'null' description: The UTC time at which the monitor should trigger. title: Utc At required: - every title: Day type: object ColumnToleranceConfig: properties: column_tolerance: anyOf: - additionalProperties: discriminator: mapping: absolute: '#/components/schemas/AbsoluteColumnTolerance' relative: '#/components/schemas/RelativeColumnTolerance' propertyName: type oneOf: - $ref: '#/components/schemas/RelativeColumnTolerance' - $ref: '#/components/schemas/AbsoluteColumnTolerance' type: object - type: 'null' description: Specific tolerance per column. title: Column Tolerance default: anyOf: - discriminator: mapping: absolute: '#/components/schemas/AbsoluteColumnTolerance' relative: '#/components/schemas/RelativeColumnTolerance' propertyName: type oneOf: - $ref: '#/components/schemas/RelativeColumnTolerance' - $ref: '#/components/schemas/AbsoluteColumnTolerance' - type: 'null' description: Default tolerance applied to all columns. title: Default title: ColumnToleranceConfig type: object RelativeColumnTolerance: properties: type: const: relative default: relative description: The type of Column Tolerance. title: Type type: string value: anyOf: - type: number - type: integer description: Value of Column Tolerance. 
title: Value required: - value title: Relative type: object AbsoluteColumnTolerance: properties: type: const: absolute default: absolute description: The type of Column Tolerance. title: Type type: string value: description: Value of Column Tolerance. title: Value type: number required: - value title: Absolute type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/data-diffs/create-a-data-diff.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Create a data diff > Launches a new data diff to compare two datasets (tables or queries). A data diff identifies differences between two datasets by comparing: - Row-level changes (added, removed, modified rows) - Schema differences - Column-level statistics The diff runs asynchronously. Use the returned diff ID to poll for status and retrieve results. ## OpenAPI ````yaml post /api/v1/datadiffs openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/datadiffs: post: tags: - Data diffs - diff_created summary: Create a data diff description: >- Launches a new data diff to compare two datasets (tables or queries). A data diff identifies differences between two datasets by comparing: - Row-level changes (added, removed, modified rows) - Schema differences - Column-level statistics The diff runs asynchronously. Use the returned diff ID to poll for status and retrieve results. 
operationId: create_datadiff requestBody: content: application/json: schema: $ref: '#/components/schemas/ApiDataDiffData' required: true responses: '200': content: application/json: schema: $ref: '#/components/schemas/ApiDataDiffFull' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: ApiDataDiffData: properties: archived: default: false title: Archived type: boolean bisection_factor: anyOf: - type: integer - type: 'null' title: Bisection Factor bisection_threshold: anyOf: - type: integer - type: 'null' title: Bisection Threshold column_mapping: anyOf: - items: maxItems: 2 minItems: 2 prefixItems: - type: string - type: string type: array type: array - type: 'null' title: Column Mapping columns_to_compare: anyOf: - items: type: string type: array - type: 'null' title: Columns To Compare compare_duplicates: anyOf: - type: boolean - type: 'null' title: Compare Duplicates data_source1_id: title: Data Source1 Id type: integer data_source1_session_parameters: anyOf: - additionalProperties: true type: object - type: 'null' title: Data Source1 Session Parameters data_source2_id: title: Data Source2 Id type: integer data_source2_session_parameters: anyOf: - additionalProperties: true type: object - type: 'null' title: Data Source2 Session Parameters datetime_tolerance: anyOf: - type: integer - type: 'null' title: Datetime Tolerance diff_tolerance: anyOf: - type: number - type: 'null' title: Diff Tolerance diff_tolerances_per_column: anyOf: - items: $ref: '#/components/schemas/ColumnTolerance' type: array - type: 'null' title: Diff Tolerances Per Column download_limit: anyOf: - type: integer - type: 'null' title: Download Limit exclude_columns: anyOf: - items: type: string type: array - type: 'null' title: Exclude Columns file1: anyOf: - format: uri minLength: 1 type: string - type: 'null' title: File1 file1_options: anyOf: - discriminator: mapping: csv: '#/components/schemas/CSVFileOptions' excel: '#/components/schemas/ExcelFileOptions' parquet: '#/components/schemas/ParquetFileOptions' propertyName: file_type oneOf: - $ref: '#/components/schemas/CSVFileOptions' - $ref: '#/components/schemas/ExcelFileOptions' - $ref: '#/components/schemas/ParquetFileOptions' - type: 'null' title: File1 Options file2: anyOf: - format: uri minLength: 1 type: string - type: 'null' title: File2 file2_options: anyOf: - discriminator: mapping: csv: '#/components/schemas/CSVFileOptions' excel: '#/components/schemas/ExcelFileOptions' parquet: '#/components/schemas/ParquetFileOptions' propertyName: file_type oneOf: - $ref: '#/components/schemas/CSVFileOptions' - $ref: '#/components/schemas/ExcelFileOptions' - $ref: '#/components/schemas/ParquetFileOptions' - type: 'null' title: File2 Options filter1: anyOf: - type: string - type: 'null' title: Filter1 filter2: anyOf: - type: string - type: 'null' title: Filter2 include_columns: anyOf: - items: type: string type: array - type: 'null' title: Include Columns materialization_destination_id: anyOf: - type: integer - type: 'null' title: Materialization Destination Id materialize_dataset1: anyOf: - type: boolean - type: 'null' title: Materialize Dataset1 materialize_dataset2: anyOf: - type: boolean - type: 'null' title: Materialize Dataset2 materialize_without_sampling: anyOf: - type: boolean - type: 'null' default: false title: Materialize Without Sampling per_column_diff_limit: anyOf: - type: integer - type: 'null' title: Per Column Diff Limit pk_columns: 
items: type: string title: Pk Columns type: array purged: default: false title: Purged type: boolean query1: anyOf: - type: string - type: 'null' title: Query1 query2: anyOf: - type: string - type: 'null' title: Query2 run_profiles: anyOf: - type: boolean - type: 'null' title: Run Profiles sampling_confidence: anyOf: - type: number - type: 'null' title: Sampling Confidence sampling_max_rows: anyOf: - type: integer - type: 'null' title: Sampling Max Rows sampling_ratio: anyOf: - type: number - type: 'null' title: Sampling Ratio sampling_threshold: anyOf: - type: integer - type: 'null' title: Sampling Threshold sampling_tolerance: anyOf: - type: number - type: 'null' title: Sampling Tolerance table1: anyOf: - items: type: string type: array - type: 'null' title: Table1 table2: anyOf: - items: type: string type: array - type: 'null' title: Table2 table_modifiers: anyOf: - items: $ref: '#/components/schemas/TableModifiers' type: array - type: 'null' title: Table Modifiers tags: anyOf: - items: type: string type: array - type: 'null' title: Tags time_aggregate: anyOf: - $ref: '#/components/schemas/TimeAggregateEnum' - type: 'null' time_column: anyOf: - type: string - type: 'null' title: Time Column time_interval_end: anyOf: - format: date-time type: string - type: 'null' title: Time Interval End time_interval_start: anyOf: - format: date-time type: string - type: 'null' title: Time Interval Start time_travel_point1: anyOf: - type: integer - format: date-time type: string - type: string - type: 'null' title: Time Travel Point1 time_travel_point2: anyOf: - type: integer - format: date-time type: string - type: string - type: 'null' title: Time Travel Point2 tolerance_mode: anyOf: - $ref: '#/components/schemas/ToleranceModeEnum' - type: 'null' required: - data_source1_id - data_source2_id - pk_columns title: ApiDataDiffData type: object ApiDataDiffFull: properties: affected_columns: anyOf: - items: type: string type: array - type: 'null' title: Affected Columns algorithm: anyOf: - $ref: '#/components/schemas/DiffAlgorithm' - type: 'null' archived: default: false title: Archived type: boolean bisection_factor: anyOf: - type: integer - type: 'null' title: Bisection Factor bisection_threshold: anyOf: - type: integer - type: 'null' title: Bisection Threshold ci_base_branch: anyOf: - type: string - type: 'null' title: Ci Base Branch ci_pr_branch: anyOf: - type: string - type: 'null' title: Ci Pr Branch ci_pr_num: anyOf: - type: integer - type: 'null' title: Ci Pr Num ci_pr_sha: anyOf: - type: string - type: 'null' title: Ci Pr Sha ci_pr_url: anyOf: - type: string - type: 'null' title: Ci Pr Url ci_pr_user_display_name: anyOf: - type: string - type: 'null' title: Ci Pr User Display Name ci_pr_user_email: anyOf: - type: string - type: 'null' title: Ci Pr User Email ci_pr_user_id: anyOf: - type: string - type: 'null' title: Ci Pr User Id ci_pr_username: anyOf: - type: string - type: 'null' title: Ci Pr Username ci_run_id: anyOf: - type: integer - type: 'null' title: Ci Run Id ci_sha_url: anyOf: - type: string - type: 'null' title: Ci Sha Url column_mapping: anyOf: - items: maxItems: 2 minItems: 2 prefixItems: - type: string - type: string type: array type: array - type: 'null' title: Column Mapping columns_to_compare: anyOf: - items: type: string type: array - type: 'null' title: Columns To Compare compare_duplicates: anyOf: - type: boolean - type: 'null' title: Compare Duplicates created_at: anyOf: - format: date-time type: string - type: 'null' title: Created At data_app_metadata: anyOf: - $ref: 
'#/components/schemas/TDataDiffDataAppMetadata' - type: 'null' data_app_type: anyOf: - type: string - type: 'null' title: Data App Type data_source1_id: title: Data Source1 Id type: integer data_source1_session_parameters: anyOf: - additionalProperties: true type: object - type: 'null' title: Data Source1 Session Parameters data_source2_id: title: Data Source2 Id type: integer data_source2_session_parameters: anyOf: - additionalProperties: true type: object - type: 'null' title: Data Source2 Session Parameters datetime_tolerance: anyOf: - type: integer - type: 'null' title: Datetime Tolerance diff_stats: anyOf: - $ref: '#/components/schemas/DiffStats' - type: 'null' diff_tolerance: anyOf: - type: number - type: 'null' title: Diff Tolerance diff_tolerances_per_column: anyOf: - items: $ref: '#/components/schemas/ColumnTolerance' type: array - type: 'null' title: Diff Tolerances Per Column done: anyOf: - type: boolean - type: 'null' default: false title: Done download_limit: anyOf: - type: integer - type: 'null' title: Download Limit exclude_columns: anyOf: - items: type: string type: array - type: 'null' title: Exclude Columns execute_as_user: anyOf: - type: boolean - type: 'null' title: Execute As User file1: anyOf: - format: uri minLength: 1 type: string - type: 'null' title: File1 file1_options: anyOf: - discriminator: mapping: csv: '#/components/schemas/CSVFileOptions' excel: '#/components/schemas/ExcelFileOptions' parquet: '#/components/schemas/ParquetFileOptions' propertyName: file_type oneOf: - $ref: '#/components/schemas/CSVFileOptions' - $ref: '#/components/schemas/ExcelFileOptions' - $ref: '#/components/schemas/ParquetFileOptions' - type: 'null' title: File1 Options file2: anyOf: - format: uri minLength: 1 type: string - type: 'null' title: File2 file2_options: anyOf: - discriminator: mapping: csv: '#/components/schemas/CSVFileOptions' excel: '#/components/schemas/ExcelFileOptions' parquet: '#/components/schemas/ParquetFileOptions' propertyName: file_type oneOf: - $ref: '#/components/schemas/CSVFileOptions' - $ref: '#/components/schemas/ExcelFileOptions' - $ref: '#/components/schemas/ParquetFileOptions' - type: 'null' title: File2 Options filter1: anyOf: - type: string - type: 'null' title: Filter1 filter2: anyOf: - type: string - type: 'null' title: Filter2 finished_at: anyOf: - format: date-time type: string - type: 'null' title: Finished At id: anyOf: - type: integer - type: 'null' title: Id include_columns: anyOf: - items: type: string type: array - type: 'null' title: Include Columns kind: $ref: '#/components/schemas/DiffKind' materialization_destination_id: anyOf: - type: integer - type: 'null' title: Materialization Destination Id materialize_dataset1: anyOf: - type: boolean - type: 'null' title: Materialize Dataset1 materialize_dataset2: anyOf: - type: boolean - type: 'null' title: Materialize Dataset2 materialize_without_sampling: anyOf: - type: boolean - type: 'null' default: false title: Materialize Without Sampling monitor_error: anyOf: - $ref: '#/components/schemas/QueryError' - type: 'null' monitor_id: anyOf: - type: integer - type: 'null' title: Monitor Id monitor_state: anyOf: - $ref: '#/components/schemas/MonitorRunState' - type: 'null' per_column_diff_limit: anyOf: - type: integer - type: 'null' title: Per Column Diff Limit pk_columns: items: type: string title: Pk Columns type: array purged: default: false title: Purged type: boolean query1: anyOf: - type: string - type: 'null' title: Query1 query2: anyOf: - type: string - type: 'null' title: Query2 result: 
anyOf: - enum: - error - bad-pks - different - missing-pks - identical - empty type: string - type: 'null' title: Result result_revisions: additionalProperties: type: integer default: {} title: Result Revisions type: object result_statuses: anyOf: - additionalProperties: type: string type: object - type: 'null' title: Result Statuses run_profiles: anyOf: - type: boolean - type: 'null' title: Run Profiles runtime: anyOf: - type: number - type: 'null' title: Runtime sampling_confidence: anyOf: - type: number - type: 'null' title: Sampling Confidence sampling_max_rows: anyOf: - type: integer - type: 'null' title: Sampling Max Rows sampling_ratio: anyOf: - type: number - type: 'null' title: Sampling Ratio sampling_threshold: anyOf: - type: integer - type: 'null' title: Sampling Threshold sampling_tolerance: anyOf: - type: number - type: 'null' title: Sampling Tolerance source: anyOf: - $ref: '#/components/schemas/JobSource' - type: 'null' status: anyOf: - $ref: '#/components/schemas/JobStatus' - type: 'null' table1: anyOf: - items: type: string type: array - type: 'null' title: Table1 table2: anyOf: - items: type: string type: array - type: 'null' title: Table2 table_modifiers: anyOf: - items: $ref: '#/components/schemas/TableModifiers' type: array - type: 'null' title: Table Modifiers tags: anyOf: - items: type: string type: array - type: 'null' title: Tags temp_schema_override: anyOf: - items: type: string type: array - type: 'null' title: Temp Schema Override time_aggregate: anyOf: - $ref: '#/components/schemas/TimeAggregateEnum' - type: 'null' time_column: anyOf: - type: string - type: 'null' title: Time Column time_interval_end: anyOf: - format: date-time type: string - type: 'null' title: Time Interval End time_interval_start: anyOf: - format: date-time type: string - type: 'null' title: Time Interval Start time_travel_point1: anyOf: - type: integer - format: date-time type: string - type: string - type: 'null' title: Time Travel Point1 time_travel_point2: anyOf: - type: integer - format: date-time type: string - type: string - type: 'null' title: Time Travel Point2 tolerance_mode: anyOf: - $ref: '#/components/schemas/ToleranceModeEnum' - type: 'null' updated_at: anyOf: - format: date-time type: string - type: 'null' title: Updated At user_id: anyOf: - type: integer - type: 'null' title: User Id required: - data_source1_id - data_source2_id - pk_columns - kind title: ApiDataDiffFull type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ColumnTolerance: properties: column_name: title: Column Name type: string tolerance_mode: $ref: '#/components/schemas/ToleranceModeEnum' tolerance_value: title: Tolerance Value type: number required: - column_name - tolerance_value - tolerance_mode title: ColumnTolerance type: object CSVFileOptions: properties: delimiter: anyOf: - type: string - type: 'null' title: Delimiter file_type: const: csv default: csv title: File Type type: string skip_head_rows: anyOf: - type: integer - type: 'null' title: Skip Head Rows skip_tail_rows: anyOf: - type: integer - type: 'null' title: Skip Tail Rows title: CSVFileOptions type: object ExcelFileOptions: properties: file_type: const: excel default: excel title: File Type type: string sheet: anyOf: - type: string - type: 'null' title: Sheet skip_head_rows: anyOf: - type: integer - type: 'null' title: Skip Head Rows skip_tail_rows: anyOf: - type: integer - type: 'null' title: Skip Tail Rows title: 
ExcelFileOptions type: object ParquetFileOptions: properties: file_type: const: parquet default: parquet title: File Type type: string title: ParquetFileOptions type: object TableModifiers: enum: - case_insensitive_strings title: TableModifiers type: string TimeAggregateEnum: enum: - minute - hour - day - week - month - year title: TimeAggregateEnum type: string ToleranceModeEnum: enum: - absolute - relative title: ToleranceModeEnum type: string DiffAlgorithm: enum: - join - hash - hash_v2_alpha - fetch_and_join title: DiffAlgorithm type: string TDataDiffDataAppMetadata: properties: data_app_id: title: Data App Id type: integer data_app_model1_id: anyOf: - type: string - type: 'null' title: Data App Model1 Id data_app_model1_name: anyOf: - type: string - type: 'null' title: Data App Model1 Name data_app_model2_id: anyOf: - type: string - type: 'null' title: Data App Model2 Id data_app_model2_name: anyOf: - type: string - type: 'null' title: Data App Model2 Name data_app_model_type: title: Data App Model Type type: string meta_data: additionalProperties: true title: Meta Data type: object required: - data_app_id - data_app_model_type - meta_data title: TDataDiffDataAppMetadata type: object DiffStats: properties: diff_duplicate_pks: anyOf: - type: number - type: 'null' title: Diff Duplicate Pks diff_null_pks: anyOf: - type: number - type: 'null' title: Diff Null Pks diff_pks: anyOf: - type: number - type: 'null' title: Diff Pks diff_rows: anyOf: - type: number - type: 'null' title: Diff Rows diff_rows_count: anyOf: - type: integer - type: 'null' title: Diff Rows Count diff_rows_number: anyOf: - type: number - type: 'null' title: Diff Rows Number diff_schema: anyOf: - type: number - type: 'null' title: Diff Schema diff_values: anyOf: - type: number - type: 'null' title: Diff Values errors: anyOf: - type: integer - type: 'null' title: Errors match_ratio: anyOf: - type: number - type: 'null' title: Match Ratio rows_added: anyOf: - type: integer - type: 'null' title: Rows Added rows_removed: anyOf: - type: integer - type: 'null' title: Rows Removed sampled: anyOf: - type: boolean - type: 'null' title: Sampled table_a_row_count: anyOf: - type: integer - type: 'null' title: Table A Row Count table_b_row_count: anyOf: - type: integer - type: 'null' title: Table B Row Count version: title: Version type: string required: - version title: DiffStats type: object DiffKind: enum: - in_db - cross_db title: DiffKind type: string QueryError: properties: error_type: title: Error Type type: string error_value: title: Error Value type: string required: - error_type - error_value title: QueryError type: object MonitorRunState: enum: - ok - alert - error - learning - checking - created - skipped - cancelled title: MonitorRunState type: string JobSource: enum: - interactive - demo_signup - manual - api - ci - schedule - auto title: JobSource type: string JobStatus: enum: - needs_confirmation - needs_authentication - waiting - processing - done - failed - cancelled title: JobStatus type: string ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/data-sources/create-a-data-source.md > ## Documentation Index > Fetch the complete 
documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Create a data source ## OpenAPI ````yaml post /api/v1/data_sources openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/data_sources: post: tags: - Data sources - data_source_added summary: Create a data source operationId: create_new_datasource_api_v1_data_sources_post requestBody: content: application/json: schema: $ref: '#/components/schemas/ApiDataSourceForm' required: true responses: '200': content: application/json: schema: discriminator: mapping: athena: '#/components/schemas/ApiDataSourceAwsAthena' aws_s3: '#/components/schemas/ApiDataSourceS3' azure_synapse: '#/components/schemas/ApiDataSourceAzureSynapse' bigquery: '#/components/schemas/ApiDataSourceBigQuery' databricks: '#/components/schemas/ApiDataSourceDatabricks' dremio: '#/components/schemas/ApiDataSourceDremio' duckdb: '#/components/schemas/ApiDataSourceDuckDB' files_azure_datalake: '#/components/schemas/ApiDataSourceAzureDataLake' google_cloud_storage: '#/components/schemas/ApiDataSourceGCS' mariadb: '#/components/schemas/ApiDataSourceMariaDB' microsoft_fabric: '#/components/schemas/ApiDataSourceMicrosoftFabric' mongodb: '#/components/schemas/ApiDataSourceMongoDB' mssql: '#/components/schemas/ApiDataSourceMSSQL' mysql: '#/components/schemas/ApiDataSourceMySQL' netezza: '#/components/schemas/ApiDataSourceNetezza' oracle: '#/components/schemas/ApiDataSourceOracle' pg: '#/components/schemas/ApiDataSourcePostgres' postgres_aurora: '#/components/schemas/ApiDataSourcePostgresAurora' postgres_aws_rds: '#/components/schemas/ApiDataSourcePostgresRds' redshift: '#/components/schemas/ApiDataSourceRedshift' sap_hana: '#/components/schemas/ApiDataSourceSapHana' snowflake: '#/components/schemas/ApiDataSourceSnowflake' starburst: '#/components/schemas/ApiDataSourceStarburst' teradata: '#/components/schemas/ApiDataSourceTeradata' trino: '#/components/schemas/ApiDataSourceTrino' vertica: '#/components/schemas/ApiDataSourceVertica' propertyName: type oneOf: - $ref: '#/components/schemas/ApiDataSourceBigQuery' - $ref: '#/components/schemas/ApiDataSourceDatabricks' - $ref: '#/components/schemas/ApiDataSourceDuckDB' - $ref: '#/components/schemas/ApiDataSourceMongoDB' - $ref: '#/components/schemas/ApiDataSourceMySQL' - $ref: '#/components/schemas/ApiDataSourceMariaDB' - $ref: '#/components/schemas/ApiDataSourceMSSQL' - $ref: '#/components/schemas/ApiDataSourceOracle' - $ref: '#/components/schemas/ApiDataSourcePostgres' - $ref: '#/components/schemas/ApiDataSourcePostgresAurora' - $ref: '#/components/schemas/ApiDataSourcePostgresRds' - $ref: '#/components/schemas/ApiDataSourceRedshift' - $ref: '#/components/schemas/ApiDataSourceTeradata' - $ref: '#/components/schemas/ApiDataSourceSapHana' - $ref: '#/components/schemas/ApiDataSourceAwsAthena' - $ref: 
'#/components/schemas/ApiDataSourceSnowflake' - $ref: '#/components/schemas/ApiDataSourceDremio' - $ref: '#/components/schemas/ApiDataSourceStarburst' - $ref: '#/components/schemas/ApiDataSourceNetezza' - $ref: '#/components/schemas/ApiDataSourceAzureDataLake' - $ref: '#/components/schemas/ApiDataSourceGCS' - $ref: '#/components/schemas/ApiDataSourceS3' - $ref: '#/components/schemas/ApiDataSourceAzureSynapse' - $ref: '#/components/schemas/ApiDataSourceMicrosoftFabric' - $ref: '#/components/schemas/ApiDataSourceVertica' - $ref: '#/components/schemas/ApiDataSourceTrino' title: Response Create New Datasource Api V1 Data Sources Post description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: ApiDataSourceForm: discriminator: mapping: athena: '#/components/schemas/ApiDataSourceAwsAthena' aws_s3: '#/components/schemas/ApiDataSourceS3' azure_synapse: '#/components/schemas/ApiDataSourceAzureSynapse' bigquery: '#/components/schemas/ApiDataSourceBigQuery' databricks: '#/components/schemas/ApiDataSourceDatabricks' dremio: '#/components/schemas/ApiDataSourceDremio' duckdb: '#/components/schemas/ApiDataSourceDuckDB' files_azure_datalake: '#/components/schemas/ApiDataSourceAzureDataLake' google_cloud_storage: '#/components/schemas/ApiDataSourceGCS' mariadb: '#/components/schemas/ApiDataSourceMariaDB' microsoft_fabric: '#/components/schemas/ApiDataSourceMicrosoftFabric' mongodb: '#/components/schemas/ApiDataSourceMongoDB' mssql: '#/components/schemas/ApiDataSourceMSSQL' mysql: '#/components/schemas/ApiDataSourceMySQL' netezza: '#/components/schemas/ApiDataSourceNetezza' oracle: '#/components/schemas/ApiDataSourceOracle' pg: '#/components/schemas/ApiDataSourcePostgres' postgres_aurora: '#/components/schemas/ApiDataSourcePostgresAurora' postgres_aws_rds: '#/components/schemas/ApiDataSourcePostgresRds' redshift: '#/components/schemas/ApiDataSourceRedshift' sap_hana: '#/components/schemas/ApiDataSourceSapHana' snowflake: '#/components/schemas/ApiDataSourceSnowflake' starburst: '#/components/schemas/ApiDataSourceStarburst' teradata: '#/components/schemas/ApiDataSourceTeradata' trino: '#/components/schemas/ApiDataSourceTrino' vertica: '#/components/schemas/ApiDataSourceVertica' propertyName: type oneOf: - $ref: '#/components/schemas/ApiDataSourceBigQuery' - $ref: '#/components/schemas/ApiDataSourceDatabricks' - $ref: '#/components/schemas/ApiDataSourceDuckDB' - $ref: '#/components/schemas/ApiDataSourceMongoDB' - $ref: '#/components/schemas/ApiDataSourceMySQL' - $ref: '#/components/schemas/ApiDataSourceMariaDB' - $ref: '#/components/schemas/ApiDataSourceMSSQL' - $ref: '#/components/schemas/ApiDataSourceOracle' - $ref: '#/components/schemas/ApiDataSourcePostgres' - $ref: '#/components/schemas/ApiDataSourcePostgresAurora' - $ref: '#/components/schemas/ApiDataSourcePostgresRds' - $ref: '#/components/schemas/ApiDataSourceRedshift' - $ref: '#/components/schemas/ApiDataSourceTeradata' - $ref: '#/components/schemas/ApiDataSourceSapHana' - $ref: '#/components/schemas/ApiDataSourceAwsAthena' - $ref: '#/components/schemas/ApiDataSourceSnowflake' - $ref: '#/components/schemas/ApiDataSourceDremio' - $ref: '#/components/schemas/ApiDataSourceStarburst' - $ref: '#/components/schemas/ApiDataSourceNetezza' - $ref: '#/components/schemas/ApiDataSourceAzureDataLake' - $ref: '#/components/schemas/ApiDataSourceGCS' - $ref: '#/components/schemas/ApiDataSourceS3' - $ref: '#/components/schemas/ApiDataSourceAzureSynapse' - 
$ref: '#/components/schemas/ApiDataSourceMicrosoftFabric' - $ref: '#/components/schemas/ApiDataSourceVertica' - $ref: '#/components/schemas/ApiDataSourceTrino' title: ApiDataSourceForm ApiDataSourceBigQuery: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/BigQueryConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: bigquery title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceBigQuery type: object ApiDataSourceDatabricks: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - 
$ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/DatabricksConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: databricks title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceDatabricks type: object ApiDataSourceDuckDB: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/DuckDBConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string 
- type: 'null' title: Temp Schema type: const: duckdb title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceDuckDB type: object ApiDataSourceMongoDB: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/MongoDBConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: mongodb title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceMongoDB type: object ApiDataSourceMySQL: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is 
Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/MySQLConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: mysql title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceMySQL type: object ApiDataSourceMariaDB: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/MariaDBConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: 
- type: string - type: 'null' title: Temp Schema type: const: mariadb title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceMariaDB type: object ApiDataSourceMSSQL: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/MSSQLConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: mssql title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceMSSQL type: object ApiDataSourceOracle: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false 
title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/OracleConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: oracle title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceOracle type: object ApiDataSourcePostgres: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/PostgreSQLConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source 
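# --- Illustrative request sketch (not part of the generated OpenAPI document) ---
# The endpoint above is POST https://app.datafold.com/api/v1/data_sources, authenticated with the
# ApiKeyAuth scheme ('Authorization: Key <your API key>'), for example using the DATAFOLD_API_KEY
# environment variable mentioned in the spec description. Per ApiDataSourceForm, only `name` and
# `type` are required; for a Postgres connection the discriminator value is `pg`, and connection
# details belong in the type-specific `options` object (PostgreSQLConfig), whose fields are defined
# elsewhere in the spec. The connection name below is an invented example value.
#
#   curl -X POST "https://app.datafold.com/api/v1/data_sources" \
#     -H "Authorization: Key $DATAFOLD_API_KEY" \
#     -H "Content-Type: application/json" \
#     -d '{"name": "analytics-postgres", "type": "pg"}'
# --- end of illustrative sketch ---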
temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: pg title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourcePostgres type: object ApiDataSourcePostgresAurora: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/PostgreSQLAuroraConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: postgres_aurora title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourcePostgresAurora type: object ApiDataSourcePostgresRds: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id 
is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/PostgreSQLAuroraConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: postgres_aws_rds title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourcePostgresRds type: object ApiDataSourceRedshift: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/RedshiftConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 
'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: redshift title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceRedshift type: object ApiDataSourceTeradata: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/TeradataConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: teradata title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceTeradata type: object ApiDataSourceSapHana: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: 
Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/SapHanaConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: sap_hana title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceSapHana type: object ApiDataSourceAwsAthena: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/AwsAthenaConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S 
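# --- Illustrative response sketch (not part of the generated OpenAPI document) ---
# The 200 response is discriminated by the same `type` field as the request, so creating an Athena
# source (type `athena`) returns an ApiDataSourceAwsAthena object. A trimmed example follows; the
# `id` and `name` values are invented for illustration, while the remaining fields and their
# defaults come from the schema above.
#
#   {
#     "id": 101,
#     "name": "athena-prod",
#     "type": "athena",
#     "view_only": false,
#     "is_paused": false,
#     "disable_profiling": false
#   }
# --- end of illustrative sketch ---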
secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: athena title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceAwsAthena type: object ApiDataSourceSnowflake: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/SnowflakeConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: snowflake title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceSnowflake type: object ApiDataSourceDremio: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: 
boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/DremioConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: dremio title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceDremio type: object ApiDataSourceStarburst: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/StarburstConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer 
- type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: starburst title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceStarburst type: object ApiDataSourceNetezza: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/NetezzaConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: netezza title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceNetezza type: object ApiDataSourceAzureDataLake: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' 
title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/AzureDataLakeConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: files_azure_datalake title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceAzureDataLake type: object ApiDataSourceGCS: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/GCSConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema 
Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: google_cloud_storage title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceGCS type: object ApiDataSourceS3: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/AWSS3Config' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: aws_s3 title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceS3 type: object ApiDataSourceAzureSynapse: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - 
additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/MSSQLConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: azure_synapse title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceAzureSynapse type: object ApiDataSourceMicrosoftFabric: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/MicrosoftFabricConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name 
schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: microsoft_fabric title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceMicrosoftFabric type: object ApiDataSourceVertica: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/VerticaConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: vertica title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceVertica type: object ApiDataSourceTrino: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - 
type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/TrinoConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: trino title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceTrino type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ApiDataSourceTestStatus: properties: results: items: $ref: '#/components/schemas/TestResultStep' title: Results type: array tested_at: format: date-time title: Tested At type: string required: - tested_at - results title: ApiDataSourceTestStatus type: object BigQueryConfig: properties: extraProjectsToIndex: anyOf: - type: string - type: 'null' examples: - |- project1 project2 section: config title: List of extra projects to index (one per line) widget: multiline jsonKeyFile: format: password section: basic title: JSON Key File type: string writeOnly: true jsonOAuthKeyFile: anyOf: - format: password type: string writeOnly: true - type: 'null' section: basic title: JSON OAuth Key File location: default: US examples: - US section: basic title: Processing Location type: string projectId: section: basic title: Project ID type: string totalMBytesProcessedLimit: anyOf: - type: integer - type: 'null' section: config title: Scanned Data Limit (MB) useStandardSql: default: true section: config title: Use Standard SQL type: boolean userDefinedFunctionResourceUri: anyOf: - type: string - type: 'null' examples: - gs://bucket/date_utils.js section: config title: UDF Source URIs required: - projectId - jsonKeyFile title: BigQueryConfig type: object DatabricksConfig: properties: database: anyOf: - type: string - type: 'null' title: Database host: maxLength: 128 title: Host type: string http_password: format: password title: Access Token type: string writeOnly: true http_path: default: '' title: HTTP Path type: string oauth_dwh_client_id: anyOf: - type: string - type: 'null' title: Oauth Dwh Client Id 
oauth_dwh_client_secret: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Oauth Dwh Client Secret required: - host - http_password title: DatabricksConfig type: object DuckDBConfig: properties: {} title: DuckDBConfig type: object MongoDBConfig: properties: auth_source: anyOf: - type: string - type: 'null' default: admin title: Auth Source connect_timeout_ms: default: 60000 title: Connect Timeout Ms type: integer database: title: Database type: string host: maxLength: 128 title: Host type: string password: format: password title: Password type: string writeOnly: true port: default: 27017 title: Port type: integer server_selection_timeout_ms: default: 60000 title: Server Selection Timeout Ms type: integer socket_timeout_ms: default: 300000 title: Socket Timeout Ms type: integer username: title: Username type: string required: - database - username - password - host title: MongoDBConfig type: object MySQLConfig: properties: db: title: Database name type: string host: maxLength: 128 title: Host type: string password: format: password title: Password type: string writeOnly: true port: default: 3306 title: Port type: integer user: title: User type: string required: - host - user - password - db title: MySQLConfig type: object MariaDBConfig: description: |- Configuration for MariaDB connections. MariaDB is MySQL-compatible, so we reuse the MySQL configuration. Default port is 3306, same as MySQL. properties: db: title: Database name type: string host: maxLength: 128 title: Host type: string password: format: password title: Password type: string writeOnly: true port: default: 3306 title: Port type: integer user: title: User type: string required: - host - user - password - db title: MariaDBConfig type: object MSSQLConfig: properties: dbname: anyOf: - type: string - type: 'null' title: Dbname host: maxLength: 128 title: Host type: string password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: default: 1433 title: Port type: integer require_encryption: default: true title: Require Encryption type: boolean session_script: anyOf: - type: string - type: 'null' description: >- The script to execute on connection; e.g. ALTER SESSION SET CONTAINER = ... title: Init script trust_server_certificate: default: false title: Trust Server Certificate type: boolean user: default: DATAFOLD title: User type: string required: - host title: MSSQLConfig type: object OracleConfig: properties: database: anyOf: - type: string - type: 'null' title: Database database_type: anyOf: - enum: - service - sid type: string - type: 'null' title: Database Type ewallet_password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: EWallet password ewallet_pem_file: anyOf: - format: password type: string writeOnly: true - type: 'null' title: EWallet PEM ewallet_pkcs12_file: anyOf: - format: password type: string writeOnly: true - type: 'null' title: EWallet PKCS12 ewallet_type: anyOf: - enum: - x509 - pkcs12 type: string - type: 'null' title: Ewallet Type host: maxLength: 128 title: Host type: string password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: anyOf: - type: integer - type: 'null' title: Port session_script: anyOf: - type: string - type: 'null' description: >- The script to execute on connection; e.g. ALTER SESSION SET CONTAINER = ... title: Init script ssl: default: false title: Ssl type: boolean ssl_server_dn: anyOf: - type: string - type: 'null' description: 'e.g. 
C=US,O=example,CN=db.example.com; default: CN=' title: Server's SSL DN user: default: DATAFOLD title: User type: string required: - host title: OracleConfig type: object PostgreSQLConfig: properties: dbname: title: Database Name type: string host: maxLength: 128 title: Host type: string password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: default: 5432 title: Port type: integer role: anyOf: - type: string - type: 'null' title: Role (case sensitive) rootcert: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Root certificate sslmode: $ref: '#/components/schemas/SslMode' default: prefer title: SSL Mode user: title: User type: string required: - host - user - dbname title: PostgreSQLConfig type: object PostgreSQLAuroraConfig: properties: aws_access_key_id: anyOf: - type: string - type: 'null' title: AWS Access Key aws_cloudwatch_log_group: anyOf: - type: string - type: 'null' title: Cloudwatch Postgres Log Group aws_region: anyOf: - type: string - type: 'null' title: AWS Region aws_secret_access_key: anyOf: - format: password type: string writeOnly: true - type: 'null' title: AWS Secret dbname: title: Database Name type: string host: maxLength: 128 title: Host type: string keep_alive: anyOf: - type: integer - type: 'null' title: Keep Alive timeout in seconds, leave empty to disable password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: default: 5432 title: Port type: integer role: anyOf: - type: string - type: 'null' title: Role (case sensitive) rootcert: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Root certificate sslmode: $ref: '#/components/schemas/SslMode' default: prefer title: SSL Mode user: title: User type: string required: - host - user - dbname title: PostgreSQLAuroraConfig type: object RedshiftConfig: properties: adhoc_query_group: default: default section: config title: Query Group for Adhoc Queries type: string dbname: title: Database Name type: string host: maxLength: 128 title: Host type: string password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: default: 5432 title: Port type: integer role: anyOf: - type: string - type: 'null' title: Role (case sensitive) rootcert: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Root certificate scheduled_query_group: default: default section: config title: Query Group for Scheduled Queries type: string sslmode: $ref: '#/components/schemas/SslMode' default: prefer title: SSL Mode user: title: User type: string required: - host - user - dbname title: RedshiftConfig type: object TeradataConfig: properties: database: title: Database type: string host: maxLength: 128 title: Host type: string password: format: password title: Password type: string writeOnly: true port: anyOf: - type: integer - type: 'null' title: Port user: default: DATAFOLD title: User type: string required: - host - password - database title: TeradataConfig type: object SapHanaConfig: properties: host: maxLength: 128 title: Host type: string password: format: password title: Password type: string writeOnly: true port: default: 443 title: Port type: integer user: default: DATAFOLD title: User type: string required: - host - password title: SapHanaConfig type: object AwsAthenaConfig: properties: aws_access_key_id: title: Aws Access Key Id type: string aws_secret_access_key: format: password title: Aws Secret Access Key type: string writeOnly: true 
catalog: default: awsdatacatalog title: Catalog type: string database: default: default title: Database type: string region: title: Region type: string s3_staging_dir: format: uri minLength: 1 title: S3 Staging Dir type: string required: - aws_access_key_id - aws_secret_access_key - s3_staging_dir - region title: AwsAthenaConfig type: object SnowflakeConfig: properties: account: maxLength: 128 title: Account type: string authMethod: anyOf: - enum: - password - keypair type: string - type: 'null' title: Authmethod data_source_id: anyOf: - type: integer - type: 'null' title: Data Source Id default_db: default: '' examples: - MY_DB title: Default DB (case sensitive) type: string default_schema: default: PUBLIC examples: - PUBLIC section: config title: Default schema (case sensitive) type: string keyPairFile: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Key Pair file (private-key) metadata_database: default: SNOWFLAKE examples: - SNOWFLAKE section: config title: Database containing metadata (usually SNOWFLAKE) type: string oauth_dwh_client_id: anyOf: - type: string - type: 'null' title: Oauth Dwh Client Id oauth_dwh_client_secret: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Oauth Dwh Client Secret password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: anyOf: - type: integer - type: 'null' default: 443 title: Port region: anyOf: - type: string - type: 'null' section: config title: Region role: default: '' examples: - PUBLIC title: Role (case sensitive) type: string sql_variables: anyOf: - type: string - type: 'null' examples: - |- variable_1=10 variable_2=test section: config title: Session variables applied at every connection. widget: multiline user: default: DATAFOLD title: User type: string user_id: anyOf: - type: integer - type: 'null' title: User Id warehouse: default: '' examples: - COMPUTE_WH title: Warehouse (case sensitive) type: string required: - account title: SnowflakeConfig type: object DremioConfig: properties: certcheck: anyOf: - $ref: '#/components/schemas/CertCheck' - type: 'null' default: dremio-cloud title: Certificate check customcert: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Custom certificate host: maxLength: 128 title: Host type: string password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: default: 443 title: Port type: integer project_id: anyOf: - type: string - type: 'null' title: Project id role: anyOf: - type: string - type: 'null' title: Role (case sensitive) tls: default: false title: Encryption type: boolean token: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Token username: anyOf: - type: string - type: 'null' title: User ID (optional) view_temp_schema: anyOf: - type: string - type: 'null' title: Temporary schema for views required: - host title: DremioConfig type: object StarburstConfig: properties: host: maxLength: 128 title: Host type: string password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: default: 443 title: Port type: integer tls: default: true title: Encryption type: boolean token: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Token username: anyOf: - type: string - type: 'null' title: User ID (optional) required: - host title: StarburstConfig type: object NetezzaConfig: properties: database: maxLength: 128 title: Database type: string host: 
maxLength: 128 title: Host type: string password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: default: 5480 title: Port type: integer tls: default: true title: Encryption type: boolean username: anyOf: - type: string - type: 'null' title: User ID (optional) required: - host - database title: NetezzaConfig type: object AzureDataLakeConfig: properties: account_name: title: Account Name type: string client_id: anyOf: - maxLength: 1024 type: string - type: 'null' title: Client Id client_secret: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Client Secret materialize_max_rows: anyOf: - type: integer - type: 'null' title: Materialize Max Rows materialize_path: anyOf: - type: string - type: 'null' title: Materialize Path tenant_id: anyOf: - maxLength: 1024 type: string - type: 'null' title: Tenant Id required: - account_name - tenant_id - client_id title: AzureDataLakeConfig type: object GCSConfig: properties: bucket_name: title: Bucket Name type: string bucket_region: title: Bucket Region type: string jsonKeyFile: format: password section: basic title: JSON Key File type: string writeOnly: true materialize_max_rows: anyOf: - type: integer - type: 'null' title: Materialize Max Rows materialize_path: anyOf: - type: string - type: 'null' title: Materialize Path required: - bucket_name - jsonKeyFile - bucket_region title: GCSConfig type: object AWSS3Config: properties: bucket_name: title: Bucket Name type: string key_id: anyOf: - maxLength: 1024 type: string - type: 'null' title: Key Id materialize_max_rows: anyOf: - type: integer - type: 'null' title: Materialize Max Rows materialize_path: anyOf: - type: string - type: 'null' title: Materialize Path region: title: Region type: string secret: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Secret required: - bucket_name - key_id - region title: AWSS3Config type: object MicrosoftFabricConfig: properties: client_id: description: Microsoft Entra ID Application (Client) ID title: Application (Client) ID type: string client_secret: description: Microsoft Entra ID Application Client Secret format: password title: Client Secret type: string writeOnly: true dbname: title: Dbname type: string host: maxLength: 128 title: Host type: string session_script: anyOf: - type: string - type: 'null' description: >- The script to execute on connection; e.g. ALTER SESSION SET CONTAINER = ... 
title: Init script tenant_id: description: Microsoft Entra ID Tenant ID title: Tenant ID type: string required: - host - dbname - tenant_id - client_id - client_secret title: MicrosoftFabricConfig type: object VerticaConfig: properties: dbname: title: Database Name type: string host: maxLength: 128 title: Host type: string password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: default: 5433 title: Port type: integer role: anyOf: - type: string - type: 'null' title: Role (case sensitive) sslmode: $ref: '#/components/schemas/SslMode' default: prefer title: SSL Mode user: title: User type: string required: - host - user - dbname title: VerticaConfig type: object TrinoConfig: properties: dbname: title: Catalog Name type: string hive_timestamp_precision: anyOf: - enum: - 3 - 6 - 9 type: integer - type: 'null' description: 'Optional: Timestamp precision if using Hive connector' title: Hive Timestamp Precision host: maxLength: 128 title: Host type: string password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: default: 8080 title: Port type: integer ssl_verification: $ref: '#/components/schemas/SSLVerification' default: full title: SSL Verification tls: default: true title: Encryption type: boolean user: title: User type: string required: - host - user - dbname title: TrinoConfig type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object TestResultStep: properties: result: anyOf: - {} - type: 'null' title: Result status: $ref: '#/components/schemas/JobStatus' step: $ref: '#/components/schemas/ConfigurationCheckStep' required: - step - status title: TestResultStep type: object SslMode: description: >- SSL mode for database connections (used by PostgreSQL, Vertica, Redshift, etc.) enum: - prefer - require - verify-ca - verify-full title: SslMode type: string CertCheck: enum: - disable - dremio-cloud - customcert title: CertCheck type: string SSLVerification: enum: - full - none - ca title: SSLVerification type: string JobStatus: enum: - needs_confirmation - needs_authentication - waiting - processing - done - failed - cancelled title: JobStatus type: string ConfigurationCheckStep: enum: - connection - temp_schema - schema_download - lineage_download title: ConfigurationCheckStep type: string securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/monitors/create-a-data-test-monitor.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Create a Data Test Monitor ## OpenAPI ````yaml openapi-public.json post /api/v1/monitors/create/test openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. 
This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/monitors/create/test: post: tags: - Monitors summary: Create a Data Test Monitor operationId: create_monitor_test_api_v1_monitors_create_test_post requestBody: content: application/json: schema: $ref: '#/components/schemas/DataTestMonitorSpecPublic' required: true responses: '200': content: application/json: schema: $ref: '#/components/schemas/ApiPublicCreateMonitorOut' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: DataTestMonitorSpecPublic: properties: connection_id: description: The identifier for the data source configuration. title: Connection Id type: integer description: anyOf: - type: string - type: 'null' description: The description of the monitor. title: Description enabled: default: true description: Indicates whether the monitor is enabled. title: Enabled type: boolean name: description: The name of the monitor. title: Name type: string notifications: description: Notification configuration for the monitor. items: discriminator: mapping: email: '#/components/schemas/EmailNotification' pagerduty: '#/components/schemas/PagerDutyNotification' slack: '#/components/schemas/SlackNotification' teams: '#/components/schemas/TeamsNotification' webhook: '#/components/schemas/WebhookNotification' propertyName: type oneOf: - $ref: '#/components/schemas/EmailNotification' - $ref: '#/components/schemas/PagerDutyNotification' - $ref: '#/components/schemas/WebhookNotification' - $ref: '#/components/schemas/SlackNotification' - $ref: '#/components/schemas/TeamsNotification' title: Notifications type: array query: anyOf: - type: string - type: 'null' description: The SQL query to be evaluated. title: Query schedule: anyOf: - $ref: '#/components/schemas/IntervalSchedule' - $ref: '#/components/schemas/CronSchedule' - $ref: '#/components/schemas/NoneSchedule' description: The schedule at which the monitor runs. tags: description: Tags associated with the monitor. items: type: string title: Tags type: array test: anyOf: - $ref: '#/components/schemas/StandardDataTestMonitorSpec' - type: 'null' required: - schedule - name - connection_id title: DataTestMonitorSpecPublic type: object ApiPublicCreateMonitorOut: properties: id: description: Unique identifier for the monitor. title: Id type: integer required: - id title: ApiPublicCreateMonitorOut type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object EmailNotification: properties: features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features recipients: description: A list of email addresses to receive the notification. items: type: string title: Recipients type: array type: const: email default: email title: Type type: string required: - recipients title: Email type: object PagerDutyNotification: properties: features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features integration: description: The identifier for the integration. 
title: Integration type: integer type: const: pagerduty default: pagerduty title: Type type: string required: - integration title: PagerDuty type: object WebhookNotification: properties: features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features integration: description: The identifier for the integration. title: Integration type: integer type: const: webhook default: webhook title: Type type: string required: - integration title: Webhook type: object SlackNotification: properties: channel: description: The channel through which the notification will be sent. title: Channel type: string features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features integration: description: The identifier for the integration. title: Integration type: integer mentions: description: A list of mentions to include in the notification. items: type: string title: Mentions type: array type: const: slack default: slack title: Type type: string required: - integration - channel title: Slack type: object TeamsNotification: properties: channel: description: The channel through which the notification will be sent. title: Channel type: string features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features integration: description: The identifier for the integration. title: Integration type: integer mentions: description: A list of mentions names to include in the notification. items: type: string title: Mentions type: array type: const: teams default: teams title: Type type: string required: - integration - channel title: Teams type: object IntervalSchedule: properties: interval: anyOf: - $ref: '#/components/schemas/HourIntervalSchedule' - $ref: '#/components/schemas/DayIntervalSchedule' description: Specifies the scheduling interval. required: - interval title: Interval type: object CronSchedule: properties: cron: description: The cron expression that defines the schedule. 
title: Cron type: string type: const: crontab default: crontab title: Type type: string required: - cron title: Cron type: object NoneSchedule: properties: type: const: none default: none title: Type type: string title: None type: object StandardDataTestMonitorSpec: properties: tables: anyOf: - items: $ref: '#/components/schemas/SDTTable' type: array - type: 'null' title: Tables type: $ref: '#/components/schemas/StandardDataTestTypes' variables: anyOf: - additionalProperties: $ref: '#/components/schemas/SDTVariable' type: object - type: 'null' title: Variables required: - type title: Standard DT type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object DestinationFeatures: enum: - attach_csv - notify_first_triggered_only - disable_recovery_notifications - notify_every_run title: DestinationFeatures type: string HourIntervalSchedule: properties: every: const: hour title: Every type: string type: const: hourly default: hourly title: Type type: string required: - every title: Hour type: object DayIntervalSchedule: properties: every: const: day title: Every type: string hour: anyOf: - type: integer - type: 'null' description: The hour at which the monitor should trigger. (0 - 23) title: Hour type: const: daily default: daily title: Type type: string utc_at: anyOf: - format: time type: string - type: 'null' description: The UTC time at which the monitor should trigger. title: Utc At required: - every title: Day type: object SDTTable: properties: columns: items: type: string title: Columns type: array path: title: Path type: string required: - path - columns title: SDTTable type: object StandardDataTestTypes: enum: - unique - not_null - accepted_values - referential_integrity - numeric_range - custom_template title: StandardDataTestTypes type: string SDTVariable: properties: quote: default: true title: Quote type: boolean value: anyOf: - type: string - type: integer - type: number - items: type: string type: array - items: type: integer type: array - items: type: number type: array - items: anyOf: - type: string - type: integer - type: number type: array title: Value required: - value title: SDTVariable type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/bi/create-a-dbt-bi-integration.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Create a DBT BI integration ## OpenAPI ````yaml post /api/v1/lineage/bi/dbt/ openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. 
::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/lineage/bi/dbt/: post: tags: - BI - bi_added summary: Create a DBT BI integration operationId: create_dbt_integration_api_v1_lineage_bi_dbt__post requestBody: content: application/json: schema: $ref: '#/components/schemas/DbtDataSourceConfig' required: true responses: '200': content: application/json: schema: {} description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: DbtDataSourceConfig: properties: ci_config_id: title: Ci Config Id type: integer indexing_cron: anyOf: - type: string - type: 'null' title: Indexing Cron name: anyOf: - type: string - type: 'null' title: Name required: - ci_config_id title: DbtDataSourceConfig type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/bi/create-a-hightouch-integration.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Create a Hightouch integration ## OpenAPI ````yaml post /api/v1/lineage/bi/hightouch/ openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. 
::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/lineage/bi/hightouch/: post: tags: - BI - bi_added summary: Create a Hightouch integration operationId: create_hightouch_integration_api_v1_lineage_bi_hightouch__post requestBody: content: application/json: schema: $ref: '#/components/schemas/HighTouchDataSourceConfig' required: true responses: '200': content: application/json: schema: {} description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: HighTouchDataSourceConfig: properties: bindings: items: $ref: '#/components/schemas/DataSourceBinding' title: Bindings type: array indexing_cron: anyOf: - type: string - type: 'null' title: Indexing Cron name: anyOf: - type: string - type: 'null' title: Name token: format: password title: Token type: string writeOnly: true workspace: anyOf: - type: string - type: 'null' title: Workspace required: - token - bindings title: HighTouchDataSourceConfig type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object DataSourceBinding: properties: boundIds: items: type: integer title: Boundids type: array remoteId: title: Remoteid type: string required: - remoteId - boundIds title: DataSourceBinding type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/bi/create-a-looker-integration.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Create a Looker integration ## OpenAPI ````yaml post /api/v1/lineage/bi/looker/ openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. 
::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/lineage/bi/looker/: post: tags: - BI - bi_added summary: Create a Looker integration operationId: create_looker_integration_api_v1_lineage_bi_looker__post requestBody: content: application/json: schema: $ref: '#/components/schemas/LookerDataSourceConfig' required: true responses: '200': content: application/json: schema: {} description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: LookerDataSourceConfig: properties: base_url: title: Base Url type: string bindings: default: [] items: $ref: '#/components/schemas/DataSourceBinding' title: Bindings type: array client_id: title: Client Id type: string client_secret: format: password title: Client Secret type: string writeOnly: true indexing_cron: anyOf: - type: string - type: 'null' title: Indexing Cron name: anyOf: - type: string - type: 'null' title: Name project_ids: default: [] items: type: string title: Project Ids type: array repo_id: title: Repo Id type: integer required: - base_url - client_id - repo_id - client_secret title: LookerDataSourceConfig type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object DataSourceBinding: properties: boundIds: items: type: integer title: Boundids type: array remoteId: title: Remoteid type: string required: - remoteId - boundIds title: DataSourceBinding type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/monitors/create-a-metric-monitor.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Create a Metric Monitor ## OpenAPI ````yaml openapi-public.json post /api/v1/monitors/create/metric openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. 
::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/monitors/create/metric: post: tags: - Monitors summary: Create a Metric Monitor operationId: create_monitor_metric_api_v1_monitors_create_metric_post requestBody: content: application/json: schema: $ref: '#/components/schemas/MetricMonitorSpecPublic' required: true responses: '200': content: application/json: schema: $ref: '#/components/schemas/ApiPublicCreateMonitorOut' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: MetricMonitorSpecPublic: properties: alert: anyOf: - discriminator: mapping: absolute: '#/components/schemas/AbsoluteThreshold' automatic: '#/components/schemas/AnomalyDetectionThreshold' percentage: '#/components/schemas/PercentageThreshold' propertyName: type oneOf: - $ref: '#/components/schemas/AnomalyDetectionThreshold' - $ref: '#/components/schemas/AbsoluteThreshold' - $ref: '#/components/schemas/PercentageThreshold' - type: 'null' description: Condition for triggering alerts. connection_id: description: The identifier for the data source configuration. title: Connection Id type: integer description: anyOf: - type: string - type: 'null' description: The description of the monitor. title: Description enabled: default: true description: Indicates whether the monitor is enabled. title: Enabled type: boolean metric: description: Configuration for the metric being monitored. discriminator: mapping: column: '#/components/schemas/ColumnMetricMonitorConfig' custom: '#/components/schemas/CustomMetricMonitorConfig' table: '#/components/schemas/BaseTableMetricMonitorConfig' propertyName: type oneOf: - $ref: '#/components/schemas/BaseTableMetricMonitorConfig' - $ref: '#/components/schemas/ColumnMetricMonitorConfig' - $ref: '#/components/schemas/CustomMetricMonitorConfig' name: description: The name of the monitor. title: Name type: string notifications: description: Notification configuration for the monitor. items: discriminator: mapping: email: '#/components/schemas/EmailNotification' pagerduty: '#/components/schemas/PagerDutyNotification' slack: '#/components/schemas/SlackNotification' teams: '#/components/schemas/TeamsNotification' webhook: '#/components/schemas/WebhookNotification' propertyName: type oneOf: - $ref: '#/components/schemas/EmailNotification' - $ref: '#/components/schemas/PagerDutyNotification' - $ref: '#/components/schemas/WebhookNotification' - $ref: '#/components/schemas/SlackNotification' - $ref: '#/components/schemas/TeamsNotification' title: Notifications type: array schedule: anyOf: - $ref: '#/components/schemas/IntervalSchedule' - $ref: '#/components/schemas/CronSchedule' - $ref: '#/components/schemas/NoneSchedule' description: The schedule at which the monitor runs. tags: description: Tags associated with the monitor. items: type: string title: Tags type: array required: - schedule - name - connection_id - metric title: MetricMonitorSpecPublic type: object ApiPublicCreateMonitorOut: properties: id: description: Unique identifier for the monitor. 
title: Id type: integer required: - id title: ApiPublicCreateMonitorOut type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object AnomalyDetectionThreshold: properties: sensitivity: description: Sensitivity level for anomaly detection, ranging from 0 to 100. maximum: 100 minimum: 0 title: Sensitivity type: integer type: const: automatic title: Type type: string required: - type - sensitivity title: Anomaly Detection type: object AbsoluteThreshold: properties: max: anyOf: - type: number - type: 'null' description: Maximum value for the absolute threshold. title: Max min: anyOf: - type: number - type: 'null' description: Minimum value for the absolute threshold. title: Min type: const: absolute title: Type type: string required: - type title: Absolute type: object PercentageThreshold: properties: decrease: anyOf: - type: number - type: integer - type: 'null' description: Threshold for allowable percentage decrease. title: Decrease increase: anyOf: - type: number - type: integer - type: 'null' description: Threshold for allowable percentage increase. title: Increase type: const: percentage title: Type type: string required: - type title: Percentage type: object BaseTableMetricMonitorConfig: properties: filter: anyOf: - type: string - type: 'null' description: Filter condition to evaluate. title: Filter metric: anyOf: - $ref: '#/components/schemas/TableMetricAlias' description: The table metric configuration. table: anyOf: - type: string - type: 'null' description: The name of the table. title: Table type: const: table default: table title: Type type: string required: - metric title: Table type: object ColumnMetricMonitorConfig: properties: column: description: The column of the table. title: Column type: string filter: anyOf: - type: string - type: 'null' description: Filter condition to evaluate. title: Filter metric: anyOf: - $ref: '#/components/schemas/ColumnMetricAlias' description: The column metric configuration. title: Metric table: description: The name of the table. title: Table type: string type: const: column default: column title: Type type: string required: - table - column - metric title: Column type: object CustomMetricMonitorConfig: properties: alert_on_missing_data: default: false description: Trigger alert if query returns unexpectedly few data points. title: Alert On Missing Data type: boolean query: description: The SQL query to be evaluated. title: Query type: string type: const: custom default: custom title: Type type: string required: - query title: Custom type: object EmailNotification: properties: features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features recipients: description: A list of email addresses to receive the notification. items: type: string title: Recipients type: array type: const: email default: email title: Type type: string required: - recipients title: Email type: object PagerDutyNotification: properties: features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features integration: description: The identifier for the integration. 
title: Integration type: integer type: const: pagerduty default: pagerduty title: Type type: string required: - integration title: PagerDuty type: object WebhookNotification: properties: features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features integration: description: The identifier for the integration. title: Integration type: integer type: const: webhook default: webhook title: Type type: string required: - integration title: Webhook type: object SlackNotification: properties: channel: description: The channel through which the notification will be sent. title: Channel type: string features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features integration: description: The identifier for the integration. title: Integration type: integer mentions: description: A list of mentions to include in the notification. items: type: string title: Mentions type: array type: const: slack default: slack title: Type type: string required: - integration - channel title: Slack type: object TeamsNotification: properties: channel: description: The channel through which the notification will be sent. title: Channel type: string features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features integration: description: The identifier for the integration. title: Integration type: integer mentions: description: A list of mentions names to include in the notification. items: type: string title: Mentions type: array type: const: teams default: teams title: Type type: string required: - integration - channel title: Teams type: object IntervalSchedule: properties: interval: anyOf: - $ref: '#/components/schemas/HourIntervalSchedule' - $ref: '#/components/schemas/DayIntervalSchedule' description: Specifies the scheduling interval. required: - interval title: Interval type: object CronSchedule: properties: cron: description: The cron expression that defines the schedule. title: Cron type: string type: const: crontab default: crontab title: Type type: string required: - cron title: Cron type: object NoneSchedule: properties: type: const: none default: none title: Type type: string title: None type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object TableMetricAlias: enum: - row_count - freshness title: TableMetricAlias type: string ColumnMetricAlias: enum: - minimum - maximum - std_dev - cardinality - uniqueness - median - average - sum - fill_rate title: ColumnMetricAlias type: string DestinationFeatures: enum: - attach_csv - notify_first_triggered_only - disable_recovery_notifications - notify_every_run title: DestinationFeatures type: string HourIntervalSchedule: properties: every: const: hour title: Every type: string type: const: hourly default: hourly title: Type type: string required: - every title: Hour type: object DayIntervalSchedule: properties: every: const: day title: Every type: string hour: anyOf: - type: integer - type: 'null' description: The hour at which the monitor should trigger. 
(0 - 23) title: Hour type: const: daily default: daily title: Type type: string utc_at: anyOf: - format: time type: string - type: 'null' description: The UTC time at which the monitor should trigger. title: Utc At required: - every title: Day type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/bi/create-a-mode-analytics-integration.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Create a Mode Analytics integration ## OpenAPI ````yaml post /api/v1/lineage/bi/mode/ openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/lineage/bi/mode/: post: tags: - BI - bi_added summary: Create a Mode Analytics integration operationId: create_mode_integration_api_v1_lineage_bi_mode__post requestBody: content: application/json: schema: $ref: '#/components/schemas/ModeDataSourceConfig' required: true responses: '200': content: application/json: schema: {} description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: ModeDataSourceConfig: properties: indexing_cron: anyOf: - type: string - type: 'null' title: Indexing Cron name: anyOf: - type: string - type: 'null' title: Name password: format: password title: Password type: string writeOnly: true token: format: password title: Token type: string writeOnly: true workspace: default: '' title: Workspace type: string required: - token - password title: ModeDataSourceConfig type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/bi/create-a-power-bi-integration.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Create a Power BI integration ## OpenAPI ````yaml openapi-public.json post /api/v1/lineage/bi/powerbi/ openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. 
If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/lineage/bi/powerbi/: post: tags: - BI - BI - bi_added summary: Create a Power BI integration operationId: create_powerbi_integration_api_v1_lineage_bi_powerbi__post requestBody: content: application/json: schema: $ref: '#/components/schemas/PowerBIDataSourceConfig' required: true responses: '200': content: application/json: schema: {} description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: PowerBIDataSourceConfig: description: Power BI data source parameters. properties: auth_type: anyOf: - $ref: '#/components/schemas/PowerBIAuthType' - type: 'null' client_id: anyOf: - type: string - type: 'null' title: Client Id client_secret: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Client Secret indexing_cron: anyOf: - type: string - type: 'null' title: Indexing Cron name: anyOf: - type: string - type: 'null' title: Name tenant_id: anyOf: - type: string - type: 'null' title: Tenant Id title: PowerBIDataSourceConfig type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object PowerBIAuthType: enum: - delegated - service_principal title: PowerBIAuthType type: string ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/monitors/create-a-schema-change-monitor.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Create a Schema Change Monitor ## OpenAPI ````yaml openapi-public.json post /api/v1/monitors/create/schema openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. 
::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/monitors/create/schema: post: tags: - Monitors summary: Create a Schema Change Monitor operationId: create_monitor_schema_api_v1_monitors_create_schema_post requestBody: content: application/json: schema: $ref: '#/components/schemas/SchemaChangeMonitorSpecPublic' required: true responses: '200': content: application/json: schema: $ref: '#/components/schemas/ApiPublicCreateMonitorOut' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: SchemaChangeMonitorSpecPublic: properties: connection_id: description: The identifier for the data source configuration. title: Connection Id type: integer description: anyOf: - type: string - type: 'null' description: The description of the monitor. title: Description enabled: default: true description: Indicates whether the monitor is enabled. title: Enabled type: boolean name: description: The name of the monitor. title: Name type: string notifications: description: Notification configuration for the monitor. items: discriminator: mapping: email: '#/components/schemas/EmailNotification' pagerduty: '#/components/schemas/PagerDutyNotification' slack: '#/components/schemas/SlackNotification' teams: '#/components/schemas/TeamsNotification' webhook: '#/components/schemas/WebhookNotification' propertyName: type oneOf: - $ref: '#/components/schemas/EmailNotification' - $ref: '#/components/schemas/PagerDutyNotification' - $ref: '#/components/schemas/WebhookNotification' - $ref: '#/components/schemas/SlackNotification' - $ref: '#/components/schemas/TeamsNotification' title: Notifications type: array schedule: anyOf: - $ref: '#/components/schemas/IntervalSchedule' - $ref: '#/components/schemas/CronSchedule' - $ref: '#/components/schemas/NoneSchedule' description: The schedule at which the monitor runs. table: anyOf: - type: string - type: 'null' description: The name of the table. title: Table tags: description: Tags associated with the monitor. items: type: string title: Tags type: array required: - schedule - name - connection_id title: SchemaChangeMonitorSpecPublic type: object ApiPublicCreateMonitorOut: properties: id: description: Unique identifier for the monitor. title: Id type: integer required: - id title: ApiPublicCreateMonitorOut type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object EmailNotification: properties: features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features recipients: description: A list of email addresses to receive the notification. items: type: string title: Recipients type: array type: const: email default: email title: Type type: string required: - recipients title: Email type: object PagerDutyNotification: properties: features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features integration: description: The identifier for the integration. 
title: Integration type: integer type: const: pagerduty default: pagerduty title: Type type: string required: - integration title: PagerDuty type: object WebhookNotification: properties: features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features integration: description: The identifier for the integration. title: Integration type: integer type: const: webhook default: webhook title: Type type: string required: - integration title: Webhook type: object SlackNotification: properties: channel: description: The channel through which the notification will be sent. title: Channel type: string features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features integration: description: The identifier for the integration. title: Integration type: integer mentions: description: A list of mentions to include in the notification. items: type: string title: Mentions type: array type: const: slack default: slack title: Type type: string required: - integration - channel title: Slack type: object TeamsNotification: properties: channel: description: The channel through which the notification will be sent. title: Channel type: string features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features integration: description: The identifier for the integration. title: Integration type: integer mentions: description: A list of mentions names to include in the notification. items: type: string title: Mentions type: array type: const: teams default: teams title: Type type: string required: - integration - channel title: Teams type: object IntervalSchedule: properties: interval: anyOf: - $ref: '#/components/schemas/HourIntervalSchedule' - $ref: '#/components/schemas/DayIntervalSchedule' description: Specifies the scheduling interval. required: - interval title: Interval type: object CronSchedule: properties: cron: description: The cron expression that defines the schedule. title: Cron type: string type: const: crontab default: crontab title: Type type: string required: - cron title: Cron type: object NoneSchedule: properties: type: const: none default: none title: Type type: string title: None type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object DestinationFeatures: enum: - attach_csv - notify_first_triggered_only - disable_recovery_notifications - notify_every_run title: DestinationFeatures type: string HourIntervalSchedule: properties: every: const: hour title: Every type: string type: const: hourly default: hourly title: Type type: string required: - every title: Hour type: object DayIntervalSchedule: properties: every: const: day title: Every type: string hour: anyOf: - type: integer - type: 'null' description: The hour at which the monitor should trigger. (0 - 23) title: Hour type: const: daily default: daily title: Type type: string utc_at: anyOf: - format: time type: string - type: 'null' description: The UTC time at which the monitor should trigger. 
title: Utc At required: - every title: Day type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/bi/create-a-tableau-integration.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Create a Tableau integration ## OpenAPI ````yaml post /api/v1/lineage/bi/tableau/ openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/lineage/bi/tableau/: post: tags: - BI - bi_added summary: Create a Tableau integration operationId: create_tableau_integration_api_v1_lineage_bi_tableau__post requestBody: content: application/json: schema: $ref: '#/components/schemas/TableauDataSourceConfig' required: true responses: '200': content: application/json: schema: {} description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: TableauDataSourceConfig: properties: indexing_cron: anyOf: - type: string - type: 'null' title: Indexing Cron name: anyOf: - type: string - type: 'null' title: Name server_url: title: Server Url type: string site_id: title: Site Id type: string token_name: title: Token Name type: string token_value: format: password title: Token Value type: string writeOnly: true required: - token_name - token_value - site_id - server_url title: TableauDataSourceConfig type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/data-diff/in-database-diffing/creating-a-new-data-diff.md # Source: https://docs.datafold.com/data-diff/cross-database-diffing/creating-a-new-data-diff.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Creating a New Data Diff > Datafold's Data Diff can compare data across databases (e.g., PostgreSQL <> Snowflake, or between two SQL Server instances) to validate migrations, meet regulatory and compliance requirements, or ensure data is flowing successfully from source to target. This powerful algorithm provides full row-, column-, and value-level detail into discrepancies between data tables. 
## Creating a new data diff

Setting up a new data diff in Datafold is straightforward. You can configure your data diffs with the following parameters and options:

### Source and Target datasets

#### Data connection

Pick your data connection(s).

#### Diff type

Choose how you want to compare your data:

* Table: Select this to compare data directly from database tables
* Query: Use this to compare results from specific SQL queries

#### Dataset

Choose the dataset you want to compare. This can be a table or a view in your relational database.

#### Filter

Insert your filter clause after the WHERE keyword to refine your dataset. For example: `created_at > '2000-01-01'` will only include data created after January 1, 2000.

### Materialize inputs

Select this option to improve diffing speed when the query is compute-heavy, when filters are applied to non-indexed columns, or when primary keys are transformed using concatenation, coalesce, or another function.

## Column remapping

Designate columns with the same data type and different column names to be compared. Data Diff will surface differences under the column name used in the Source dataset. Datafold automatically handles differences in data types to ensure accurate comparisons. See our best practices below for how this is handled.

## General

### Primary key

The primary key is one or more columns used to uniquely identify a row in the dataset during diffing. The primary key (or keys) does not need to be formally defined in the database or elsewhere as it is used for unique row identification during diffing.

Textual primary keys do not support values outside the set of characters `a-zA-Z0-9!"()*/^+-<>=`. If these values exist, we recommend filtering them out before running the diff operation.

### Columns

#### Columns to compare

Specify which columns to compare between datasets. Note that this has performance implications when comparing a large number of columns, especially in wide tables with 30 or more columns. It is recommended to initially focus on comparisons using only the primary key or to select a limited subset of columns.

### Row sampling

Use sampling to compare a subset of your data instead of the entire dataset. This is best for diffing large datasets. Sampling can be configured to select a percentage of rows to compare, or to ensure differences are found to a chosen degree of statistical confidence.

#### Sampling tolerance

Sampling tolerance defines the allowable margin of error for our estimate. It sets the acceptable percentage of rows with primary key errors (e.g., nulls, duplicates, or primary keys exclusive to one dataset) before disabling sampling.

When sampling is enabled, not every row is examined, which introduces a probability of missing certain discrepancies. This threshold represents the level of difference we are willing to accept before considering the results unreliable and thereby disabling sampling. It essentially sets a limit on how much variance is tolerable in the sample compared to the complete dataset.

Default: 0.001%

#### Sampling confidence

Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset. It represents the minimum confidence level that the rate of primary key errors is below the threshold defined in sampling tolerance.

To put it simply, a 95% confidence level with a 5% tolerance means we are 95% certain that the true value falls within 5% of our estimate.

Default: 99%

#### Sampling threshold

Sampling is automatically disabled when the total row count of the largest table in the comparison falls below a specified threshold value. This approach is adopted because, for smaller datasets, a complete dataset comparison is not only more feasible but also quicker and more efficient than sampling. Disabling sampling in these scenarios ensures comprehensive data coverage and provides more accurate insights, as it becomes practical to examine every row in the dataset without significant time or resource constraints.

#### Sample size

This provides an estimated count of the total number of rows included in the combined sample from Datasets A and B, used for the diffing process. It's important to note that this number is an estimate and can vary from the actual sample size due to several factors:

* The presence of duplicate primary keys in the datasets will likely increase this estimate, as it inflates the perceived uniqueness of rows.
* Applying filters to the datasets tends to reduce the estimate, as it narrows down the data scope.

The number of rows we sample is not fixed; instead, we use a statistical approach called the Poisson distribution. This involves picking rows randomly from an infinite pool of rows with uniform random sampling. Importantly, we don't need to perform a full diff (compare every single row) to establish a baseline.

Example: Imagine there are two datasets we want to compare, Source and Target. Since we prefer not to check every row, we use a statistical approach to determine the number of rows to sample from each dataset. To do so, we set the following parameters:

* Sampling tolerance: 5%
* Sampling confidence: 95%

Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset, while sampling tolerance defines the allowable margin of error for our estimate. Here, with a 95% sampling confidence and a 5% sampling tolerance, we are 95% confident that the true value falls within 5% of our estimate. Datafold will then estimate the sample size needed (e.g., 200 rows) to achieve these parameters.

### Advanced

#### Materialize diff results to table

Create a detailed table from your diff results, indicating each row where differences occur. This table will include corresponding values from both datasets and flags showing whether each row matches or mismatches.

---

# Source: https://docs.datafold.com/data-migration-automation/cross-database-diffing-migrations.md

# Cross-Database Diffing for Migrations

> Validate migration parity with Datafold's cross-database diffing solution.

When migrating data from one system to another, ensuring that the data is accurately transferred and remains consistent is critical. Datafold’s cross-database diffing provides a robust method to validate parity between the source and target databases. It compares data across databases, identifying discrepancies at the dataset, column, and row levels, ensuring full confidence in your migration process.

## How cross-database diffing works

Datafold connects to any SQL source and target databases, similar to how BI tools do. Datasets from both data connections are co-located in a centralized database to execute comparisons and identify specific rows, columns, and values with differences. To perform diffs at massive scale and increased speed, users can apply sampling, filtering, and column selection.

### What kind of information does Datafold output?
Datafold’s cross-database diffing will produce the following results:

* **High-Level Summary:**
  * Total number of different rows
  * Total number of rows (primary keys) that are present in one database but not the other
  * Aggregate schema differences
* **Schema Differences:** Per-column mapping of data types, column order, etc.
* **Primary Key Differences:** Sample of specific rows that are present in one database but not the other.
* **Value-Level Differences:** Sample of differing column values for each column with identified discrepancies. The full dataset of differences can be downloaded or materialized to the warehouse.

### How does a user run a data diff?

Users can run data diffs through the following methods:

* Via Datafold’s interactive UI
* Via the Datafold API
* On a schedule (as a monitor) with optional alerting via Slack, email, PagerDuty, etc.

### Can I run multiple data diffs at the same time?

Yes, users can run as many diffs as they would like, with concurrency limited by the underlying database.

### What if my data is changing and replicated live, how can I ensure proper comparison?

In such cases, we recommend using watermarking—diffing data within a specified time window of row creation or update (e.g., `updated_at timestamp`).

### What if the data types do not match between source and target?

Datafold performs best-effort type matching for cases where deterministic type casting is possible, e.g., comparing `VARCHAR` type with `STRING` type. When automatic type casting without information loss is not possible, the user can define type casting manually using diffing in Query mode.

### Can data diff help if the dataset in the source and target databases has a different shape/schema/column naming?

Yes, users can reshape input datasets by writing a SQL query and diffing in Query mode to bring the dataset to a comparable shape. Datafold also supports column remapping for datasets with different column names between tables.

## Learn more

To learn more, check out our guide on [how cross-database diffing works](../data-diff/cross-database-diffing/creating-a-new-data-diff) in Datafold, or explore our extensive [FAQ section](../faq/data-migration-automation) covering cross-database diffing and data migration.

---

# Source: https://docs.datafold.com/integrations/orchestrators/custom-integrations.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.datafold.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Custom Integrations

> Integrate Datafold with your custom orchestration using the Datafold SDK and REST API.

To use the Datafold REST API, you should first create a Datafold API key in Settings > Account.

## Install

Then, create your virtual environment for Python:

```
> python3 -m venv venv
> source venv/bin/activate
> pip install --upgrade pip setuptools wheel
```

Now, you're ready to install the Datafold SDK:

```
> pip install datafold-sdk
```

## Configure

Navigate in the Datafold UI to Settings > Integrations > CI.
After selecting `datafold-sdk` from the available options, complete configuration with the following information:

| Field Name | Description |
| --- | --- |
| Repository | Select the repository that generates the webhooks and where pull / merge requests will be raised. |
| Data Connection | Select the data connection where the code that is changed in the repository will run. |
| Name | An identifier used in Datafold to identify this CI configuration. |
| Files to ignore | If defined, the files matching the pattern will be ignored in the PRs. The pattern uses the syntax of .gitignore. Excluded files can be re-included by using the negation; re-included files can be later re-excluded again to narrow down the filter. |
| Mark the CI check as failed on errors | If the checkbox is disabled, the errors in the CI runs will be reported back to GitHub/GitLab as successes, to keep the check "green" and not block the PR/MR. By default (enabled), the errors are reported as failures and may prevent PR/MRs from being merged. |
| Require the `datafold` label to start CI | When this is selected, the Datafold CI process will only run when the 'datafold' label has been applied. This label needs to be created manually in GitHub or GitLab and the title or name must match 'datafold' exactly. |
| Sampling tolerance | The tolerance to apply in sampling for all data diffs. |
| Sampling confidence | The confidence to apply when sampling. |
| Sampling Threshold | Sampling will be disabled automatically if tables are smaller than the specified threshold. If unspecified, default values will be used depending on the Data Connection type. |

## Add commands to your custom orchestration

```bash theme={null}
export DATAFOLD_API_KEY=XXXXXXXXX

# only needed if your Datafold app url is not app.datafold.com
export DATAFOLD_HOST=
```

To submit diffs for a CI run, replace `ci_config_id`, `pr_num`, and `diffs_file` with the appropriate values for your CI configuration ID, pull request number, and the path to your diffs `JSON` file.

#### CLI

```bash theme={null}
datafold ci submit \
  --ci-config-id <ci_config_id> \
  --pr-num <pr_num> \
  --diffs <diffs_file>
```

#### Python

```python theme={null}
import os
from datafold_sdk.sdk.ci import run_diff

api_key = os.environ.get('DATAFOLD_API_KEY')
# Only needed if your Datafold app URL is not app.datafold.com
host = os.environ.get("DATAFOLD_HOST")

run_diff(host=host, api_key=api_key, ci_config_id=<ci_config_id>, pr_num=<pr_num>, diffs='<diffs_file>')
```

##### Example JSON format for diffs file

The `JSON` file should define the production and pull request tables to compare, along with any primary keys and columns to include or exclude in the comparison.

```json theme={null}
[
  {
    "prod": "YOUR_PROJECT.PRODUCTION_TABLE_A",
    "pr": "YOUR_PROJECT.PR_TABLE_NUM",
    "pk": ["ID"],
    "include_columns": ["Column1", "Column2"],
    "exclude_columns": ["Column3"]
  },
  {
    "prod": "YOUR_PROJECT.PRODUCTION_TABLE_B",
    "pr": "YOUR_PROJECT.PR_TABLE_NUM",
    "pk": ["ID"],
    "include_columns": ["Column1"],
    "exclude_columns": []
  }
]
```

---

# Source: https://docs.datafold.com/data-monitoring/monitors/data-diff-monitors.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.datafold.com/llms.txt
> Use this file to discover all available pages before exploring further.
# Data Diff Monitors

> Data Diff monitors compare datasets across or within databases, identifying row and column discrepancies with customizable scheduling and notifications.

## Ways to create a data diff monitor

There are 3 ways to create a data diff monitor:

1. From the **Monitors** page by clicking **Create new monitor** and then selecting **Data diff** as a type of monitor.
2. Clone an existing monitor by clicking **Actions** and then **Clone** in the header menu. This will pre-fill the form with the existing monitor configuration.
3. Create a monitor directly from the data diff results by clicking **Actions** and **Create monitor**. This will pre-fill the configuration with the parent data diff settings, requiring updates only for the **Schedule** and **Notifications** sections.

Once a monitor is created and initial metrics collected, you can set up [thresholds](/data-monitoring/monitors/data-diff-monitors#monitoring) for the two metrics.

## Create a new data diff monitor

Setting up a new diff monitor in Datafold is straightforward. You can configure it with the following parameters and options:

### General

Choose how you want to compare your data and whether the diff type is in-database or cross-database. Pick your data connections. Then, choose the two datasets you want to compare. This can be a table or a view in your relational database.

If you need to compare just a subset of data (e.g., for a particular city or last two weeks), add a SQL filter.

Select **Materialize inputs** to improve diffing speed when the query is compute-heavy, when filters are applied to non-indexed columns, or when primary keys are transformed using concatenation, coalesce, or another function.

### Column remapping

When columns are the same data type but are named differently, column remapping allows you to align and compare them. This is useful when datasets have semantically identical columns with different names, such as `userID` and `user_id`. Datafold will surface any differences under the column name used in Dataset A.

### Diff settings

#### Primary key

The primary key is one or more columns used to uniquely identify a row in the dataset during diffing. The primary key (or keys) does not need to be formally defined in the database or elsewhere as it is used for unique row identification during diffing. Multiple columns support compound primary key definitions.

#### Columns to compare

Determine whether to compare all columns or select specific one(s). To optimize performance on large tables, it's recommended to exclude columns known to have unique values for every row, such as timestamp columns like "updated\_at," or apply filters to limit the comparison scope.

#### Materialize diff results

Choose whether to store diff results in a table.

#### Sampling

Use this to compare a subset of your data instead of the entire dataset. This is best for assessing large datasets. There are two ways to enable sampling in Monitors: [Tolerance](#tolerance) and [% of Rows](#-of-rows).

**TIP** When should I use sampling tolerance instead of percent of rows? Each has its specific use cases and benefits; please [see the FAQ section](#sampling-tolerance-vs--of-rows) for a more detailed breakdown.

##### Tolerance

Tolerance defines the allowable margin of error for our estimate. It sets the acceptable percentage of rows with primary key errors (like nulls, duplicates, or primary keys exclusive to one dataset) before disabling sampling.

When sampling tolerance is enabled, not every row is examined, which introduces a probability of missing certain discrepancies. This threshold represents the level of difference we are willing to accept before considering the results unreliable and thereby disabling sampling. It essentially sets a limit on how much variance is tolerable in the sample compared to the complete dataset.

Default: 0.001%

###### Sampling confidence

Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset. It represents the minimum confidence level that the rate of primary key errors is below the threshold defined in sampling tolerance.

To put it simply, a 95% confidence level with a 5% tolerance means we are 95% certain that the true value falls within 5% of our estimate.

Default: 99%

###### Sampling threshold

Sampling will be disabled if the total row count of the largest table is less than the threshold value.

###### Sample size

This provides an estimated count of the total number of rows included in the combined sample from Datasets A and B, used for the diffing process. It's important to note that this number is an estimate and can vary from the actual sample size due to several factors:

* The presence of duplicate primary keys in the datasets will likely increase this estimate, as it inflates the perceived uniqueness of rows
* Applying filters to the datasets tends to reduce the estimate, as it narrows down the data scope

The number of rows we sample is not fixed; instead, we use a statistical approach called the Poisson distribution. This involves picking rows randomly from an infinite pool of rows with uniform random sampling. Importantly, we don't need to perform a full diff (compare every single row) to establish a baseline.

Example: Imagine there are two datasets we want to compare, Main and Test. Since we prefer not to check every row, we use a statistical approach to determine the number of rows to sample from each dataset. To do so, we set the following parameters:

* Sampling tolerance: 5%
* Sampling confidence: 95%

Sampling confidence reflects our level of certainty that our sample accurately represents the entire dataset, while sampling tolerance defines the allowable margin of error for our estimate. Here, with a 95% sampling confidence and a 5% sampling tolerance, we are 95% confident that the true value falls within 5% of our estimate. Datafold will then estimate the sample size needed (e.g., 200 rows) to achieve these parameters.

##### % of rows

Percent of rows sampling defines the proportion of the dataset to be included in the sample by specifying a percentage of the total number of rows. For example, setting the sampling percentage to 0.1% means that only 0.1% of the total rows will be sampled for analysis or comparison.

When percent of rows sampling is enabled, a fixed percentage of rows is selected randomly from the dataset. This method simplifies the sampling process, making it easy to understand and configure without needing to adjust complex statistical parameters. However, it lacks the statistical assurances provided by methods like sampling tolerance. It doesn't dynamically adjust based on data characteristics or discrepancies but rather adheres strictly to the specified percentage, regardless of the dataset's variability. This straightforward approach is ideal for scenarios where simplicity and quick setup are more important than precision and statistical confidence.
It provides a basic yet effective way to estimate the dataset's characteristics or differences, suitable for less critical data validation tasks. ###### Sampling rate This refers to the percentage of the total number of rows in the largest table that will be used to determine the sample size. This ensures that the sample size is proportionate to the size of the dataset, providing a representative subset for comparison. For instance, if the largest table contains 1,000,000 rows and the sampling rate is set to 1%, the sample size will be 10,000 rows. ###### Sampling threshold Sampling is automatically disabled when the total row count of the largest table in the comparison falls below a specified threshold value. This approach is adopted because, for smaller datasets, a complete dataset comparison is not only more feasible but also quicker and more efficient than sampling. Disabling sampling in these scenarios ensures comprehensive data coverage and provides more accurate insights, as it becomes practical to examine every row in the dataset without significant time or resource constraints. ###### Sampling size This parameter is the [same one used in sampling tolerance](#sample-size). ### Add a schedule You can choose to run your monitor daily, hourly, or even input a cron expression for more complex scheduling: ### Add notifications You can add notifications, sent through Slack or emails, which indicate whether a monitor has been executed. Notifications are sent when either or both predefined thresholds are reached during a Diff Monitor. You can set a maximum threshold for the: * Number of different rows * Percentage of different rows ## Results The diff monitor run history shows the results from each run. Each run includes basic stats, along with metrics such as: * The total rows different: number of different rows according to data diff results. * Rows with different values: percentage of different rows relative to the total number of rows in dataset A according to data diff results. Note that the status `Different` doesn't automatically map into a notification/alert. Click the **Open Diff** link for more granular information about a specific Data Diff. ## FAQ Use sampling tolerance when you need statistical confidence in your results, as it is more efficient and stops sampling once a difference is confidently detected. This method is ideal for critical data validation tasks that require precise accuracy. On the other hand, use the percent of rows method for its simplicity and ease of use, especially in less critical scenarios where you just need a straightforward, quick sampling approach without worrying about statistical parameters. This method is perfect for general, easy-to-understand sampling needs. If you have any questions about how to use Data Diff monitors, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com). --- # Source: https://docs.datafold.com/faq/data-diffing.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Data Diffing A [data diff](/data-diff/what-is-data-diff) is a value-level comparison between two tables—used to identify critical changes to your data and guarantee data quality. 
Similar to how git diff highlights changes in code by comparing different versions of files to show what lines have been added, modified, or deleted, a data diff compares rows and columns in two tables to pinpoint specific data changes. Datafold can compare data in tables, views, and SQL queries in databases and data lakes. Datafold facilitates data diffing by supporting a wide range of basic data types across popular database systems like Snowflake, Databricks, BigQuery, Redshift, and PostgreSQL. Datafold can also diff data across legacy warehouses like Oracle, SQL Server, Teradata, IBM Netezza, MySQL, and more. No, Datafold cannot perform data diffs on unstructured data such as files. However, it supports diffing structured and semi-structured data in tabular formats, including `JSON` columns. When comparing numerical columns or columns of the `FLOAT` type, it is beneficial to [set tolerance levels for differences](/data-diff/in-database-diffing/creating-a-new-data-diff#tolerance-for-floats) to avoid flagging inconsequential discrepancies. This practice ensures that only meaningful differences are highlighted, maintaining the focus on significant changes. When a change is detected, Datafold highlights the differences in the App or through PR comments, allowing data engineers and other users to review, validate, and approve these changes during the CI process. When diffing data within the same physical database or data lake namespace, data diff compares data by executing various SQL queries in the target database. It uses several JOIN-type queries and various aggregate queries to provide detailed insights into differences at the row, value, and column levels, and to calculate differences in metrics and distributions. Datafold connects to any SQL source and target databases, similar to how BI tools do. Datasets from both data connections are co-located in a centralized database to execute comparisons and identify specific rows, columns, and values with differences. To perform diffs at massive scale and increased speed, users can apply sampling, filtering, and column selection. Yes, while the Datafold App UI provides advanced exploration of diff results, you can also materialize these results back to your database. This allows you to further investigate with SQL queries or maintain audit logs, providing flexibility in how you handle and review diff outcomes. Teams may additionally choose to download diff results as a CSV directly from the Datafold App to share with their team members. --- # Source: https://docs.datafold.com/faq/data-migration-automation.md # Data Migration Automation Datafold performs complete SQL codebase translation and validation. It uses an AI agent architecture that performs the translation leveraging an LLM model with a feedback loop optimized for achieving full parity between migration source and target. DMA takes into account metadata, including schema, data types, and relationships in the source system. DMA offers several key advantages over deterministic transpilers that rely on static code parsing with predefined grammars: * **Full parity between source and target:** DMA not only returns code that compiles, but code that produces the same result in your new database with explicit validation. * **Flexible dialect handling:** Ability to adapt to any arbitrary dialect for input/output without the need to provide full grammar, which is especially valuable for numerous legacy systems and their versions. 
* **Self-correction capabilities:** DMA can self-correct mistakes, taking into account compilation errors and data discrepancies. * **Modernizing code structure:** DMA can convert convoluted stored procedures into dbt projects following best practices. Upon delivery, customers get a comprehensive report with links to data diffs validating parity and discrepancies (if any) on dataset-, column-, and row-level between source and target. Once source and target systems are connected and Datafold ingests the code base, translations with DMA are automatically supervised by the Datafold team. In most cases, no input is required from the customer. Connect source and target data sources to Datafold. Provide Datafold access to the codebase (usually by installing the Datafold GitHub/GitLab/ADO app or via system catalog for stored procedures). Datafold is SOC 2 Type II, GDPR, and HIPAA-compliant and provides flexible deployment options, including in-VPC deployment in AWS, GCP, or Azure. The LLM infrastructure relies on local models and does not expose data to any sub-processor besides the cloud provider. In case of a VPC deployment, none of the data leaves the customer’s private network. After the initial setup, the migration process can take several days to several weeks, depending on the source and target technologies, scale, and complexity. DMA is an ideal fit for lift-and-shift migrations with parity between source and target as the goal. Some customization is possible and needs to be scoped on a case-by-case basis. Datafold connects to any SQL source and target databases, similar to how BI tools do. Datasets from both data connections are co-located in a centralized database to execute comparisons and identify specific rows, columns, and values with differences. To perform diffs at massive scale and increased speed, users can apply sampling, filtering, and column selection. Datafold’s cross-database diffing will produce the following results: * **High-Level Summary:** * Total number of different rows * Total number of rows (primary keys) that are present in one database but not the other * Aggregate schema differences * **Schema Differences:** Per-column mapping of data types, column order, etc. * **Primary Key Differences:** Sample of specific rows that are present in one database but not the other * **Value-Level Differences:** Sample of differing column values for each column with identified discrepancies; full dataset of differences can be downloaded or materialized to the warehouse * Via Datafold’s interactive UI * Via the Datafold API * On schedule (as a monitor) with optional alerting via Slack, email, PagerDuty, etc. Yes, users can run as many diffs as they would like with concurrency limited by the underlying database. In such cases, we recommend using watermarking—diffing data within a specified time window of row creation/update (e.g., `updated_at timestamp`). Datafold performs best-effort type matching for cases where deterministic type casting is possible, e.g., comparing `VARCHAR` type with `STRING` type. When automatic type casting without information loss is not possible, the user can define type casting manually using diffing in Query mode. Users can reshape input datasets by writing a SQL query and diffing in Query mode to bring the dataset to a comparable shape. Datafold also supports column remapping for datasets with different column names between tables. 
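For instance, if a legacy source and a new target disagree on column names and types, each side can be reshaped into a common form with a SQL query before diffing in Query mode. The sketch below is illustrative only; the table and column names are hypothetical, and the exact casts and timestamp syntax depend on your source and target dialects:

```sql theme={null}
-- Hypothetical reshaping query for the legacy source side of a Query-mode diff.
SELECT
  CAST(ORDER_ID AS VARCHAR(36))       AS order_id,      -- align the key's type with the target
  CAST(ORDER_TOTAL AS DECIMAL(18, 2)) AS order_amount,  -- explicit cast where automatic casting could lose precision
  CREATED_TS                          AS created_at     -- rename to match the target's column naming
FROM legacy_schema.orders
-- Watermark: compare only rows last updated before a cutoff, so live replication
-- does not show up as spurious differences.
WHERE UPDATED_TS < TIMESTAMP '2024-01-01 00:00:00';
```

An equivalent query on the target side, written with its own names and types, gives Datafold two result sets with matching columns and a shared primary key to diff.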
--- # Source: https://docs.datafold.com/faq/data-monitoring-observability.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Data Monitoring and Observability Most data observability tools focus on monitoring metrics (e.g., null counts, row counts) in the data warehouse. But catching data quality issues in the data warehouse is usually too late: the bad data has already affected downstream processes and negatively impacted the business. Our platform focuses on prevention rather than detection of data quality issues. By [integrating deeply into your CI process](/deployment-testing/how-it-works), Datafold's [Data Diff](/data-diff/what-is-data-diff) helps data teams fix potential regressions during development and deployment, before bad code and data get into the production environment. Our [Data Monitors](/data-monitoring/monitor-types) make it easy to monitor production data to catch issues early before they are propagated through the warehouse to business stakeholders. This proactive data quality strategy not only enhances the reliability and accuracy of your data pipelines but also reduces the risk of disruptions and the need for reactive troubleshooting. --- # Source: https://docs.datafold.com/faq/data-reconciliation.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Data Reconciliation Datafold connects to any SQL source and target databases, similar to how BI tools do. Datasets from both data connections are co-located in a centralized database to execute comparisons and identify specific rows, columns, and values with differences. To perform diffs at massive scale and increased speed, users can apply sampling, filtering, and column selection. Datafold’s cross-database diffing will produce the following results: 1. High-Level Summary: * Total number of different rows * Total number of rows (primary keys) that are present in one database, but not the other * Aggregate schema differences 2. Schema Differences: Per-column mapping of data types, column order, etc. 3. Primary Key Differences: Sample of specific rows that are present in one database, but not the other 4. Value-Level Differences: Sample of differing values for each column with identified discrepancies; full dataset of differences can be downloaded or materialized to the warehouse You can check out [what the results look like in the App](/data-diff/cross-database-diffing/results). 1. Via Datafold’s interactive UI 2. Via the Datafold API 3. On a schedule (as a monitor) with optional alerting via Slack, email, PagerDuty, etc. Yes, users can run as many diffs as they would like with concurrency limited by the underlying database. In such cases, we recommend using watermarking – diffing data within a specified time window of row creation / update (e.g. `updated_at timestamp`). Datafold performs best-effort type matching for cases when deterministic type casting is possible, e.g. comparing `VARCHAR` type with `STRING` type. When automatic type casting without information loss is not possible, the user can define type casting manually using diffing in Query mode. Yes, users can reshape the input dataset by writing a SQL query and diffing in Query mode to bring the dataset to a shape that can be compared with another. 
Datafold also supports column remapping for datasets with different column names between tables. To make the provisioning at scale easier, you can create data diffs via the [Datafold API](https://docs.datafold.com/reference/cloud/rest-api). --- # Source: https://docs.datafold.com/faq/data-storage-and-security.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Data Storage and Security Datafold ingests and stores various types of data to ensure accurate data quality checks and insights: * **Metadata**: This includes table names, column names, and queries executed in the data warehouse. * **Data for Data Diffs**: * For **in-database diffs**, all data visible in the app, including data samples, is fetched and stored. * For **cross-database diffs**, all data visible in the app, including data samples, is fetched and stored. Larger amounts of data are fetched for comparison purposes, but only data samples are stored. * **Table Profiling in Data Explorer**: Datafold stores samples and distributions of data to provide detailed profiling. --- # Source: https://docs.datafold.com/data-monitoring/monitors/data-test-monitors.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Data Test Monitors > Data Tests validate your data against off-the-shelf checks or custom business rules. Data Test monitors allow you to validate your data using off-the-shelf checks for non-null or unique values, numeric ranges, accepted values, referential integrity, and more. Custom tests let you write custom SQL queries to validate your own business rules. Think of Data Tests as pass/fail—either a test returns no records (pass) or it returns at least one record (fail). Failed records are viewable in the app, materialized to a temporary table in your warehouse, and can even be [attached to notifications as a CSV](/data-monitoring/monitors/data-test-monitors#attach-csvs-to-notifications). ## Create a Data Test monitor There are two ways to create a Data Test monitor: 1. Open the **Monitors** page, select **Create new monitor**, and then choose **Data Test**. 2. Clone an existing Data Test monitor by clicking **Actions** and then **Clone**. This will pre-fill the form with the existing monitor configuration. ## Set up your monitor Select your data connection, then choose whether you'd like to use a [Standard](/data-monitoring/monitors/data-test-monitors#standard-data-tests) or [Custom](/data-monitoring/monitors/data-test-monitors#custom-data-tests) test. ### Standard Data Tests Standard tests allow you to validate your data against off-the-shelf checks for non-null or unique values, numeric ranges, accepted values, referential integrity, and more. After choosing your data connection, select **Standard** and the specific test that you'd like to run. If you don't see the test you're looking for, you can always write a [Custom test](/data-monitoring/monitors/data-test-monitors#custom-data-tests). #### Quoting variables Some test types (e.g. accepted values) require you to provide one or more values, which you may want to have quoted in the final SQL. The **Quote** flag, which is enabled by default, allows you to control this behavior. Here's an example. 
Quoting **enabled** for `EXAMPLE_VALUE` (default):

```sql theme={null}
SELECT * FROM DB.SCHEMA.TABLE1 WHERE "COLUMN1" < 'EXAMPLE_VALUE';
```

Quoting **disabled** for `EXAMPLE_VALUE`:

```sql theme={null}
SELECT * FROM DB.SCHEMA.TABLE1 WHERE "COLUMN1" < EXAMPLE_VALUE;
```

### Custom Data Tests

When you need to test something that's not available in our [Standard tests](/data-monitoring/monitors/data-test-monitors#standard-data-tests), you can write a Custom test. Select your data connection, choose **Custom**, then write your SQL query.

Importantly, keep in mind that your query should return records that *fail* the test. Here are some examples to illustrate this.

**Custom business rule**

Say your company defines active users as individuals who have signed into your application at least 3 times in the past week. You could write a test that validates this logic by checking for users marked as active who *haven't* reached this threshold:

```sql theme={null}
SELECT * FROM users WHERE status = 'active' AND signins_last_7d < 3;
```

**Data formatting**

If you wanted to validate that all phone numbers in your contacts table are 10 digits and only contain numbers, you'd return records that are not 10 digits or use non-numeric characters:

```sql theme={null}
SELECT * FROM contacts WHERE LENGTH(phone_number) != 10 OR phone_number REGEXP '[^0-9]';
```

## Add a schedule

You can choose to run your monitor daily, hourly, or even input a cron expression for more complex scheduling:

## Add notifications

Receive notifications via Slack or email when at least one record fails your test:

## Attach CSVs to notifications

Datafold allows attaching a CSV of failed records to Slack and email notifications. This is useful if, for example, you have business users who don't have a Datafold license but need to know about records that fail your tests. This option is configured separately per notification destination as shown here:

Attach CSVs to Data Tests notifications

CSV attachments are limited to the lesser of 1,000 rows or 1 MB in file size.

### Attaching CSVs in Slack

In order to attach CSVs to Slack notifications, you need to complete 1-2 additional steps:

1. If you installed the Datafold Slack app prior to October 2024, you'll need to reinstall the app by visiting Settings > Integrations > Notifications, selecting your Slack integration, then **Reinstall Slack integration**.
2. Invite the Datafold app to the channel you wish to send notifications to using the `/invite` command shown below:

Invite Datafold app to Slack channel

## Run Tests in CI

Standard Data Tests run on a schedule against your production data. But often it's useful to test data before it gets to production as part of your deployment workflow. For this reason, Datafold supports running tests in CI.

Data Tests in CI work very similarly to our [Monitors as Code](/data-monitoring/monitors-as-code) feature, in the sense that you define your tests in a version-controlled YAML file. You then use the Datafold SDK to execute those tests as part of your CI workflow.

### Write your tests

First, create a new file (e.g. `tests.yaml`) in the root of your repository. Then write your tests using the same format described in our [Monitors as Code](/data-monitoring/monitors-as-code) docs with two exceptions:

1. Add a `run_in_ci` flag to each test and set it to `true` (assuming you'd like to run the test)
2. (Optional) Add placeholders for variables that you'd like to populate dynamically when executing your tests

Here's an example:

```yaml theme={null}
monitors:
  null_pk_test:
    type: test
    name: No NULL pk in the users table
    run_in_ci: true
    connection_id: 8
    query: select * from {{ schema }}.USERS where id is null
  duplicate_pk_test:
    type: test
    name: No duplicate pk in the users table
    run_in_ci: true
    connection_id: 8
    query: |
      select * from {{ schema }}.USERS
      where id in (
        select id from {{ schema }}.USERS group by id having count(*) > 1
      );
```

### Execute your tests

**INFO** This section describes how to get started with GitHub Actions, but the same concepts apply to other hosted version control platforms like GitLab and Bitbucket. Contact us if you need help getting started.

If you're using GitHub Actions, create a new YAML file under `.github/workflows/` using the following template. Be sure to tailor it to your particular setup:

```yaml theme={null}
on:
  push:
    branches:
      - main
  pull_request:

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/checkout@v2
        with:
          token: ${{ secrets.GH_TOKEN }}
          repository: datafold/datafold-sdk
          path: datafold-sdk
          ref: data-tests-in-ci-demo
      - uses: actions/setup-python@v2
        with:
          python-version: '3.12'
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
          pip install -r requirements.txt
      - name: Set schema env var in PR
        run: |
          echo "SCHEMA=ANALYTICS.PR" >> $GITHUB_ENV
        if: github.event_name == 'pull_request'
      - name: Set schema env var in main
        run: |
          echo "SCHEMA=ANALYTICS.CORE" >> $GITHUB_ENV
        if: github.event_name == 'push'
      - name: Run tests
        run: |
          datafold tests run --var schema:$SCHEMA --ci-config-id 1 tests.yaml # use the correct file name/path
        env:
          DATAFOLD_HOST: https://app.datafold.com # different for dedicated deployments
          DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }} # remember to add to secrets
```

### View the results

When your CI workflow is triggered (e.g. by a pull request), you can view the terminal output for your test results:

## Need help?

If you have any questions about how to use Data Test monitors, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com).

---

# Source: https://docs.datafold.com/security/database-oauth.md

# Database OAuth

> Datafold enables secure workflows like data diffs through OAuth, ensuring compliance with user-specific database permissions.

To improve data security and privacy, Datafold supports running workflows like data diffs through OAuth. This ensures queries are executed using the user's own database credentials, fully complying with granular access controls like data masking and object-level permissions.

The diagram below illustrates how the authentication flow proceeds:

1. Users authenticate using the configured OAuth provider.
2. Users can then create diffs between data sets that their user can access using OAuth database permissions.
3. During Continuous Integration (CI), Datafold executes diffs using a Service Account with the least privileges, thus masking sensitive/PII data.
4. If a user needs to see sensitive/PII data from a CI diff, and they have permission via OAuth to do so, they can rerun the diff, and then Datafold will authenticate the user using OAuth database permissions. Then, the user will have access to the data based on these permissions.

This structure ensures that diffs are executed with the user's database credentials with their configured roles and permissions.
Data access permissions are thus fully managed by the database, and Datafold only passes through queries. --- # Source: https://docs.datafold.com/integrations/databases.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Set Up Your Data Connection > Set up your Data Connection with Datafold. **NOTE** To set up your Data Connection, navigate to **Settings** → **Data Connection** and click **Add New Integration**. --- # Source: https://docs.datafold.com/integrations/databases/databricks.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Databricks **Steps to complete:** 1. [Generate a Personal Access Token](/integrations/databases/databricks#generate-a-personal-access-token) 2. [Retrieve SQL warehouse settings](/integrations/databases/databricks#retrieve-sql-warehouse-settings) 3. [Create schema for Datafold](/integrations/databases/databricks#create-schema-for-datafold) 4. [Configure your data connection in Datafold](/integrations/databases/databricks#configure-in-datafold) ## Generate a Personal Access Token Visit **Settings** → **User Settings**, and then switch to the **Personal Access Tokens** tab. Then, click **Generate new token**. Save the generated token somewhere; you'll need it later on. ## Retrieve SQL warehouse settings In **SQL** mode, navigate to **SQL Warehouses**. Choose the preferred warehouse and copy the following field values from its **Connection Details** tab: * Server hostname * HTTP path ## Create schema for Datafold Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in your warehouse. ## Configure in Datafold | Field Name | Description | | ---------------------------- | ------------------------------------------------- | | Name | A name given to the data connection within Datafold | | Host | The hostname retrieved in the Connection Details tab | | HTTP Path | The HTTP Path retrieved in the Connection Details tab | | Access Token | The token retrieved in [Generate a Personal Access Token](/integrations/databases/databricks#generate-a-personal-access-token) | | Catalog | The catalog and schema name of your Databricks account. Formatted as catalog\_name.schema\_name (In most cases, catalog\_name is hive\_metastore.) | | Dataset for temporary tables | Certain operations require Datafold to materialize intermediate results, which are stored in a dedicated schema. The input for this field should be in the catalog\_name.schema\_name format. (In most cases, catalog\_name is hive\_metastore.) | Click **Create**. Your data connection is ready! --- # Source: https://docs.datafold.com/api-reference/datafold-api.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Datafold API The Datafold API reference is a guide to our available endpoints and...
--- # Source: https://docs.datafold.com/datafold-deployment/datafold-deployment-options.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Deployment Options > Datafold is a web-based application with multiple deployment options, including multi-tenant SaaS and dedicated cloud (either customer- or Datafold-hosted). ## SaaS / Multi-Tenant Our standard multi-tenant deployment is a cost-efficient option for most teams and is available in two AWS regions: | Region Name | Region | Sign-Up Page | | :--------------- | :---------- | :----------------------------------------------- | | US West (Oregon) | `us-west-2` | [https://app.datafold.com/org-signup](https://app.datafold.com/org-signup) | | Europe (Ireland) | `eu-west-1` | [https://eu.datafold.com/org-signup](https://eu.datafold.com/org-signup) | For additional security, we provide the following options: 1. [IP Whitelisting](/security/securing-connections#ip-whitelisting): only allow access to specific IP addresses 2. [AWS PrivateLink](/security/securing-connections#private-link): set up a limited network point to access your RDS in the same region 3. [VPC Peering](/security/securing-connections#vpc-peering-saas): securely join two networks together 4. [SSH Tunnel](/security/securing-connections#ssh-tunnel): set up a secure tunnel between your network and Datafold with the SSH server on your side 5. [IPSec Tunnel](/security/securing-connections#ipsec-tunnel): an IPSec tunnel setup ## Dedicated Cloud We also offer a single-tenant deployment of the Datafold application in a dedicated Virtual Private Cloud (VPC). The options are (from least to most complex): 1. **Datafold-hosted, Datafold-managed**: the Cloud account belongs to Datafold and we manage the Datafold application for you. 2. **Customer-hosted, Datafold-managed**: the Cloud account belongs to you, but we manage the Datafold application for you. 3. **Customer-hosted, Customer-managed**: the Cloud account belongs to you and you manage the Datafold application. Datafold does not have access. Dedicated Cloud can be deployed to all major cloud providers: * [AWS](/datafold-deployment/dedicated-cloud/aws) * [GCP](/datafold-deployment/dedicated-cloud/gcp) * [Azure](/datafold-deployment/dedicated-cloud/azure) **VPC vs. VNet** We use the term VPC across all major cloud providers. However, Azure refers to this concept as a Virtual Network (VNet). ### Datafold Dedicated Cloud FAQ Dedicated Cloud deployment may be the preferred deployment method for customers with special privacy and security concerns and in highly regulated domains. In a Dedicated Cloud deployment, the entire Datafold stack runs on dedicated cloud infrastructure and network, which usually means it is: 1. Not accessible to the public Internet (sits behind customer's VPN) 2. Uses an internal network to communicate with customer's databases and other resources – none of the data is sent using public networks Datafold is deployed to the customer's cloud infrastructure but is fully managed by Datafold. The only DevOps involvement needed from the customer's side is to set up a cloud project and role (steps #1 and #2 below). 1. Customer creates a Datafold-specific namespace in their cloud account (subaccount in AWS / project in GCP / subscription or resource group in Azure) 2.
Customer creates a Datafold-specific IAM resource with permissions to deploy the Datafold-specific namespace 3. Datafold Infrastructure team provisions the Datafold stack on the customer's infrastructure using fully-automated procedure with Terraform 4. Customer and Datafold Infrastructure teams collaborate to implement the security and networking requirements, see [all available options](#additional-security-dedicated-cloud) See cloud-specific instructions here: * [AWS](/datafold-deployment/dedicated-cloud/aws) * [GCP](/datafold-deployment/dedicated-cloud/gcp) * [Azure](/datafold-deployment/dedicated-cloud/azure) After the initial deployment, the Datafold team uses the same procedure to roll out software updates and perform maintenance to keep the uptime SLA. Datafold is deployed in the customer's region of choice on AWS, GCP, or Azure that is owned and managed by Datafold. We collaborate to implement the security and networking requirements ensuring that traffic either does not cross the public internet or, if it does, does so securely. All available options are listed below. This deployment method follows the same process as the standard customer-hosted deployment (see above), but with a key difference: the customer is responsible for managing both the infrastructure and the application. Datafold engineers do not have any access to the deployment in this case. We offer open-source projects that facilitate this deployment, with examples for every major cloud provider. You can find these projects on GitHub: * [AWS](https://github.com/datafold/terraform-aws-datafold) * [GCP](https://github.com/datafold/terraform-google-datafold) * [Azure](https://github.com/datafold/terraform-azure-datafold) Each of these projects uses a Helm chart for deploying the application. The Helm chart is also available on GitHub: * [Helm Chart](https://github.com/datafold/helm-charts) By providing these open-source projects, Datafold enables you to integrate the deployment into your own infrastructure, including existing clusters. This allows your infrastructure team to manage the deployment effectively. **Deployment Secrets:** Datafold provides the necessary secrets for downloading images as part of the license agreement. Without this agreement, the deployment will not complete successfully. Because the Datafold application is deployed in a dedicated VPC, your databases/integrations are not directly accessible when they are not exposed to the public Internet. The following solutions enable secure connections to your databases/integrations without exposing them to the public Internet: 1. [PrivateLink](/security/securing-connections?current-cloud=aws#private-link "PrivateLink") 2. [VPC Peering](/security/securing-connections#vpc-peering-dedicated-cloud "VPC Peering") 3. [SSH Tunnel](/security/securing-connections#ssh-tunnel "SSH Tunnel") 4. [IPSec Tunnel](/security/securing-connections#ipsec-tunnel "IPSec Tunnel") 1. [Private Service Connect](/security/securing-connections?current-cloud=gcp#private-link "Private Service Connect") 2. [VPC Peering](/security/securing-connections#vpc-peering-dedicated-cloud "VPC Peering") 3. [SSH Tunnel](/security/securing-connections#ssh-tunnel "SSH Tunnel") 1. [Private Link](/security/securing-connections?current-cloud=azure#private-link "Private Link") 2. [VNet Peering](/security/securing-connections#vpc-peering-dedicated-cloud "VNet Peering") 3. 
[SSH Tunnel](/security/securing-connections#ssh-tunnel "SSH Tunnel") Please inquire with [sales@datafold.com](mailto:sales@datafold.com) about customer-managed deployment options. --- # Source: https://docs.datafold.com/data-migration-automation/datafold-migration-agent.md # Datafold Migration Agent > Automatically migrate data environments of any scale and complexity with Datafold's Migration Agent. Datafold provides a full-cycle migration automation solution for data teams, which includes code translation and cross-database reconciliation. ## How does DMA work? Datafold performs complete SQL codebase translation and validation using an AI-powered architecture. This approach leverages a large language model (LLM) with a feedback loop optimized for achieving full parity between the migration source and target. DMA analyzes metadata, including schema, data types, and relationships, to ensure accuracy in translation. datafold migration agent architecture Datafold provides a comprehensive report at the end of the migration. This report includes links to data diffs validating parity and highlighting any discrepancies at the dataset, column, and row levels between the source and target databases. ## Why migrate with DMA? Unlike traditional deterministic transpilers, DMA offers several distinct benefits: * **Full parity between source and target:** DMA ensures not just code that compiles, but code that delivers the same results in your new database, complete with explicit validation. * **Flexible dialect handling:** DMA can adapt to any arbitrary input/output dialect without requiring a full grammar definition, which is especially valuable for legacy systems. * **Self-correction capabilities:** The AI-driven DMA can account for and correct mistakes based on both compilation errors and data discrepancies. * **Modernizing code structure:** DMA can convert complex stored procedures into clean, modern formats such as dbt projects, following best practices. ## Getting started with DMA **Want to learn more?** If you're interested in diving deeper, please take a moment to [fill out our intake form](https://nw1wdkq3rlx.typeform.com/to/VC2TbEbz) to connect with the Datafold team. 1. Connect your source and target data sources to Datafold. 2. Provide Datafold access to your codebase, typically by installing the Datafold GitHub/GitLab/ADO app or via system catalog access for stored procedures. Once you connect your source and target systems and Datafold ingests the codebase, DMA's translation process is supervised by the Datafold team. In most cases, no additional input is required from the customer. The migration process timeline depends on the technologies, scale, and complexity of the migration. After setup, migrations typically take several days to several weeks. ## Security Datafold is SOC 2 Type II, GDPR, and HIPAA-compliant. We offer flexible deployment options, including in-VPC setups in AWS, GCP, or Azure. The LLM infrastructure is local, ensuring no data is exposed to external subprocessors beyond the cloud provider. For VPC deployments, data stays entirely within the customer’s private network. ## FAQ For more information, please see our extensive [FAQ section](../faq/data-migration-automation). 
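To make the "stored procedures to dbt" conversion described above more concrete, here is a deliberately small, hypothetical sketch (illustrative only, not actual DMA output; the `raw.orders` source, table, and column names are invented): an imperative truncate-and-reload procedure restructured as a declarative dbt model whose materialization is handled by dbt.

```sql
-- Legacy pattern: an imperative stored procedure that rebuilds a summary table,
-- shown here as comments for contrast:
--
--   CREATE PROCEDURE refresh_daily_orders AS
--   BEGIN
--     TRUNCATE TABLE analytics.daily_orders;
--     INSERT INTO analytics.daily_orders
--     SELECT CAST(order_ts AS DATE) AS order_date, COUNT(*) AS order_count
--     FROM raw.orders
--     GROUP BY CAST(order_ts AS DATE);
--   END;

-- Modernized form: the same logic as a declarative dbt model (models/daily_orders.sql);
-- dbt handles materialization and dependency ordering.
{{ config(materialized='table') }}

select
    cast(order_ts as date) as order_date,
    count(*)               as order_count
from {{ source('raw', 'orders') }}
group by cast(order_ts as date)
```

Because each translated model is a standalone relation, its output can be reconciled against the table produced by the original procedure with a cross-database data diff, which is how parity is reported at the dataset, column, and row levels.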
--- # Source: https://docs.datafold.com/data-migration-automation/datafold-migration-automation.md # Datafold for Migration Automation > Datafold provides full-cycle migration automation with SQL code translation and cross-database validation for data warehouse, transformation framework, and hybrid migrations. Datafold offers flexible migration validation options to fit your data migration workflow. Data teams can choose to leverage the full power of the [Datafold Migration Agent (DMA)](../data-migration-automation/datafold-migration-agent) alongside [cross-database diffing](../data-diff/how-datafold-diffs-data#how-cross-database-diffing-works), or use ad-hoc diffing exclusively for validation. ## Supported migrations Datafold supports a wide range of migrations to meet the needs of modern data teams. The platform enables smooth transitions between different databases and transformation frameworks, ensuring both code translation and data validation throughout the migration process. Datafold can handle: * **Data Warehouse Migrations:** Seamlessly migrate between data warehouses, for example, from PostgreSQL to Databricks. * **Data Transformation Framework Migrations:** Transition your transformation framework from legacy stored procedures to modern tools like dbt. * **Hybrid Migrations:** Migrate across a combination of data platforms and transformation frameworks. For example, moving from MySQL + stored procedures to Databricks + dbt. ## Migration options The AI-powered Datafold Migration Agent (DMA) provides automated SQL code translation and validation to simplify and automate data migrations. Teams can pair DMA with ad-hoc cross-database diffing to enhance the validation process with additional manual checks when necessary. **How it works:** * **Step 1:** Connect your legacy and new databases to Datafold, along with your codebase. * **Step 2:** DMA translates and validates SQL code automatically. * **Step 3:** Pair the DMA output with ad-hoc cross-database diffing to reconcile data between legacy and new databases. This combination streamlines the migration process, offering automatic validation with the flexibility of manual diffing for fine-tuned control. For teams that prefer to handle code translation manually or are working with third-party migrations, Datafold's ad-hoc cross-database diffing is available as a stand-alone validation tool. **How it works:** * Validate data across databases manually without using DMA for code translation. * Run ad-hoc diffing as needed, via the [Datafold REST API](../api-reference/introduction), or schedule it with [Monitors](../data-monitoring) for continuous validation. This option gives you full control over the migration validation process, making it suitable for in-house or outsourced migrations. --- # Source: https://docs.datafold.com/api-reference/datafold-sdk.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. 
# Datafold SDK The Datafold SDK allows you to accomplish certain actions using a thin programmatic wrapper around the Datafold REST API, in particular: * **Custom CI Integrations**: Submitting information to Datafold about what tables to diff in CI * **dbt CI Integrations**: Submitting dbt artifacts via CI runner * **dbt development**: Kick off data diffs from the command line while developing in your dbt project ## Install First, create and activate your virtual environment for Python: ``` python3 -m venv venv source venv/bin/activate pip install --upgrade pip setuptools wheel ``` Now, you're ready to install the Datafold SDK: ``` pip install datafold-sdk ``` #### CLI environment variables To use the Datafold CLI, you need to set up some environment variables: ```bash theme={null} export DATAFOLD_API_KEY=XXXXXXXXX ``` If your Datafold app URL is different from the default `app.datafold.com`, set the custom domain as the variable: ```bash theme={null} export DATAFOLD_HOST=<your-datafold-host> ``` ## Custom CI Integrations Please follow [our CI orchestration docs](../integrations/orchestrators/custom-integrations) to set up a custom CI integration leveraging the Datafold SDK. ## dbt Core CI Integrations When you set up Datafold CI diffing for a dbt Core project, we rely on the submission of `manifest.json` files to represent the production and staging versions of your dbt project. Please see our detailed docs on how to [set up Datafold in CI for dbt Core](../integrations/orchestrators/dbt-core), and reach out to our team if you have questions. #### CLI ```bash theme={null} datafold dbt upload \ --ci-config-id <ci-config-id> \ --run-type <run-type> \ --target-folder <target-folder> \ --commit-sha <commit-sha> ``` #### Python ```python theme={null} import os from datafold_sdk.sdk.dbt import submit_artifacts api_key = os.environ.get('DATAFOLD_API_KEY') # only needed if your Datafold app url is not app.datafold.com host = os.environ.get("DATAFOLD_HOST") submit_artifacts(host=host, api_key=api_key, ci_config_id=<ci-config-id>, run_type='<run-type>', target_folder='<target-folder>', commit_sha='<commit-sha>') ``` ## Diffing dbt models in development It can be beneficial to diff between two dbt environments before opening a pull request. This can be done using the Datafold SDK from the command line: ```bash theme={null} datafold diff dbt ``` That command will compare data between your development and production environments. By default, all models that were built in the previous `dbt run` or `dbt build` command will be compared. ### Running Data Diffs before opening a pull request It can be helpful to view Data Diff results in your ticket before creating a pull request. This enables faster code reviews by letting developers QA changes earlier. To do this, you can create a draft PR and run the following command: ``` dbt run && datafold diff dbt ``` This executes dbt locally and triggers a Data Diff to preview data changes without committing to Git. To automate this workflow, see our guide [here](/faq/datafold-with-dbt#can-i-run-data-diffs-before-opening-a-pr). ### Update your dbt\_project.yml with configurations #### Option 1: Add variables to the `dbt_project.yml` ```yaml theme={null} # dbt_project.yml vars: data_diff: prod_database: my_default_database # default database for the prod target prod_schema: my_default_schema # default schema for the prod target prod_custom_schema: PROD_<custom_schema> # Optional: see dropdown below ``` **Additional schema variable details** The value for `prod_custom_schema:` will vary based on how you have set up dbt.
This variable is used when a model has a custom schema and becomes ***dynamic*** when the string literal `<custom_schema>` is present. The `<custom_schema>` substring is replaced with the custom schema for the model in order to support the various ways schema name generation can be overridden here -- also referred to as "advanced custom schemas". **Examples (not exhaustive)** **Single production schema** *If your prod environment looks like this ...* ```bash theme={null} PROD.ANALYTICS ``` *... your data-diff configuration should look like this:* ```yaml theme={null} vars: data_diff: prod_database: PROD prod_schema: ANALYTICS ``` **Some custom schemas in production with a prefix like "prod\_"** *If your prod environment looks like this ...* ```bash theme={null} PROD.ANALYTICS PROD.PROD_MARKETING PROD.PROD_SALES ``` *... your data-diff configuration should look like this:* ```yaml theme={null} vars: data_diff: prod_database: PROD prod_schema: ANALYTICS prod_custom_schema: PROD_<custom_schema> ``` **Some custom schemas in production with no prefix** *If your prod environment looks like this ...* ```yaml theme={null} PROD.ANALYTICS PROD.MARKETING PROD.SALES ``` *... your data-diff configuration should look like this:* ```yaml theme={null} vars: data_diff: prod_database: PROD prod_schema: ANALYTICS prod_custom_schema: <custom_schema> ``` #### Option 2: Specify a production `manifest.json` using `--state` **Using the `--state` option is highly recommended for dbt projects with multiple target database and schema configurations. For example, if you customized the [`generate_schema_name`](https://docs.getdbt.com/docs/build/custom-schemas#understanding-custom-schemas) macro, this is the best option for you.** > Note: `dbt ls` is preferred over `dbt compile` as it runs faster and data diffing does not require fully compiled models to work. ```bash theme={null} dbt ls -t prod # compile a manifest.json using the "prod" target mv target/manifest.json prod_manifest.json # move the file up a directory and rename it to prod_manifest.json dbt run # run your entire dbt project or only a subset of models with `dbt run --select <model>` data-diff --dbt --state prod_manifest.json # run data-diff to compare your development results to the production database/schema results in the prod manifest ``` #### Add your Datafold data connection integration ID to your dbt\_project.yml To connect to your database, navigate to **Settings** → **Integrations** → **Data connections** and click **Add new integration** and follow the prompts. After you **Test and Save**, add the ID (which can be found on Integrations > Data connections) to your **dbt\_project.yml**. ```yaml theme={null} # dbt_project.yml vars: data_diff: ... datasource_id: ``` The following optional arguments are available: | Options | Description | | ---------------------------------- | ------------------------------------------------------------------------------ | | `--version` | Print version info and exit. | | `-w, --where EXPR` | An additional 'where' expression to restrict the search space. Beware of SQL Injection! | | `--dbt-profiles-dir PATH` | Which directory to look in for the `profiles.yml` file. If not set, we follow the default `profiles.yml` location for the dbt version being used. Can also be set via the `DBT_PROFILES_DIR` environment variable. | | `--dbt-project-dir PATH` | Which directory to look in for the `dbt_project.yml` file.
Default is the current working directory and its parents. | | `--select SELECTION or MODEL_NAME` | Select dbt resources to compare using dbt selection syntax in dbt versions >= 1.5. In versions \< 1.5, it will naively search for a model with `MODEL_NAME` as the name. | | `--state PATH` | Specify manifest to utilize for 'prod' comparison paths instead of using configuration. | | `-pd, --prod-database TEXT` | Override the dbt production database configuration within `dbt_project.yml`. | | `-ps, --prod-schema TEXT` | Override the dbt production schema configuration within `dbt_project.yml`. | | `--help` | Show this message and exit. | --- # Source: https://docs.datafold.com/faq/datafold-with-dbt.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Integrating Datafold with dbt You need Datafold in addition to dbt tests because while dbt tests are effective for validating specific assertions about your data, they can't catch all issues, particularly unknown unknowns. Datafold identifies value-level differences between staging and production datasets, which dbt tests might miss. Unlike dbt tests, which require manual configuration and maintenance, Datafold automates this process, ensuring continuous and comprehensive data quality validation without additional overhead. This is all embedded within Datafold’s unified platform that offers end-to-end data quality testing with our [Column-level Lineage](/data-explorer/lineage) and [Data Monitors](/data-monitoring/monitor-types). Hence, we recommend combining dbt tests with Datafold to achieve complete test coverage that addresses both known and unknown data quality issues, providing a robust safeguard against potential data integrity problems in your CI pipeline. For dbt Core users, create an integration in Datafold, specify the necessary settings, obtain a Datafold API Key and CI config ID, and configure your CI scripts with the Datafold SDK to upload manifest.json files. Our detailed setup guide [can be found here](/integrations/orchestrators/dbt-core). For dbt Cloud users, set up dbt Cloud CI to run Pull Request jobs and create an Artifacts Job that generates production manifest.json on merges to main/master. Obtain your dbt Cloud access URL and a Service Token, then create a dbt Cloud integration in Datafold using these credentials. Configure the integration with your repository, data connection, primary key tag, and relevant jobs. Our detailed setup guide [can be found here](/integrations/orchestrators/dbt-cloud). Yes, Datafold is fully compatible with the custom PR schema created by dbt Cloud for Slim CI jobs. We outline effective strategies for efficient and scalable data diffing in our [performance and scalability guide](/faq/performance-and-scalability#how-can-i-optimize-diff-performance-at-scale). For dbt-specific diff performance, you can exclude certain columns or tables from data diffs in your CI/CD pipeline by adjusting the **Advanced settings** in your Datafold CI/CD configuration. This helps reduce processing load by focusing diffs on only the most relevant columns. Some teams want to show Data Diff results in their tickets *before* creating a pull request. This speeds up code reviews as developers can QA code changes before requesting a PR review.
You can trigger a Data Diff by first creating a **draft PR** and then running the following command via the CLI: ```bash theme={null} dbt run && datafold diff dbt ``` This command runs `dbt` locally and then triggers a Data Diff, allowing you to preview data changes without pushing to Git. To automate this process of kicking off a Data Diff before pushing code to git, we recommend creating a GitHub Actions job for draft PRs. For example: ``` name: Data Diff on draft dbt PR on: pull_request: types: [opened, reopened, synchronize] branches: - '!main' jobs: run: if: github.event.pull_request.draft == true # Run only on draft PRs runs-on: ubuntu-latest steps: - name: Checkout Code uses: actions/checkout@v2 - name: Set Up Python uses: actions/setup-python@v2 with: python-version: '3.8' - name: Install requirements run: pip install -r requirements.txt - name: Install dbt dependencies run: dbt deps # Update with your S3 bucket details - name: Grab production manifest from S3 run: | aws s3 cp s3://advanced-ci-manifest-demo/manifest.json ./manifest.json env: AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }} AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }} AWS_REGION: us-east-1 - name: Run dbt and Data Diff env: DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }} run: | dbt run datafold diff dbt # Optional: Submit artifacts to Datafold for more analysis or logging - name: Submit artifacts to Datafold run: | set -ex datafold dbt upload --ci-config-id 350 --run-type pull_request --commit-sha ${GIT_SHA} env: DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }} GIT_SHA: "${{ github.event.pull_request.head.sha }}" ``` --- # Source: https://docs.datafold.com/integrations/orchestrators/dbt-cloud.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # dbt Cloud > Integrate Datafold with dbt Cloud to automate Data Diffs in your CI pipeline, leveraging dbt jobs to detect changes and ensure data quality before merging. **NOTE** You will need a dbt **Team** account or higher to access the dbt Cloud API that Datafold uses to connect the accounts. ## Prerequisites ### Set up dbt Cloud CI In dbt Cloud, [set up dbt Cloud CI](https://docs.getdbt.com/docs/deploy/cloud-ci-job) so that your Pull Request job runs when you open or update a Pull Request. This job will provide Datafold information about the changes included in the PR. ### Create an Artifacts Job in dbt Cloud The Artifacts job generates production `manifest.json` on merge to main/master, giving Datafold information about the state of production. The simplest method is to set up a dbt Cloud job that executes the `dbt ls` command on merge to main/master. > Note: `dbt ls` is preferred over `dbt compile` as it runs faster and data diffing does not require fully compiled models to work. Example dbt Cloud artifact job settings and successful run: If you are interested in continuous deployment, you can use a [Merge Trigger Production Job](https://docs.datafold.com/cd#merge-trigger-production-job) instead of the Artifacts Job listed above. ### dbt Cloud Access URL You will need your [access url](https://docs.getdbt.com/docs/cloud/about-cloud/regions-ip-addresses) to connect Datafold to your dbt Cloud account. ### Add dbt Cloud Service Account Token To connect Datafold to your dbt Cloud account, you will need to use a [Service Token](https://docs.getdbt.com/docs/dbt-cloud-apis/service-tokens). 
info Please note that the use of User API Keys for this purpose is no longer recommended due to a [recent security update](https://docs.getdbt.com/docs/dbt-cloud-apis/service-tokens) in dbt Cloud. [Learn more below](/integrations/orchestrators/dbt-cloud#deprecating-user-tokens) 1. Navigate to **Account Settings → Service Tokens → + New Token**. 1. Add a Permission Set and select `Member` or `Developer`. 2. Select `All Projects`, or check only the projects you intend to use with Datafold. 3. Save your changes. 1. Navigate to **Your Profile → API Access** and copy the token. #### Deprecating User Tokens dbt Cloud is transitioning away from the use of User API Keys for authentication. The User API Key will be replaced by account-scoped Personal Access Tokens (PATs). This update will affect the functionality of certain API endpoints. Specifically, `/v2/accounts`, `/v3/accounts`, and `/whoami` (undocumented API) will no longer return information about all the accounts tied to a user. Instead, the response will be filtered to include only the context of the specific account in the request. dbt Cloud users have until April 30, 2024, to implement this change. After this date, all user API keys will be scoped to an account. New customers are required to use the new account-scoped PATs. For more information, please refer to the [dbt Cloud API Documentation](https://docs.getdbt.com/docs/dbt-cloud-apis/service-tokens). If you have any questions or require further assistance, please don't hesitate to contact our support team. ## Create a dbt Cloud Integration in the Datafold app * Navigate to Settings > Integrations > CI and create a new dbt Cloud integration. ## Configuration ### Basic Settings * **Repository**: Select a repository that you set up in [the Code Repositories setup step](/integrations/code-repositories). * **Data Connection**: Select a connection that you set up in [the Data Connections setup step](/integrations/databases). * **Name**: This can be anything! * **Primary key tag**: This is a text string that you may use to tag primary keys in your dbt project yaml. Note that to avoid the need for tagging, [primary keys can be inferred from dbt uniqueness tests](/deployment-testing/configuration/primary-key). * **Account name**: This will be autofilled using your dbt API key. * **Job that creates dbt artifacts**: This will be [the Artifacts Job that you created](#create-an-artifacts-job-in-dbt-cloud). Or, if you have a dbt production job that runs on each merge to main, select that job. * **Job that builds pull requests**: This is the dbt CI job that is triggered when you open a Pull Request or Merge Request. ### Advanced Settings * **Enable Datafold in CI/CD**: High-level switch to turn Datafold off or on in CI (but we hope you'll leave it on!). * **Import dbt tags and descriptions**: Populate our Lineage tool with dbt metadata. ⚠️ This feature is in development. ⚠️ * **Slim Diff**: Only diff modified models in CI, instead of all models. [Please read more about Slim Diff](/deployment-testing/best-practices/slim-diff), which is highly configurable using dbt yaml, and each organization will need to set a strategy based on their data environment. * Downstream Hightouch models will be diffed even when Slim Diff is turned on. * **Diff Hightouch Models**: Hightouch customers can see diffs of downstream Hightouch assets in Pull Requests. * **CI fails on primary key issues**: The existence of null or duplicate primary keys causes the Datafold CI check to fail. 
* **Pull Request Label**: For when you want Datafold to *only* run in CI when a label is manually applied in GitHub/GitLab. * **CI Diff Threshold**: For when you want Datafold to *only* run automatically if the number of diffs doesn't exceed this threshold for a given CI run. * **Files to ignore**: If at least one modified file doesn’t match the ignore pattern, Datafold CI diffs all changed models in the PR. If all modified files should be ignored, Datafold CI does not run in the PR. ([Additional details.](/deployment-testing/configuration/datafold-ci/on-demand)) * **Custom base branch**: For when you want Datafold to **only** run in CI when a PR is opened against a specific base branch. You might need this if you have multiple environments built from different branches. See [Custom branch](https://docs.getdbt.com/faqs/Environments/custom-branch-settings) in dbt Cloud docs. Click save, and that's it! Now that you've set up a dbt Cloud integration, Datafold will diff your impacted tables whenever you push commits to a PR. A summary of the diff will appear in GitHub, and detailed results will appear in the Datafold app. --- # Source: https://docs.datafold.com/integrations/orchestrators/dbt-core.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # dbt Core > Set up Datafold’s integration with dbt Core to automate Data Diffs in your CI pipeline. **PREREQUISITES** * Create a [Data Connection Integration](/integrations/databases) where your dbt project data is built. * Create a [Code Repository Integration](/integrations/code-repositories) where your dbt project code is stored. ## Getting started To add Datafold to your continuous integration (CI) pipeline using dbt Core, follow these steps: ### 1. Create a dbt Core integration. ### 2. Set up the dbt Core integration. Complete the configuration by specifying the following fields: #### Basic settings | Field Name | Description | | ------------------ | ------------------------------------------------------------------------------------------ | | Configuration name | Choose a name for your Datafold dbt integration. | | Repository | Select your dbt project. | | Data Connection | Select the data connection your dbt project writes to. | | Primary key tag | Choose a string for [tagging primary keys](/deployment-testing/configuration/primary-key). | #### Advanced settings: Configuration | Field Name | Description | | -------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------ | | Import dbt tags and descriptions | Import dbt metadata (including column and table descriptions, tags, and owners) to Datafold. | | Slim Diff | Data diffs will be run only for models changed in a pull request. See our [guide to Slim Diff](/deployment-testing/best-practices/slim-diff) for configuration options. | | Diff Hightouch Models | Run Data Diffs for Hightouch models affected by your PR. | | CI fails on primary key issues | The existence of null or duplicate primary keys will cause CI to fail. | | Pull Request Label | When this is selected, the Datafold CI process will only run when the `datafold` label has been applied.
| | CI Diff Threshold | Data Diffs will only be run automatically for a given CI run if the number of diffs doesn't exceed this threshold. | | Branch commit selection strategy | Select "Latest" if your CI tool creates a merge commit (the default behavior for GitHub Actions). Choose "Merge base" if CI is run against the PR branch head (the default behavior for GitLab). | | Custom base branch | If defined, CI will run only on pull requests with the specified base branch. | | Columns to ignore | Use standard gitignore syntax to identify columns that Datafold should never diff for any table. This can [improve performance](/faq/performance-and-scalability#how-can-i-optimize-diff-performance-at-scale) for large datasets. Primary key columns will not be excluded even if they match the pattern. | | Files to ignore | If at least one modified file doesn’t match the ignore pattern, Datafold CI diffs all changed models in the PR. If all modified files should be ignored, Datafold CI does not run in the PR. ([Additional details.](/deployment-testing/configuration/datafold-ci/on-demand)) | #### Advanced settings: Sampling Sampling allows you to compare large datasets more efficiently by checking only a randomly selected subset of the data rather than every row. By analyzing a smaller but statistically meaningful sample, Datafold can quickly estimate differences without the overhead of a full dataset comparison. To learn more about how sampling can result in a speedup of 2x to 20x or more, see our [best practices on sampling](/data-diff/cross-database-diffing/best-practices#enable-sampling). | Field Name | Description | | ------------------- | ------------------------------------------------------------------------------------------------------- | | Enable sampling | Enable sampling for data diffs to optimize analyzing large datasets. | | Sampling tolerance | The tolerance to apply in sampling for all data diffs. | | Sampling confidence | The confidence to apply when sampling. | | Sampling threshold | Sampling will be disabled automatically if tables are smaller than the specified threshold. If unspecified, default values will be used depending on the Data Connection type. | ### 3. Obtain a Datafold API Key and CI config ID. After saving the settings in step 2, scroll down and generate a new Datafold API Key and obtain the CI config ID. ### 4. Configure your CI script(s) with the Datafold SDK. Using the Datafold SDK, configure your CI script(s) to upload dbt `manifest.json` files. The `datafold dbt upload` command takes this general form and arguments: ``` datafold dbt upload --ci-config-id <ci-config-id> --run-type <run-type> --commit-sha <commit-sha> ``` You will need to configure orchestration to upload the dbt `manifest.json` files in 2 scenarios: 1. **On merges to main.** These `manifest.json` files represent the state of the dbt project on the base/production branch from which PRs are created. 2. **On updates to PRs.** These `manifest.json` files represent the state of the dbt project on the PR branch. The dbt Core integration creation form automatically generates code snippets that can be added to CI runners. By storing and comparing these `manifest.json` files, Datafold determines which dbt models to diff in a CI run. Implementation details vary depending on which CI tool you use. Please review [these instructions and examples](#ci-implementation-tools) to help you configure updates to your organization's CI scripts. ### 5. Test your dbt Core integration.
After updating your CI scripts, trigger jobs that will upload `manifest.json` files that represent the base/production state. Then, open a new pull request with changes to a SQL file to trigger a CI run. ## CI implementation tools We've created guides and templates for three popular CI tools. **Having trouble setting up Datafold in CI?** We're here to help! Please reach out and [chat with a Datafold Solutions Engineer](https://www.datafold.com/booktime). To add Datafold to your CI tool, add `datafold dbt upload` steps in two CI jobs: * **Upload Production Artifacts:** A CI job that builds a production `manifest.json`. *This can be either your Production Job or a special Artifacts Job that runs on merge to main (explained below).* * **Upload Pull Request Artifacts:** A CI job that builds a PR `manifest.json`. This ensures Datafold always has the necessary `manifest.json` files, enabling us to run data diffs comparing production data to dev data. **Upload Production Artifacts** Add the `datafold dbt upload` step to *either* your Production Job *or* an Artifacts Job. **Production Job** If your dbt prod job kicks off on merges to the base branch, add a `datafold dbt upload` step after the `dbt build` step. ```bash theme={null} name: Production Job on: push: branches: - main jobs: run: runs-on: ubuntu-20.04 steps: - name: Install Datafold SDK run: pip install -q datafold-sdk - name: Upload dbt artifacts to Datafold run: datafold dbt upload --ci-config-id <ci-config-id> --run-type production --commit-sha ${GIT_SHA} env: DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }} GIT_SHA: "${{ github.sha }}" ``` **Artifacts Job** If your existing Production Job runs on a schedule and not on merges to the base branch, create a dedicated job that runs on merges to the base branch which generates and uploads a `manifest.json` file to Datafold. ```bash theme={null} name: Artifacts Job on: push: branches: - main jobs: run: runs-on: ubuntu-20.04 steps: - name: Install Datafold SDK run: pip install -q datafold-sdk - name: Generate dbt manifest.json run: dbt ls - name: Upload dbt artifacts to Datafold run: datafold dbt upload --ci-config-id <ci-config-id> --run-type production --commit-sha ${BASE_GIT_SHA} env: DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }} BASE_GIT_SHA: "${{ github.sha }}" ``` **Pull Request Artifacts** Include the `datafold dbt upload` step in your CI job that builds PR data. ```bash theme={null} name: Pull Request Job on: pull_request: push: branches: - '!main' jobs: run: runs-on: ubuntu-20.04 steps: - name: Install Datafold SDK run: pip install -q datafold-sdk - name: Upload PR manifest.json to Datafold run: | datafold dbt upload --ci-config-id <ci-config-id> --run-type pull_request --commit-sha ${PR_GIT_SHA} env: DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }} PR_GIT_SHA: "${{ github.event.pull_request.head.sha }}" ``` **Store Datafold API Key** Save the API key as `DATAFOLD_API_KEY` in your [GitHub repository settings](https://docs.github.com/en/actions/security-guides/encrypted-secrets#creating-encrypted-secrets-for-a-repository). **Upload Production Artifacts** Add the `datafold dbt upload` step to *either* your Production Job *or* an Artifacts Job. **Production Job** If your dbt prod job kicks off on merges to the base branch, add a `datafold dbt upload` step after the `dbt build` step.
```bash theme={null} version: 2.1 jobs: prod-job: filters: branches: only: main docker: - image: cimg/python:3.9 steps: - checkout - run: name: "Install Datafold SDK" command: pip install -q datafold-sdk - run: name: "Build dbt project" command: dbt build - run: name: "Upload production manifest.json to Datafold" command: | datafold dbt upload --ci-config-id <ci-config-id> --run-type production --target-folder ./target/ --commit-sha ${CIRCLE_SHA1} ``` **Artifacts Job** If your existing Production Job runs on a schedule and not on merges to the base branch, create a dedicated job that runs on merges to the base branch which generates and uploads a `manifest.json` file to Datafold. ```bash theme={null} version: 2.1 jobs: artifacts-job: filters: branches: only: main docker: - image: cimg/python:3.9 steps: - checkout - run: name: "Install Datafold SDK" command: pip install -q datafold-sdk - run: name: "Generate manifest.json" command: dbt ls --profiles-dir ./ - run: name: "Upload production manifest.json to Datafold" command: datafold dbt upload --ci-config-id <ci-config-id> --run-type production --target-folder ./target/ --commit-sha ${CIRCLE_SHA1} ``` **Store Datafold API Key** Save the API key in the [CircleCI interface](https://circleci.com/docs/set-environment-variable/). **Upload Production Artifacts** Add the `datafold dbt upload` step to *either* your Production Job *or* an Artifacts Job. **Production Job** If your dbt prod job kicks off on merges to the base branch, add a `datafold dbt upload` step after the `dbt build` step. ```bash theme={null} image: name: ghcr.io/dbt-labs/dbt-core:1.x run_pipeline: stage: deploy before_script: - pip install -q datafold-sdk script: - dbt build --profiles-dir ./ - datafold dbt upload --ci-config-id <ci-config-id> --run-type production --commit-sha $CI_COMMIT_SHA ``` **Artifacts Job** If your existing Production Job runs on a schedule and not on merges to the base branch, create a dedicated job that runs on merges to the base branch which generates and uploads a `manifest.json` file to Datafold. ```bash theme={null} image: name: ghcr.io/dbt-labs/dbt-core:1.x run_pipeline: stage: deploy before_script: - pip install -q datafold-sdk script: - dbt ls --profiles-dir ./ - datafold dbt upload --ci-config-id <ci-config-id> --run-type production --commit-sha $CI_COMMIT_SHA ``` **Store Datafold API Key** Save the API key as `DATAFOLD_API_KEY` in [GitLab repository settings](https://docs.gitlab.com/ee/ci/yaml/index.html#secrets). ## CI for dbt multi-projects When setting up CI for dbt multi-projects, each project should have its own dedicated CI integration to ensure that changes are validated independently. ## CI for dbt multi-projects within a monorepo When managing multiple dbt projects within a monorepo (a single repository), it’s essential to configure individual Datafold CI integrations for each project to ensure proper isolation. This approach prevents unintended triggering of CI processes for projects unrelated to the changes made. Here’s the recommended approach for setting it up in Datafold: **1. Create separate CI integrations:** Create separate CI integrations within Datafold, one for each dbt project within the monorepo. Each integration should be configured to reference the same GitHub repository. **2. Configure file filters**: For each CI integration, define file filters to specify which files should trigger the CI run. These filters prevent CI runs from being initiated when files from other projects in the monorepo are updated. **3.
Test and validate**: Before deployment, test each CI integration to validate that it triggers only when changes occur within its designated dbt project. Verify that modifications to files in one project do not inadvertently initiate CI processes for unrelated projects in the monorepo. ## Advanced configurations ### Skip Datafold in CI To skip the Datafold step in CI, include the string `datafold-skip-ci` in the last commit message. ### Programmatically trigger CI runs The Datafold app relies on the version control service webhooks to trigger the CI runs. When a dedicated cloud deployment is behind a VPN, webhooks cannot directly reach the deployment due to the network's restricted access. We can overcome this by triggering the CI runs via the [datafold-sdk](/api-reference/datafold-sdk) in the Actions/Job Runners, assuming they're running in the same network. Add a new Datafold SDK command after uploading the manifest in a PR job: **Important** When configuring your CI script, be sure to use `${{ github.event.pull_request.head.sha }}` for the **Pull Request Job** instead of `${{ github.sha }}`, which is often mistakenly used. `${{ github.sha }}` defaults to the latest commit SHA on the branch and **will not work correctly for pull requests**. ```Bash theme={null} - name: Trigger CI run: | set -ex datafold ci trigger --ci-config-id <ci-config-id> \ --pr-num ${PR_NUM} \ --base-branch ${BASE_BRANCH} \ --base-sha ${BASE_SHA} \ --pr-branch ${PR_BRANCH} \ --pr-sha ${PR_SHA} env: DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }} DATAFOLD_HOST: ${{ secrets.DATAFOLD_HOST }} PR_NUM: ${{ github.event.number }} PR_BRANCH: ${{ github.event.pull_request.head.ref }} BASE_BRANCH: ${{ github.event.pull_request.base.ref }} PR_SHA: ${{ github.event.pull_request.head.sha }} BASE_SHA: ${{ github.event.pull_request.base.sha }} ``` ### Running diffs before opening a PR Some teams want to show Data Diff results in their tickets *before* creating a pull request. This speeds up code reviews as developers can QA code changes before requesting a PR review. Check out how to automate this workflow [here](/faq/datafold-with-dbt#can-i-run-data-diffs-before-opening-a-pr). --- # Source: https://docs.datafold.com/data-explorer/best-practices/dbt-metadata-sync.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # dbt Metadata Sync > Datafold can automatically ingest dbt metadata from your production environment and display it in Data Explorer. **INFO** You can enable the metadata sync in your Orchestration settings. Please note that when this feature is enabled, user editing of table metadata is disabled. ### Model-level The following model-level information can be synced: * `description` is synchronized into the description field of the table in Lineage. * The `owner` of the table is set to the user identified by the `user@company.com` field. This user must exist in Datafold with that email. * The `foo` meta-information is added to the description field with the value `bar`. * The tags `pii` and `abc` are applied to the table. Here's an example configuration in YAML format: ```Bash theme={null} models: - name: users description: "Description of the table" meta: owner: user@company.com foo: bar tags: - pii - abc ``` ### Column-level The following column-level information can be synced: * The column `user_id` has two tags applied: `pk` and `id`.
* The metadata for `user_id` is ignored because it reflects the primary key tag. * The `email` column has the description applied. * The `email` column has the tag `pii` applied. * The `email` column has extra metadata information in the description field: `type` with the value `email`. Here's an example configuration for columns in YAML format: ```Bash theme={null} models: - name: users ... columns: - name: user_id tags: - pk - id meta: pk: true - name: email description: "The user's email" tags: - pii meta: type: email ``` --- # Source: https://docs.datafold.com/integrations/bi-data-apps/dbt.md # dbt Exposures > Incorporate dbt Exposures into your Datafold lineage. In dbt, Exposures allow you to define downstream uses of your data (e.g., in dashboards). You can include dbt Exposures in lineage within Data Explorer using our dbt Exposures integration. ## Set up the integration If you haven't already created a dbt CI integration, please start [there](/integrations/orchestrators/). 1. Visit Settings > BI & Data Apps > Add new integration 2. Select "dbt Exposures" 3. Enter a name for the integration (this can be anything) 4. Select your existing dbt CI integration from the dropdown 5. Save the integration Add dbt Exposures integration Configure dbt Exposures integration ## View dbt Exposures in Data Explorer Your dbt Exposures may not appear in lineage immediately after setting up the integration. To force an update, return to the integration settings and select "Sync now". When you visit Data Explorer, you'll now see the option to filter for dbt Exposures: Filter for dbt Exposures Your dbt Exposures will also appear in lineage: View dbt Exposures in lineage --- # Source: https://docs.datafold.com/api-reference/monitors/delete-a-monitor.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Delete a Monitor ## OpenAPI ````yaml openapi-public.json delete /api/v1/monitors/{id} openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/monitors/{id}: delete: tags: - Monitors summary: Delete a Monitor operationId: delete_monitor_api_v1_monitors__id__delete parameters: - description: The unique identifier of the monitor. in: path name: id required: true schema: description: The unique identifier of the monitor.
title: Id type: integer responses: '200': content: application/json: schema: {} description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/deployment-testing/configuration/model-specific-ci/diff-timeline.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Diff Timeline > Specify a `time_column` to visualize match rates between tables for each column over time. ```Bash theme={null} models: - name: users meta: datafold: datadiff: time_column: created_at ``` --- # Source: https://docs.datafold.com/integrations/databases/dremio.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Dremio **INFO** Column-level Lineage is not currently supported for Dremio. **INFO** Schemas for tables in external data sources need to be specified with quotes, e.g., "Postgres prod.analytics.sales". **Steps to complete:** 1. [Configure user in Dremio](/integrations/databases/dremio#configure-user-in-dremio) 2. [Create schema for Datafold](/integrations/databases/dremio#create-schema-for-datafold) 3. [Configure your data connection in Datafold](/integrations/databases/dremio#configure-in-datafold) ## Configure user in Dremio To connect to Dremio, create a user with read-only access to all data sources you wish to diff and generate an access token. Temporary tables will be created in the `$scratch` schema, which doesn't require special permissions. ## Create schema for Datafold Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in your warehouse. ## Configure in Datafold | Field Name | Description | | --------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------- | | Connection name | A name given to the data connection within Datafold. | | Host | The hostname for your Dremio instance (data.dremio.cloud for Dremio SaaS). | | Port | Dremio endpoint port; default value is 443. | | Encryption | Should be checked for Dremio Cloud, possibly unchecked for local deployments. | | User ID | User ID as created in Dremio, typically an email address. | | Project ID | Dremio Project UID. If left blank, the default project will be used. | | Token | Access token generated in Dremio. | | Password | Alternatively, provide a password. | | Schema for temporary views | A Dremio space for temporary views. | | Schema for temporary tables | \$scratch should suit most applications, or use "\.\" (with quotes) if you wish to create temporary tables in an external data source. | Click **Create**. Your data connection is now ready!
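To sanity-check the new connection, one option is to run a trivial query through it with the [Execute a SQL query against a data source](/api-reference/data-sources/execute-a-sql-query-against-a-data-source) endpoint. A minimal sketch using `curl`; the data source ID (`1234`) and the `SELECT 1` query are placeholders, and `DATAFOLD_API_KEY` is assumed to contain your API key:

```Bash theme={null}
# Run a trivial query through the newly created connection to confirm it works.
# Replace 1234 with the ID of the data source you just created.
curl -s -X POST "https://app.datafold.com/api/v1/data_sources/1234/query" \
  -H "Authorization: Key ${DATAFOLD_API_KEY}" \
  -H "Content-Type: application/json" \
  -d '{"query": "SELECT 1"}'
```

If the connection is configured correctly, the response contains a `rows` array with the query result.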
--- # Source: https://docs.datafold.com/deployment-testing/configuration/model-specific-ci/excluding-models.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Excluding Models > Use `never_diff` to exclude a model or subdirectory of models from data diffs. ```Bash theme={null} models: - name: users meta: datafold: datadiff: never_diff: true ``` --- # Source: https://docs.datafold.com/api-reference/data-sources/execute-a-sql-query-against-a-data-source.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Execute a SQL query against a data source > Executes a SQL query against the specified data source and returns the results. This endpoint allows you to run ad-hoc SQL queries for data exploration, validation, or analysis. The query is executed using the data source's native query runner with the appropriate credentials. **Streaming mode**: Use query parameter `?stream=true` or set `X-Stream-Response: true` header. Streaming is only supported for certain data sources (e.g., Databricks). When streaming, results are sent incrementally as valid JSON for memory efficiency. Returns: - Query results as rows with column metadata (name, type, description) - Limited to a reasonable number of rows for performance ## OpenAPI ````yaml openapi-public.json post /api/v1/data_sources/{data_source_id}/query openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/data_sources/{data_source_id}/query: post: tags: - Data sources summary: Execute a SQL query against a data source description: >- Executes a SQL query against the specified data source and returns the results. This endpoint allows you to run ad-hoc SQL queries for data exploration, validation, or analysis. The query is executed using the data source's native query runner with the appropriate credentials. **Streaming mode**: Use query parameter `?stream=true` or set `X-Stream-Response: true` header. Streaming is only supported for certain data sources (e.g., Databricks). When streaming, results are sent incrementally as valid JSON for memory efficiency. 
Returns: - Query results as rows with column metadata (name, type, description) - Limited to a reasonable number of rows for performance operationId: run_query parameters: - in: path name: data_source_id required: true schema: title: Data source ID type: integer - description: Stream results as JSON in: query name: stream required: false schema: default: false description: Stream results as JSON title: Stream type: boolean - in: header name: X-Stream-Response required: false schema: anyOf: - type: string - type: 'null' title: X-Stream-Response requestBody: content: application/json: schema: $ref: '#/components/schemas/ApiQueryRequest' required: true responses: '200': content: application/json: schema: $ref: '#/components/schemas/ApiQueryResult' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: ApiQueryRequest: properties: params: anyOf: - items: type: string type: array - type: 'null' description: Positional parameters for parameterized queries title: Params query: description: SQL query to execute title: Query type: string required: - query title: ApiQueryRequest type: object ApiQueryResult: properties: columns: anyOf: - items: $ref: '#/components/schemas/ApiQueryColumn' type: array - type: 'null' title: Columns rows: items: additionalProperties: true type: object title: Rows type: array required: - rows title: ApiQueryResult type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ApiQueryColumn: properties: db_type: anyOf: - type: string - type: 'null' title: Db Type description: anyOf: - type: string - type: 'null' title: Description is_nullable: anyOf: - type: boolean - type: 'null' title: Is Nullable name: title: Name type: string number: anyOf: - type: integer - type: 'null' title: Number type: anyOf: - type: string - type: 'null' title: Type required: - name title: ApiQueryColumn type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/lineagev2/execute-custom-cypher-queries-against-the-lineage-graph.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Execute custom Cypher queries against the lineage graph > Execute custom Cypher queries for advanced lineage analysis. Allows running arbitrary Cypher queries against the Memgraph lineage database. Returns results in both tabular format and graph format (nodes and edges). WARNING: This is a power-user endpoint. All queries are logged for audit purposes. 
Use this for custom analysis beyond the standard lineage endpoints, such as: - Finding circular dependencies - Complex multi-hop patterns - Aggregation queries across lineage paths - Custom graph algorithms ## OpenAPI ````yaml openapi-public.json post /api/v1/lineagev2/cypher openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/lineagev2/cypher: post: tags: - lineagev2 summary: Execute custom Cypher queries against the lineage graph description: >- Execute custom Cypher queries for advanced lineage analysis. Allows running arbitrary Cypher queries against the Memgraph lineage database. Returns results in both tabular format and graph format (nodes and edges). WARNING: This is a power-user endpoint. All queries are logged for audit purposes. Use this for custom analysis beyond the standard lineage endpoints, such as: - Finding circular dependencies - Complex multi-hop patterns - Aggregation queries across lineage paths - Custom graph algorithms operationId: lineagev2_run_cypher requestBody: content: application/json: schema: $ref: '#/components/schemas/CypherRequest' required: true responses: '200': content: application/json: schema: $ref: '#/components/schemas/CypherResponse' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: CypherRequest: properties: query: title: Query type: string required: - query title: CypherRequest type: object CypherResponse: properties: columns: items: type: string title: Columns type: array edges: items: $ref: '#/components/schemas/CypherEdge' title: Edges type: array nodes: items: $ref: '#/components/schemas/CypherNode' title: Nodes type: array results: items: additionalProperties: true type: object title: Results type: array required: - columns - results - nodes - edges title: CypherResponse type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object CypherEdge: properties: id: title: Id type: string properties: additionalProperties: true title: Properties type: object source: title: Source type: string target: title: Target type: string type: title: Type type: string required: - id - source - target - type - properties title: CypherEdge type: object CypherNode: properties: id: title: Id type: string labels: items: type: string title: Labels type: array properties: additionalProperties: true title: Properties type: object required: - id - labels - properties title: CypherNode type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with 
the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/support/faq-redirect.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Overview > Get answers to the most common questions regarding our product. Have a question that isn’t answered here? Feel free to reach out to us at [support@datafold.com](mailto:support@datafold.com), and we’ll be happy to assist you! --- # Source: https://docs.datafold.com/data-diff/file-diffing.md # File Diffing > Datafold allows you to diff files (e.g. CSV, Excel, Parquet, etc.) in a similar way to how you diff tables. If you'd like to enable file diffing for your organization, please contact [support@datafold.com](mailto:support@datafold.com). In addition to diffing data in tables, views, and SQL queries, Datafold allows you to diff data in files hosted in cloud storage. For example, you can diff between an Excel file and a Snowflake table, or between a CSV file and an Excel file. ## Supported cloud storage providers Datafold supports diffing files in the following cloud storage providers, with plans to support more in the future: * Amazon S3 * Azure Blob Storage * Azure Data Lake Storage (ADLS) * Google Cloud Storage ## Supported file types Datafold supports diffing the following file types: * Tabular text files (e.g. `.csv`, `.tsv`, `.txt`, `.dat`) * Excel (`.xlsx`, `.xls`) * Parquet (`.parquet`) ## Type-specific options Depending on the type of file you're diffing, you'll have a few options to specify how you'd like to parse the file. For example, when diffing a tabular text file, you can specify the delimiter and skip header/footer rows. File diffing options --- # Source: https://docs.datafold.com/deployment-testing/getting-started/universal/fully-automated.md # Fully-Automated > Automatically diff tables modified in a pull request with Datafold's Fully-Automated CI integration. Our Fully-Automated CI integration enables you to automatically diff tables modified in a pull request so you know exactly how your data will change before going to production. We do this by analyzing the SQL in any changed files, extracting the relevant table names, and diffing those tables between your staging and production environments. We then post the results of those diffs—including any downstream impact—to your pull request for all to see. All without manual intervention. ## Prerequisites * Your code must be hosted in one of our supported version control integrations * Your tables/views must be defined in SQL * Your schema names must be parameterized ([see below](#4-parameterize-schema-names)) * You must be automatically generating staging data ([more info](/deployment-testing/how-it-works)) ## Get Started Get started in just a few easy steps. ### 1. Generate a Datafold API key If you haven't already generated an API key (you only need one), visit Settings > Account and select **Create API Key**. Save the key somewhere safe like a password manager, as you won't be able to view it later. ### 2. Set up a version control integration Open the Datafold app and navigate to Settings > Integrations > Repositories to connect the repository that contains the code you'd like to automatically diff. ### 3. Add a step to your CI workflow This example assumes you're using GitHub actions, but the approach generalizes to any version control tool we support including GitLab, Bitbucket, etc. 
Either [create a new GitHub Action](https://docs.github.com/en/actions/writing-workflows/quickstart) or add the following steps to an existing one: ```yaml theme={null} - name: Install datafold-sdk run: pip install -q datafold-sdk - name: Trigger Datafold CI run: | datafold ci auto trigger --ci-config-id $CI_CONF_ID --pr-num $PR_NUM --base-sha $BASE_SHA --pr-sha $PR_SHA --reference-params "$REFERENCE_PARAMS" --pr-params "$PR_PARAMS" env: DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }} CI_CONF_ID: 436 PR_NUM: "${{ steps.findPr.outputs.pr }}" PR_SHA: "${{ github.event.pull_request.head.sha }}" BASE_SHA: ${{ github.event.pull_request.base.sha }} REFERENCE_PARAMS: '{ "target_schema": "nc_default" }' PR_PARAMS: "{ \"target_schema\": \"${{ env.TARGET_SCHEMA }}\" }" ``` ### 4. Parameterize schema names If it's not already the case, you'll need to parameterize the schema for any table paths you'd like Datafold to diff. For example, let's say you have a file called `dim_orgs.sql` that defines a table called `DIM_ORGS` in your warehouse. Your SQL should look something like this: ```sql theme={null} -- datafold: pk=org_id CREATE OR REPLACE TABLE analytics.${target_schema}.dim_orgs AS ( SELECT org_id, org_name, employee_count, created_at FROM analytics.${target_schema}.org_created ); ``` ### 5. Provide primary keys (optional) While this step is technically optional, we strongly recommend providing primary keys for any tables you'd like Datafold to diff. In order for Datafold to perform full value-level comparisons between staging and production tables, Datafold needs to know the primary keys. To provide this information, place a comment above each query using the `-- datafold: pk=` syntax shown below: ```sql theme={null} -- datafold: pk=org_id CREATE OR REPLACE TABLE analytics.${target_schema}.dim_orgs AS ( SELECT org_id, ... ``` ### 6. Create a pull request When you create a pull request, Datafold will automatically detect it, attempt to diff any tables modified in the code, and post a summary as a comment in the PR. You can click through on the comment to view a more complete analysis of the changes in the Datafold app. Happy diffing! ## Need help? If you have any questions about Fully-Automated CI, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com). --- # Source: https://docs.datafold.com/datafold-deployment/dedicated-cloud/gcp.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Datafold VPC Deployment on GCP > Learn how to deploy Datafold in a Virtual Private Cloud (VPC) on GCP. **INFO** VPC deployments are an Enterprise feature. Please email [sales@datafold.com](mailto:sales@datafold.com) to enable your account. ## Create a Domain Name (optional) You can either choose to use your domain (for example, `datafold.domain.tld`) or to use a Datafold managed domain (for example, `yourcompany.dedicated.datafold.com`). ### Customer Managed Domain Name Create a DNS A-record for the domain where Datafold will be hosted. For the DNS record, there are two options: * **Public-facing:** When the domain is publicly available, we will provide an SSL certificate for the endpoint. * **Internal:** It is also possible to have Datafold disconnected from the internet. This would require an internal DNS (for example, AWS Route 53) record that points to the Datafold instance. 
It is possible to provide your own certificate for setting up the SSL connection. Once the deployment is complete, you will point that A-record to the IP address of the Datafold service. ## Create a New Project For isolation reasons, it is best practice to [create a new project](https://console.cloud.google.com/projectcreate) within your GCP organization. Please call it something like `yourcompany-datafold` to make it easy to identify. After a minute or so, you should receive confirmation that the project has been created. Afterward, you should be able to see the new project. ## Set IAM Permissions Navigate to the **IAM** tab in the sidebar and click **Grant Access** to invite Datafold to the project. Add your Datafold solutions engineer as a **principal**. You have two options for assigning IAM permissions to the Datafold engineers. 1. Assign them as an **owner** of your project. 2. Assign the extended set of [Minimal IAM Permissions](#minimal-iam-permissions). The owner role is only required temporarily while we configure and test the initial Datafold deployment. We'll inform you when it is OK to revoke this permission and provide us with only the [Minimal IAM Permissions](#minimal-iam-permissions). ### Required APIs The following GCP APIs need to be additionally enabled to run Datafold: 1. [Compute Engine API](https://console.cloud.google.com/apis/library/compute.googleapis.com) 2. [Secret Manager API](https://console.cloud.google.com/apis/api/secretmanager.googleapis.com) The following GCP APIs that we use are already enabled by default when you create the project: 1. [Cloud Logging API](https://console.cloud.google.com/apis/api/logging.googleapis.com) 2. [Cloud Monitoring API](https://console.cloud.google.com/apis/api/monitoring.googleapis.com) 3. [Cloud Storage](https://console.cloud.google.com/apis/api/storage-component.googleapis.com) 4. [Service Networking API](https://console.cloud.google.com/apis/api/servicenetworking.googleapis.com) Once the access has been granted, make sure to notify Datafold so we can initiate the deployment. ### Minimal IAM Permissions Because we work in a Project dedicated to Datafold, there is no direct access to your resources unless explicitly configured (e.g., VPC Peering). The following IAM roles are required to update and maintain the infrastructure. ```Bash theme={null} Cloud SQL Admin Compute Load Balancer Admin Compute Network Admin Compute Security Admin Compute Storage Admin IAP-secured Tunnel User Kubernetes Engine Admin Kubernetes Engine Cluster Admin Role Viewer Service Account User Storage Admin Viewer ``` Some roles we only need from time to time, for example during the first deployment. Since those are IAM-related, we will ask for temporary permissions when required. ```Bash theme={null} Role Administrator Security Admin Service Account Key Admin Service Account Admin Service Usage Admin ``` # Datafold Google Cloud infrastructure details This document provides detailed information about the Google Cloud infrastructure components deployed by the Datafold Terraform module, explaining the architectural decisions and operational considerations for each component. ## Persistent disks The Datafold application requires three persistent disks for storage, each deployed as encrypted Google Compute Engine persistent disks in the primary availability zone. This also means that pods cannot be deployed outside the availability zone of these disks, because the nodes wouldn't be able to attach them.
**ClickHouse data disk** serves as the analytical database storage for Datafold. ClickHouse is a columnar database that excels at analytical queries. The default 40GB allocation usually provides sufficient space for typical deployments, but it can be scaled up based on data volume requirements. The pd-balanced disk type provides consistent performance for analytical workloads with automatically managed IOPS and throughput. **ClickHouse logs disk** stores ClickHouse's internal logs and temporary data. The separate logs disk prevents log data from consuming IOPS and I/O performance from actual data storage. **Redis data disk** provides persistent storage for Redis, which handles task distribution and distributed locks in the Datafold application. Redis is memory-first but benefits from persistence for data durability across restarts. The 50GB default size accommodates typical caching needs while remaining cost-effective. All persistent disks are encrypted by default using Google-managed encryption keys, ensuring data security at rest. The disks are deployed in the first availability zone to minimize latency and simplify backup strategies. ## Load balancer The load balancer serves as the primary entry point for all external traffic to the Datafold application. The module offers 2 deployment strategies, each with different operational characteristics and trade-offs. **External Load Balancer Deployment** (the default approach) creates a Google Cloud Load Balancer through Terraform. This approach provides centralized control over load balancer configuration and integrates well with existing Google Cloud infrastructure. The load balancer automatically handles SSL termination, health checks, and traffic distribution across Kubernetes pods. This method is ideal for organizations that prefer infrastructure-as-code management and want consistent load balancer configurations across environments. **Kubernetes-Managed Load Balancer** deployment sets `deploy_lb = false` and relies on the Google Cloud Load Balancer Controller running within the GKE cluster. This approach leverages Kubernetes-native load balancer management, allowing for dynamic scaling and easier integration with Kubernetes ingress resources. The controller automatically provisions and manages load balancers based on Kubernetes service definitions, which can be more flexible for applications that need to scale load balancer resources dynamically. For external load balancers deployed through Kubernetes, the infrastructure developer needs to create SSL policies and Cloud Armor policies separately and attach them to the load balancer through annotations. Internal load balancers cannot have SSL policies or Cloud Armor applied. Our Helm charts support various deployment types including internal/external load balancers with uploaded certificates or certificates stored in Kubernetes secrets. The choice between these approaches often depends on operational preferences and existing infrastructure patterns. External deployment provides more predictable resource management, while Kubernetes-managed deployment offers greater flexibility for dynamic workloads. **Security** A firewall rule shared between the load balancer and the GKE nodes allows traffic to reach only the GKE nodes and nothing else. The load balancer allows traffic to land directly into the GKE private subnet. **Certificate** The certificate can be pre-created by the customer and then attached, or a Google-managed SSL certificate can be created on the fly. 
The application will not function without HTTPS, so a certificate is mandatory. After the certificate is created either manually or through this repository, it must be validated by the DNS administrator by adding an A record. This puts the certificate in the "ACTIVE" state. The certificate cannot be found while it is still provisioning. ## GKE cluster The Google Kubernetes Engine (GKE) cluster forms the compute foundation for the Datafold application, providing a managed Kubernetes environment optimized for Google Cloud infrastructure. **Network Architecture** The entire cluster is deployed into private subnets. This means the data plane is not reachable from the Internet except through the load balancer. A Cloud NAT allows the cluster to reach the internet (egress traffic) for downloading pod images, optionally sending Datadog logs and metrics, and retrieving the version to apply to the cluster from our portal. The control plane is accessible via a private endpoint using a Private Service Connect setup from, for example, a VPN VPC elsewhere. This is a private+public endpoint, so the control plane can also be made accessible through the Internet, but then the appropriate CIDR restrictions should be put in place. For a typical dedicated cloud deployment of Datafold, only around 100 IPs are needed. This assumes 3 e2-standard-8 instances where one node runs ClickHouse+Redis, another node runs the application, and a third node may be put in place when version rollovers occur. This means a subnet of size /24 (253 IPs) should be sufficient to run this application, but you can always apply a different CIDR per subnet if needed. By default, the repository creates a VPC and subnets, but by specifying the VPC ID of an already existing VPC, the cluster and load balancer get deployed into existing network infrastructure. This is important for some customers who deploy a different architecture without Cloud NAT, firewall options that check egress, and other DLP controls. **Add-ons** The cluster includes essential add-ons like CoreDNS for service discovery, VPC-native networking, and the GCE persistent disk CSI driver for persistent volume management. These components are automatically updated and maintained by Google, reducing operational overhead. **Node Management** supports up to three managed node pools, allowing for workload-specific resource allocation. Each node pool can be configured with different machine types, enabling cost optimization and performance tuning for different application components. The cluster autoscaler automatically adjusts node count based on resource demands, ensuring efficient resource utilization while maintaining application availability. One typical way to deploy is to let the application pods go on a wider range of nodes, and set up tolerations and labels on the second node pool, which are then selected by both Redis and ClickHouse. This is because Redis and ClickHouse are tied to the zone where their persistent disks live, and ClickHouse is a bit more CPU intensive. This method optimizes CPU performance for the Datafold application.
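As an illustration of that pattern, the dedicated node pool can be created with a label and a taint that only the ClickHouse and Redis pods select and tolerate. A minimal sketch; the pool name, zone, and the `datafold.com/storage` key are hypothetical, and the actual node selectors and tolerations are configured through the Helm chart values:

```Bash theme={null}
# Sketch: a second node pool pinned to the zone that holds the persistent disks,
# labeled and tainted so that only ClickHouse and Redis pods land on it.
gcloud container node-pools create datafold-storage-pool \
  --cluster=yourcompany-datafold \
  --zone=us-central1-a \
  --machine-type=e2-standard-8 \
  --num-nodes=1 \
  --node-labels=datafold.com/storage=true \
  --node-taints=datafold.com/storage=true:NoSchedule
```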
**Security Features** include several critical security configurations: * **Workload Identity** is enabled and configured with the project's workload pool, providing fine-grained IAM permissions to Kubernetes pods without requiring Google Cloud credentials in container images * **Shielded nodes** are enabled with secure boot and integrity monitoring for enhanced node security * **Binary authorization** is configured with project singleton policy enforcement to ensure only authorized container images can be deployed * **Network policy** is enabled using Calico for pod-to-pod communication control * **Private nodes** are enabled, ensuring all node traffic goes through the VPC network These security features follow the principle of least privilege and integrate seamlessly with Google Cloud security services. ## IAM roles and permissions The IAM architecture follows the principle of least privilege, providing specific permissions only where needed. Service accounts in Kubernetes are mapped to IAM roles using Workload Identity, enabling secure access to Google Cloud services without embedding credentials in application code. **GKE service account** is created with basic permissions for logging, monitoring, and storage access. This service account is used by the GKE nodes and provides the foundation for cluster operations. **ClickHouse backup service account** is created with a custom role that allows ClickHouse to make backups and store them on Cloud Storage. This service account uses Workload Identity to securely access Cloud Storage without embedding credentials. **Datafold roles** Datafold has roles per pod pre-defined which can have their permissions assigned when they need them. At the moment, we have two specific roles in use. One is for the ClickHouse pod to be able to make backups and store them on Cloud Storage. The other is for the use of the Vertex AI service for our AI offering. These roles are automatically created and configured when the cluster is deployed, ensuring that the necessary permissions are in place for the cluster to function properly. The Datafold and ClickHouse service accounts authenticate using Workload Identity, which means these permissions are automatically rotated and managed by Google, reducing security risks associated with long-lived credentials. ## Cloud SQL database The PostgreSQL Cloud SQL instance serves as the primary relational database for the Datafold application, storing user data, configuration, and application state. **Storage configuration** starts with a 20GB initial allocation that can automatically scale up to 100GB based on usage patterns. This auto-scaling feature prevents storage-related outages while avoiding over-provisioning. For typical deployments, storage usage remains under 200GB, though some high-volume deployments may approach 400GB. The pd-balanced storage type provides consistent performance with configurable IOPS and throughput. **High availability** is intentionally disabled by default, meaning the database runs in a single availability zone. This configuration reduces costs and complexity while still providing excellent reliability. The database includes automated backups with 7-day retention, ensuring data can be recovered in case of failures. For organizations requiring higher availability, multi-zone deployment can be enabled, though this significantly increases costs. **Security and encryption** always encrypts data at rest using Google-managed encryption keys by default. 
The database is deployed in private subnets with firewall rules that restrict access to only the GKE cluster, ensuring network-level security. The database configuration prioritizes operational simplicity and cost-effectiveness while maintaining the security and reliability required for production workloads. The combination of automated backups, encryption, and network isolation provides a robust foundation for the application's data storage needs. --- # Source: https://docs.datafold.com/api-reference/data-diffs/get-a-data-diff-summary.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get a data diff summary ## OpenAPI ````yaml get /api/v1/datadiffs/{datadiff_id}/summary_results openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/datadiffs/{datadiff_id}/summary_results: get: tags: - Data diffs summary: Get a data diff summary operationId: get_diff_summary_v1_api_v1_datadiffs__datadiff_id__summary_results_get parameters: - in: path name: datadiff_id required: true schema: title: Data diff id type: integer responses: '200': content: application/json: schema: anyOf: - $ref: '#/components/schemas/ApiDataDiffSummaryForDone' - $ref: '#/components/schemas/ApiCrossDataDiffSummaryForDone' - $ref: '#/components/schemas/ApiDataDiffSummaryForFailed' - $ref: '#/components/schemas/ApiDataDiffSummaryForRunning' - $ref: '#/components/schemas/InternalApiDataDiffDependencies' title: >- Response Get Diff Summary V1 Api V1 Datadiffs Datadiff Id Summary Results Get description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: ApiDataDiffSummaryForDone: properties: dependencies: items: $ref: '#/components/schemas/ApiCIDependency' title: Dependencies type: array materialized_results: $ref: '#/components/schemas/ApiMaterializedResults' description: Results of the diff, materialized into tables. 
pks: $ref: '#/components/schemas/ApiDataDiffSummaryPKs' schema: $ref: '#/components/schemas/ApiDataDiffSummarySchema' status: enum: - done - success title: Status type: string values: anyOf: - $ref: '#/components/schemas/ApiDataDiffSummaryValues' - type: 'null' required: - status - pks - dependencies - schema - materialized_results title: ApiDataDiffSummaryForDone type: object ApiCrossDataDiffSummaryForDone: properties: pks: anyOf: - $ref: '#/components/schemas/ApiDataDiffSummaryPKs' - type: 'null' status: enum: - done - success title: Status type: string values: anyOf: - $ref: '#/components/schemas/ApiDataDiffSummaryValues' - type: 'null' required: - status title: ApiCrossDataDiffSummaryForDone type: object ApiDataDiffSummaryForFailed: properties: error: anyOf: - $ref: '#/components/schemas/ApiDataDiffError' - additionalProperties: true type: object title: Error status: const: failed title: Status type: string required: - status - error title: ApiDataDiffSummaryForFailed type: object ApiDataDiffSummaryForRunning: properties: status: enum: - running - pending title: Status type: string required: - status title: ApiDataDiffSummaryForRunning type: object InternalApiDataDiffDependencies: properties: dependencies: items: $ref: '#/components/schemas/ApiCIDependency' title: Dependencies type: array status: enum: - done - success title: Status type: string required: - status - dependencies title: InternalApiDataDiffDependencies type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ApiCIDependency: properties: data_source_id: title: Data Source Id type: integer data_source_type: title: Data Source Type type: string item_type: title: Item Type type: string name: title: Name type: string path: items: type: string title: Path type: array popularity: anyOf: - type: integer - type: 'null' title: Popularity primary_key: anyOf: - type: string - type: 'null' title: Primary Key query_type: anyOf: - type: string - type: 'null' title: Query Type raw_sql: anyOf: - type: string - type: 'null' title: Raw Sql remote_id: anyOf: - type: string - type: 'null' title: Remote Id table_name: anyOf: - type: string - type: 'null' title: Table Name uid: title: Uid type: string required: - uid - item_type - name - path - data_source_id - data_source_type title: ApiCIDependency type: object ApiMaterializedResults: properties: diff: anyOf: - items: $ref: '#/components/schemas/ApiMaterializedResult' type: array - type: 'null' description: >- Results of row-to-row comparison between dataset A and B. Semantics is the same as for `exclusive_pks1` field. title: Diff duplicates1: anyOf: - items: $ref: '#/components/schemas/ApiMaterializedResult' type: array - type: 'null' description: >- Rows with duplicate primary keys detected in dataset A. Semantics is the same as for `exclusive_pks1` field. title: Duplicates1 duplicates2: anyOf: - items: $ref: '#/components/schemas/ApiMaterializedResult' type: array - type: 'null' description: >- Rows with duplicate primary keys detected in dataset B. Semantics is the same as for `exclusive_pks1` field. title: Duplicates2 exclusives: anyOf: - items: $ref: '#/components/schemas/ApiMaterializedResult' type: array - type: 'null' description: >- Rows with exclusive primary keys detected in dataset A and B. `None` if table is not ready yet or if materialization wasn't requested. 
If materialization is completed, for a diff inside a single database the field will contain a list with one element. If diff compares tables in different databases, the list may contain one or two entries. title: Exclusives title: ApiMaterializedResults type: object ApiDataDiffSummaryPKs: properties: distincts: maxItems: 2 minItems: 2 prefixItems: - type: integer - type: integer title: Distincts type: array dupes: maxItems: 2 minItems: 2 prefixItems: - type: integer - type: integer title: Dupes type: array exclusives: maxItems: 2 minItems: 2 prefixItems: - type: integer - type: integer title: Exclusives type: array nulls: maxItems: 2 minItems: 2 prefixItems: - type: integer - type: integer title: Nulls type: array total_rows: maxItems: 2 minItems: 2 prefixItems: - type: integer - type: integer title: Total Rows type: array required: - total_rows - nulls - dupes - exclusives - distincts title: ApiDataDiffSummaryPKs type: object ApiDataDiffSummarySchema: properties: column_counts: maxItems: 2 minItems: 2 prefixItems: - type: integer - type: integer title: Column Counts type: array column_reorders: title: Column Reorders type: integer column_type_differs: items: type: string title: Column Type Differs type: array column_type_mismatches: title: Column Type Mismatches type: integer columns_mismatched: maxItems: 2 minItems: 2 prefixItems: - type: integer - type: integer title: Columns Mismatched type: array exclusive_columns: items: items: type: string type: array title: Exclusive Columns type: array required: - columns_mismatched - column_type_mismatches - column_reorders - column_counts - column_type_differs - exclusive_columns title: ApiDataDiffSummarySchema type: object ApiDataDiffSummaryValues: properties: columns_diff_stats: items: additionalProperties: anyOf: - type: number - type: string type: object title: Columns Diff Stats type: array columns_with_differences: title: Columns With Differences type: integer compared_columns: title: Compared Columns type: integer rows_with_differences: title: Rows With Differences type: integer total_rows: title: Total Rows type: integer total_values: title: Total Values type: integer values_with_differences: title: Values With Differences type: integer required: - total_rows - rows_with_differences - total_values - values_with_differences - compared_columns - columns_with_differences - columns_diff_stats title: ApiDataDiffSummaryValues type: object ApiDataDiffError: properties: error_type: title: Error Type type: string error_value: title: Error Value type: string required: - error_type - error_value title: ApiDataDiffError type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object ApiMaterializedResult: properties: data_source_id: description: Id of the DataSource where the table is located title: Data Source Id type: integer is_sampled: description: If sampling was applied title: Is Sampled type: boolean path: description: Path segments of the table items: type: string title: Path type: array required: - data_source_id - path - is_sampled title: ApiMaterializedResult type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/data-diffs/get-a-data-diff.md > ## Documentation Index > Fetch the complete 
documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get a data diff ## OpenAPI ````yaml get /api/v1/datadiffs/{datadiff_id} openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/datadiffs/{datadiff_id}: get: tags: - Data diffs summary: Get a data diff operationId: get_datadiff_api_v1_datadiffs__datadiff_id__get parameters: - in: path name: datadiff_id required: true schema: title: Data diff id type: integer - in: query name: poll required: false schema: title: Poll responses: '200': content: application/json: schema: $ref: '#/components/schemas/ApiDataDiffWithProgressState' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: ApiDataDiffWithProgressState: properties: affected_columns: anyOf: - items: type: string type: array - type: 'null' title: Affected Columns algorithm: anyOf: - $ref: '#/components/schemas/DiffAlgorithm' - type: 'null' archived: default: false title: Archived type: boolean bisection_factor: anyOf: - type: integer - type: 'null' title: Bisection Factor bisection_threshold: anyOf: - type: integer - type: 'null' title: Bisection Threshold ci_base_branch: anyOf: - type: string - type: 'null' title: Ci Base Branch ci_pr_branch: anyOf: - type: string - type: 'null' title: Ci Pr Branch ci_pr_num: anyOf: - type: integer - type: 'null' title: Ci Pr Num ci_pr_sha: anyOf: - type: string - type: 'null' title: Ci Pr Sha ci_pr_url: anyOf: - type: string - type: 'null' title: Ci Pr Url ci_pr_user_display_name: anyOf: - type: string - type: 'null' title: Ci Pr User Display Name ci_pr_user_email: anyOf: - type: string - type: 'null' title: Ci Pr User Email ci_pr_user_id: anyOf: - type: string - type: 'null' title: Ci Pr User Id ci_pr_username: anyOf: - type: string - type: 'null' title: Ci Pr Username ci_run_id: anyOf: - type: integer - type: 'null' title: Ci Run Id ci_sha_url: anyOf: - type: string - type: 'null' title: Ci Sha Url column_mapping: anyOf: - items: maxItems: 2 minItems: 2 prefixItems: - type: string - type: string type: array type: array - type: 'null' title: Column Mapping columns_to_compare: anyOf: - items: type: string type: array - type: 'null' title: Columns To Compare compare_duplicates: anyOf: - type: boolean - type: 'null' title: Compare Duplicates created_at: anyOf: - format: date-time type: string - type: 'null' title: Created At data_app_metadata: anyOf: - $ref: '#/components/schemas/TDataDiffDataAppMetadata' - type: 'null' data_app_type: anyOf: - type: string - type: 'null' title: Data App Type data_source1_id: title: Data Source1 Id type: integer data_source1_session_parameters: anyOf: - additionalProperties: true type: object - type: 'null' title: Data Source1 Session Parameters data_source2_id: title: Data Source2 Id type: 
integer data_source2_session_parameters: anyOf: - additionalProperties: true type: object - type: 'null' title: Data Source2 Session Parameters datetime_tolerance: anyOf: - type: integer - type: 'null' title: Datetime Tolerance diff_progress: anyOf: - $ref: '#/components/schemas/DiffProgress' - type: 'null' diff_stats: anyOf: - $ref: '#/components/schemas/DiffStats' - type: 'null' diff_tolerance: anyOf: - type: number - type: 'null' title: Diff Tolerance diff_tolerances_per_column: anyOf: - items: $ref: '#/components/schemas/ColumnTolerance' type: array - type: 'null' title: Diff Tolerances Per Column done: anyOf: - type: boolean - type: 'null' default: false title: Done download_limit: anyOf: - type: integer - type: 'null' title: Download Limit exclude_columns: anyOf: - items: type: string type: array - type: 'null' title: Exclude Columns execute_as_user: anyOf: - type: boolean - type: 'null' title: Execute As User file1: anyOf: - format: uri minLength: 1 type: string - type: 'null' title: File1 file1_options: anyOf: - discriminator: mapping: csv: '#/components/schemas/CSVFileOptions' excel: '#/components/schemas/ExcelFileOptions' parquet: '#/components/schemas/ParquetFileOptions' propertyName: file_type oneOf: - $ref: '#/components/schemas/CSVFileOptions' - $ref: '#/components/schemas/ExcelFileOptions' - $ref: '#/components/schemas/ParquetFileOptions' - type: 'null' title: File1 Options file2: anyOf: - format: uri minLength: 1 type: string - type: 'null' title: File2 file2_options: anyOf: - discriminator: mapping: csv: '#/components/schemas/CSVFileOptions' excel: '#/components/schemas/ExcelFileOptions' parquet: '#/components/schemas/ParquetFileOptions' propertyName: file_type oneOf: - $ref: '#/components/schemas/CSVFileOptions' - $ref: '#/components/schemas/ExcelFileOptions' - $ref: '#/components/schemas/ParquetFileOptions' - type: 'null' title: File2 Options filter1: anyOf: - type: string - type: 'null' title: Filter1 filter2: anyOf: - type: string - type: 'null' title: Filter2 finished_at: anyOf: - format: date-time type: string - type: 'null' title: Finished At id: anyOf: - type: integer - type: 'null' title: Id include_columns: anyOf: - items: type: string type: array - type: 'null' title: Include Columns kind: $ref: '#/components/schemas/DiffKind' materialization_destination_id: anyOf: - type: integer - type: 'null' title: Materialization Destination Id materialize_dataset1: anyOf: - type: boolean - type: 'null' title: Materialize Dataset1 materialize_dataset2: anyOf: - type: boolean - type: 'null' title: Materialize Dataset2 materialize_without_sampling: anyOf: - type: boolean - type: 'null' default: false title: Materialize Without Sampling monitor_error: anyOf: - $ref: '#/components/schemas/QueryError' - type: 'null' monitor_id: anyOf: - type: integer - type: 'null' title: Monitor Id monitor_state: anyOf: - $ref: '#/components/schemas/MonitorRunState' - type: 'null' per_column_diff_limit: anyOf: - type: integer - type: 'null' title: Per Column Diff Limit pk_columns: items: type: string title: Pk Columns type: array purged: default: false title: Purged type: boolean query1: anyOf: - type: string - type: 'null' title: Query1 query2: anyOf: - type: string - type: 'null' title: Query2 result: anyOf: - enum: - error - bad-pks - different - missing-pks - identical - empty type: string - type: 'null' title: Result result_revisions: additionalProperties: type: integer default: {} title: Result Revisions type: object result_statuses: anyOf: - additionalProperties: type: string type: 
object - type: 'null' title: Result Statuses run_profiles: anyOf: - type: boolean - type: 'null' title: Run Profiles runtime: anyOf: - type: number - type: 'null' title: Runtime sampling_confidence: anyOf: - type: number - type: 'null' title: Sampling Confidence sampling_max_rows: anyOf: - type: integer - type: 'null' title: Sampling Max Rows sampling_ratio: anyOf: - type: number - type: 'null' title: Sampling Ratio sampling_threshold: anyOf: - type: integer - type: 'null' title: Sampling Threshold sampling_tolerance: anyOf: - type: number - type: 'null' title: Sampling Tolerance source: anyOf: - $ref: '#/components/schemas/JobSource' - type: 'null' status: anyOf: - $ref: '#/components/schemas/JobStatus' - type: 'null' table1: anyOf: - items: type: string type: array - type: 'null' title: Table1 table2: anyOf: - items: type: string type: array - type: 'null' title: Table2 table_modifiers: anyOf: - items: $ref: '#/components/schemas/TableModifiers' type: array - type: 'null' title: Table Modifiers tags: anyOf: - items: type: string type: array - type: 'null' title: Tags temp_schema_override: anyOf: - items: type: string type: array - type: 'null' title: Temp Schema Override time_aggregate: anyOf: - $ref: '#/components/schemas/TimeAggregateEnum' - type: 'null' time_column: anyOf: - type: string - type: 'null' title: Time Column time_interval_end: anyOf: - format: date-time type: string - type: 'null' title: Time Interval End time_interval_start: anyOf: - format: date-time type: string - type: 'null' title: Time Interval Start time_travel_point1: anyOf: - type: integer - format: date-time type: string - type: string - type: 'null' title: Time Travel Point1 time_travel_point2: anyOf: - type: integer - format: date-time type: string - type: string - type: 'null' title: Time Travel Point2 tolerance_mode: anyOf: - $ref: '#/components/schemas/ToleranceModeEnum' - type: 'null' updated_at: anyOf: - format: date-time type: string - type: 'null' title: Updated At user_id: anyOf: - type: integer - type: 'null' title: User Id required: - data_source1_id - data_source2_id - pk_columns - kind title: ApiDataDiffWithProgressState type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object DiffAlgorithm: enum: - join - hash - hash_v2_alpha - fetch_and_join title: DiffAlgorithm type: string TDataDiffDataAppMetadata: properties: data_app_id: title: Data App Id type: integer data_app_model1_id: anyOf: - type: string - type: 'null' title: Data App Model1 Id data_app_model1_name: anyOf: - type: string - type: 'null' title: Data App Model1 Name data_app_model2_id: anyOf: - type: string - type: 'null' title: Data App Model2 Id data_app_model2_name: anyOf: - type: string - type: 'null' title: Data App Model2 Name data_app_model_type: title: Data App Model Type type: string meta_data: additionalProperties: true title: Meta Data type: object required: - data_app_id - data_app_model_type - meta_data title: TDataDiffDataAppMetadata type: object DiffProgress: properties: completed_steps: anyOf: - type: integer - type: 'null' title: Completed Steps total_steps: anyOf: - type: integer - type: 'null' title: Total Steps version: title: Version type: string required: - version title: DiffProgress type: object DiffStats: properties: diff_duplicate_pks: anyOf: - type: number - type: 'null' title: Diff Duplicate Pks diff_null_pks: anyOf: - type: number - type: 'null' title: Diff Null Pks diff_pks: anyOf: - type: number 
- type: 'null' title: Diff Pks diff_rows: anyOf: - type: number - type: 'null' title: Diff Rows diff_rows_count: anyOf: - type: integer - type: 'null' title: Diff Rows Count diff_rows_number: anyOf: - type: number - type: 'null' title: Diff Rows Number diff_schema: anyOf: - type: number - type: 'null' title: Diff Schema diff_values: anyOf: - type: number - type: 'null' title: Diff Values errors: anyOf: - type: integer - type: 'null' title: Errors match_ratio: anyOf: - type: number - type: 'null' title: Match Ratio rows_added: anyOf: - type: integer - type: 'null' title: Rows Added rows_removed: anyOf: - type: integer - type: 'null' title: Rows Removed sampled: anyOf: - type: boolean - type: 'null' title: Sampled table_a_row_count: anyOf: - type: integer - type: 'null' title: Table A Row Count table_b_row_count: anyOf: - type: integer - type: 'null' title: Table B Row Count version: title: Version type: string required: - version title: DiffStats type: object ColumnTolerance: properties: column_name: title: Column Name type: string tolerance_mode: $ref: '#/components/schemas/ToleranceModeEnum' tolerance_value: title: Tolerance Value type: number required: - column_name - tolerance_value - tolerance_mode title: ColumnTolerance type: object CSVFileOptions: properties: delimiter: anyOf: - type: string - type: 'null' title: Delimiter file_type: const: csv default: csv title: File Type type: string skip_head_rows: anyOf: - type: integer - type: 'null' title: Skip Head Rows skip_tail_rows: anyOf: - type: integer - type: 'null' title: Skip Tail Rows title: CSVFileOptions type: object ExcelFileOptions: properties: file_type: const: excel default: excel title: File Type type: string sheet: anyOf: - type: string - type: 'null' title: Sheet skip_head_rows: anyOf: - type: integer - type: 'null' title: Skip Head Rows skip_tail_rows: anyOf: - type: integer - type: 'null' title: Skip Tail Rows title: ExcelFileOptions type: object ParquetFileOptions: properties: file_type: const: parquet default: parquet title: File Type type: string title: ParquetFileOptions type: object DiffKind: enum: - in_db - cross_db title: DiffKind type: string QueryError: properties: error_type: title: Error Type type: string error_value: title: Error Value type: string required: - error_type - error_value title: QueryError type: object MonitorRunState: enum: - ok - alert - error - learning - checking - created - skipped - cancelled title: MonitorRunState type: string JobSource: enum: - interactive - demo_signup - manual - api - ci - schedule - auto title: JobSource type: string JobStatus: enum: - needs_confirmation - needs_authentication - waiting - processing - done - failed - cancelled title: JobStatus type: string TableModifiers: enum: - case_insensitive_strings title: TableModifiers type: string TimeAggregateEnum: enum: - minute - hour - day - week - month - year title: TimeAggregateEnum type: string ToleranceModeEnum: enum: - absolute - relative title: ToleranceModeEnum type: string ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/data-sources/get-a-data-source-summary.md > ## Documentation Index > Fetch the complete documentation index at: 
---

# Source: https://docs.datafold.com/api-reference/data-sources/get-a-data-source-summary.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.datafold.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Get a data source summary

## OpenAPI

````yaml get /api/v1/data_sources/{data_source_id}/summary
openapi: 3.1.0
info:
  contact:
    email: support@datafold.com
    name: API Support
  description: >-
    The Datafold API reference is a guide to our available endpoints and
    authentication methods. If you're just getting started with Datafold, we
    recommend first checking out our [documentation](https://docs.datafold.com).

    :::info

    To use the Datafold API, you should first create a Datafold API Key, which
    should be stored as a local environment variable named DATAFOLD_API_KEY.
    This can be set in your Datafold Cloud's Settings under the Account page.

    :::
  title: Datafold API
  version: latest
servers:
  - description: Default server
    url: https://app.datafold.com
security:
  - ApiKeyAuth: []
paths:
  /api/v1/data_sources/{data_source_id}/summary:
    get:
      tags:
        - Data sources
      summary: Get a data source summary
      operationId: get_data_source_summary_api_v1_data_sources__data_source_id__summary_get
      parameters:
        - in: path
          name: data_source_id
          required: true
          schema:
            title: Data source id
            type: integer
      responses:
        '200':
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/ApiDataSourceSummary'
          description: Successful Response
        '422':
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/HTTPValidationError'
          description: Validation Error
components:
  schemas:
    ApiDataSourceSummary:
      description: Used in OSS data-diff with non-admin privileges to get a DS overview.
      properties:
        id:
          title: Id
          type: integer
        name:
          title: Name
          type: string
        type:
          title: Type
          type: string
      required:
        - id
        - name
        - type
      title: ApiDataSourceSummary
      type: object
    HTTPValidationError:
      properties:
        detail:
          items:
            $ref: '#/components/schemas/ValidationError'
          title: Detail
          type: array
      title: HTTPValidationError
      type: object
    ValidationError:
      properties:
        loc:
          items:
            anyOf:
              - type: string
              - type: integer
          title: Location
          type: array
        msg:
          title: Message
          type: string
        type:
          title: Error Type
          type: string
      required:
        - loc
        - msg
        - type
      title: ValidationError
      type: object
  securitySchemes:
    ApiKeyAuth:
      description: Use the 'Authorization' header with the format 'Key '
      in: header
      name: Authorization
      type: apiKey
````
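For a quick sanity check of this endpoint, a minimal client call might look like the sketch below. It is not an official client: it assumes `DATAFOLD_API_KEY` is exported as described in the spec's description, uses the default server `https://app.datafold.com`, sends the key in the `Authorization` header with the `Key` prefix from the `ApiKeyAuth` scheme, and uses a placeholder data source id:

```python
import os

import requests

# Assumptions: DATAFOLD_API_KEY is set in the environment, and 123 is a
# placeholder data_source_id for a data source visible to your account.
API_KEY = os.environ["DATAFOLD_API_KEY"]
BASE_URL = "https://app.datafold.com"
data_source_id = 123

resp = requests.get(
    f"{BASE_URL}/api/v1/data_sources/{data_source_id}/summary",
    headers={"Authorization": f"Key {API_KEY}"},
)
resp.raise_for_status()

summary = resp.json()  # ApiDataSourceSummary: id, name, type
print(summary["id"], summary["name"], summary["type"])
```

A successful response is an `ApiDataSourceSummary` object (`id`, `name`, `type`); an invalid `data_source_id` yields a `422` with an `HTTPValidationError` body.
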
---

# Source: https://docs.datafold.com/api-reference/data-sources/get-a-data-source.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.datafold.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Get a data source

## OpenAPI

````yaml get /api/v1/data_sources/{data_source_id}
openapi: 3.1.0
info:
  contact:
    email: support@datafold.com
    name: API Support
  description: >-
    The Datafold API reference is a guide to our available endpoints and
    authentication methods. If you're just getting started with Datafold, we
    recommend first checking out our [documentation](https://docs.datafold.com).

    :::info

    To use the Datafold API, you should first create a Datafold API Key, which
    should be stored as a local environment variable named DATAFOLD_API_KEY.
    This can be set in your Datafold Cloud's Settings under the Account page.

    :::
  title: Datafold API
  version: latest
servers:
  - description: Default server
    url: https://app.datafold.com
security:
  - ApiKeyAuth: []
paths:
  /api/v1/data_sources/{data_source_id}:
    get:
      tags:
        - Data sources
      summary: Get a data source
      operationId: get_data_source_api_v1_data_sources__data_source_id__get
      parameters:
        - in: path
          name: data_source_id
          required: true
          schema:
            title: Data source id
            type: integer
      responses:
        '200':
          content:
            application/json:
              schema:
                discriminator:
                  mapping:
                    athena: '#/components/schemas/ApiDataSourceAwsAthena'
                    aws_s3: '#/components/schemas/ApiDataSourceS3'
                    azure_synapse: '#/components/schemas/ApiDataSourceAzureSynapse'
                    bigquery: '#/components/schemas/ApiDataSourceBigQuery'
                    databricks: '#/components/schemas/ApiDataSourceDatabricks'
                    dremio: '#/components/schemas/ApiDataSourceDremio'
                    duckdb: '#/components/schemas/ApiDataSourceDuckDB'
                    files_azure_datalake: '#/components/schemas/ApiDataSourceAzureDataLake'
                    google_cloud_storage: '#/components/schemas/ApiDataSourceGCS'
                    mariadb: '#/components/schemas/ApiDataSourceMariaDB'
                    microsoft_fabric: '#/components/schemas/ApiDataSourceMicrosoftFabric'
                    mongodb: '#/components/schemas/ApiDataSourceMongoDB'
                    mssql: '#/components/schemas/ApiDataSourceMSSQL'
                    mysql: '#/components/schemas/ApiDataSourceMySQL'
                    netezza: '#/components/schemas/ApiDataSourceNetezza'
                    oracle: '#/components/schemas/ApiDataSourceOracle'
                    pg: '#/components/schemas/ApiDataSourcePostgres'
                    postgres_aurora: '#/components/schemas/ApiDataSourcePostgresAurora'
                    postgres_aws_rds: '#/components/schemas/ApiDataSourcePostgresRds'
                    redshift: '#/components/schemas/ApiDataSourceRedshift'
                    sap_hana: '#/components/schemas/ApiDataSourceSapHana'
                    snowflake: '#/components/schemas/ApiDataSourceSnowflake'
                    starburst: '#/components/schemas/ApiDataSourceStarburst'
                    teradata: '#/components/schemas/ApiDataSourceTeradata'
                    trino: '#/components/schemas/ApiDataSourceTrino'
                    vertica: '#/components/schemas/ApiDataSourceVertica'
                  propertyName: type
                oneOf:
                  - $ref: '#/components/schemas/ApiDataSourceBigQuery'
                  - $ref: '#/components/schemas/ApiDataSourceDatabricks'
                  - $ref: '#/components/schemas/ApiDataSourceDuckDB'
                  - $ref: '#/components/schemas/ApiDataSourceMongoDB'
                  - $ref: '#/components/schemas/ApiDataSourceMySQL'
                  - $ref: '#/components/schemas/ApiDataSourceMariaDB'
                  - $ref: '#/components/schemas/ApiDataSourceMSSQL'
                  - $ref: '#/components/schemas/ApiDataSourceOracle'
                  - $ref: '#/components/schemas/ApiDataSourcePostgres'
                  - $ref: '#/components/schemas/ApiDataSourcePostgresAurora'
                  - $ref: '#/components/schemas/ApiDataSourcePostgresRds'
                  - $ref: '#/components/schemas/ApiDataSourceRedshift'
                  - $ref: '#/components/schemas/ApiDataSourceTeradata'
                  - $ref: '#/components/schemas/ApiDataSourceSapHana'
                  - $ref: '#/components/schemas/ApiDataSourceAwsAthena'
                  - $ref: '#/components/schemas/ApiDataSourceSnowflake'
                  - $ref: '#/components/schemas/ApiDataSourceDremio'
                  - $ref: '#/components/schemas/ApiDataSourceStarburst'
                  - $ref: '#/components/schemas/ApiDataSourceNetezza'
                  - $ref: '#/components/schemas/ApiDataSourceAzureDataLake'
                  - $ref: '#/components/schemas/ApiDataSourceGCS'
                  - $ref: '#/components/schemas/ApiDataSourceS3'
                  - $ref: '#/components/schemas/ApiDataSourceAzureSynapse'
                  - $ref: '#/components/schemas/ApiDataSourceMicrosoftFabric'
                  - $ref: '#/components/schemas/ApiDataSourceVertica'
                  - $ref: '#/components/schemas/ApiDataSourceTrino'
                title: >-
                  Response Get Data Source Api V1 Data Sources Data Source Id Get
          description: Successful Response
        '422':
          content:
            application/json:
              schema:
                $ref: '#/components/schemas/HTTPValidationError'
          description:
Validation Error components: schemas: ApiDataSourceBigQuery: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/BigQueryConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: bigquery title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceBigQuery type: object ApiDataSourceDatabricks: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule 
max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/DatabricksConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: databricks title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceDatabricks type: object ApiDataSourceDuckDB: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/DuckDBConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: duckdb title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false 
title: View Only required: - name - type title: ApiDataSourceDuckDB type: object ApiDataSourceMongoDB: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/MongoDBConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: mongodb title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceMongoDB type: object ApiDataSourceMySQL: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 
'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/MySQLConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: mysql title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceMySQL type: object ApiDataSourceMariaDB: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/MariaDBConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: mariadb title: Type type: string view_only: anyOf: - type: boolean - type: 'null' 
default: false title: View Only required: - name - type title: ApiDataSourceMariaDB type: object ApiDataSourceMSSQL: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/MSSQLConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: mssql title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceMSSQL type: object ApiDataSourceOracle: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: 
string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/OracleConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: oracle title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceOracle type: object ApiDataSourcePostgres: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/PostgreSQLConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: pg title: Type type: string view_only: anyOf: - type: 
boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourcePostgres type: object ApiDataSourcePostgresAurora: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/PostgreSQLAuroraConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: postgres_aurora title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourcePostgresAurora type: object ApiDataSourcePostgresRds: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: 
'#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/PostgreSQLAuroraConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: postgres_aws_rds title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourcePostgresRds type: object ApiDataSourceRedshift: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/RedshiftConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - 
type: string - type: 'null' title: Temp Schema type: const: redshift title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceRedshift type: object ApiDataSourceTeradata: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/TeradataConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: teradata title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceTeradata type: object ApiDataSourceSapHana: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' 
default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/SapHanaConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: sap_hana title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceSapHana type: object ApiDataSourceAwsAthena: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/AwsAthenaConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 
'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: athena title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceAwsAthena type: object ApiDataSourceSnowflake: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/SnowflakeConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: snowflake title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceSnowflake type: object ApiDataSourceDremio: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id 
is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/DremioConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: dremio title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceDremio type: object ApiDataSourceStarburst: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/StarburstConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret 
Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: starburst title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceStarburst type: object ApiDataSourceNetezza: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/NetezzaConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: netezza title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceNetezza type: object ApiDataSourceAzureDataLake: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - 
type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/AzureDataLakeConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: files_azure_datalake title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceAzureDataLake type: object ApiDataSourceGCS: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/GCSConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S 
secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: google_cloud_storage title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceGCS type: object ApiDataSourceS3: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/AWSS3Config' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: aws_s3 title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceS3 type: object ApiDataSourceAzureSynapse: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - 
type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/MSSQLConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: azure_synapse title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceAzureSynapse type: object ApiDataSourceMicrosoftFabric: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/MicrosoftFabricConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - 
type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: microsoft_fabric title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceMicrosoftFabric type: object ApiDataSourceVertica: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/VerticaConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: vertica title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceVertica type: object ApiDataSourceTrino: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: 
object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/TrinoConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: trino title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceTrino type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ApiDataSourceTestStatus: properties: results: items: $ref: '#/components/schemas/TestResultStep' title: Results type: array tested_at: format: date-time title: Tested At type: string required: - tested_at - results title: ApiDataSourceTestStatus type: object BigQueryConfig: properties: extraProjectsToIndex: anyOf: - type: string - type: 'null' examples: - |- project1 project2 section: config title: List of extra projects to index (one per line) widget: multiline jsonKeyFile: format: password section: basic title: JSON Key File type: string writeOnly: true jsonOAuthKeyFile: anyOf: - format: password type: string writeOnly: true - type: 'null' section: basic title: JSON OAuth Key File location: default: US examples: - US section: basic title: Processing Location type: string projectId: section: basic title: Project ID type: string totalMBytesProcessedLimit: anyOf: - type: integer - type: 'null' section: config title: Scanned Data Limit (MB) useStandardSql: default: true section: config title: Use Standard SQL type: boolean userDefinedFunctionResourceUri: anyOf: - type: string - type: 'null' examples: - gs://bucket/date_utils.js section: config title: UDF Source URIs required: - projectId - jsonKeyFile title: BigQueryConfig type: object DatabricksConfig: properties: database: anyOf: - type: string - type: 'null' title: Database host: maxLength: 128 title: Host type: string http_password: format: password title: Access Token type: string writeOnly: true http_path: default: '' title: HTTP Path type: string oauth_dwh_client_id: anyOf: - type: string - type: 'null' title: Oauth Dwh Client Id oauth_dwh_client_secret: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Oauth Dwh Client Secret 
required: - host - http_password title: DatabricksConfig type: object DuckDBConfig: properties: {} title: DuckDBConfig type: object MongoDBConfig: properties: auth_source: anyOf: - type: string - type: 'null' default: admin title: Auth Source connect_timeout_ms: default: 60000 title: Connect Timeout Ms type: integer database: title: Database type: string host: maxLength: 128 title: Host type: string password: format: password title: Password type: string writeOnly: true port: default: 27017 title: Port type: integer server_selection_timeout_ms: default: 60000 title: Server Selection Timeout Ms type: integer socket_timeout_ms: default: 300000 title: Socket Timeout Ms type: integer username: title: Username type: string required: - database - username - password - host title: MongoDBConfig type: object MySQLConfig: properties: db: title: Database name type: string host: maxLength: 128 title: Host type: string password: format: password title: Password type: string writeOnly: true port: default: 3306 title: Port type: integer user: title: User type: string required: - host - user - password - db title: MySQLConfig type: object MariaDBConfig: description: |- Configuration for MariaDB connections. MariaDB is MySQL-compatible, so we reuse the MySQL configuration. Default port is 3306, same as MySQL. properties: db: title: Database name type: string host: maxLength: 128 title: Host type: string password: format: password title: Password type: string writeOnly: true port: default: 3306 title: Port type: integer user: title: User type: string required: - host - user - password - db title: MariaDBConfig type: object MSSQLConfig: properties: dbname: anyOf: - type: string - type: 'null' title: Dbname host: maxLength: 128 title: Host type: string password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: default: 1433 title: Port type: integer require_encryption: default: true title: Require Encryption type: boolean session_script: anyOf: - type: string - type: 'null' description: >- The script to execute on connection; e.g. ALTER SESSION SET CONTAINER = ... title: Init script trust_server_certificate: default: false title: Trust Server Certificate type: boolean user: default: DATAFOLD title: User type: string required: - host title: MSSQLConfig type: object OracleConfig: properties: database: anyOf: - type: string - type: 'null' title: Database database_type: anyOf: - enum: - service - sid type: string - type: 'null' title: Database Type ewallet_password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: EWallet password ewallet_pem_file: anyOf: - format: password type: string writeOnly: true - type: 'null' title: EWallet PEM ewallet_pkcs12_file: anyOf: - format: password type: string writeOnly: true - type: 'null' title: EWallet PKCS12 ewallet_type: anyOf: - enum: - x509 - pkcs12 type: string - type: 'null' title: Ewallet Type host: maxLength: 128 title: Host type: string password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: anyOf: - type: integer - type: 'null' title: Port session_script: anyOf: - type: string - type: 'null' description: >- The script to execute on connection; e.g. ALTER SESSION SET CONTAINER = ... title: Init script ssl: default: false title: Ssl type: boolean ssl_server_dn: anyOf: - type: string - type: 'null' description: 'e.g. 
C=US,O=example,CN=db.example.com; default: CN=' title: Server's SSL DN user: default: DATAFOLD title: User type: string required: - host title: OracleConfig type: object PostgreSQLConfig: properties: dbname: title: Database Name type: string host: maxLength: 128 title: Host type: string password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: default: 5432 title: Port type: integer role: anyOf: - type: string - type: 'null' title: Role (case sensitive) rootcert: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Root certificate sslmode: $ref: '#/components/schemas/SslMode' default: prefer title: SSL Mode user: title: User type: string required: - host - user - dbname title: PostgreSQLConfig type: object PostgreSQLAuroraConfig: properties: aws_access_key_id: anyOf: - type: string - type: 'null' title: AWS Access Key aws_cloudwatch_log_group: anyOf: - type: string - type: 'null' title: Cloudwatch Postgres Log Group aws_region: anyOf: - type: string - type: 'null' title: AWS Region aws_secret_access_key: anyOf: - format: password type: string writeOnly: true - type: 'null' title: AWS Secret dbname: title: Database Name type: string host: maxLength: 128 title: Host type: string keep_alive: anyOf: - type: integer - type: 'null' title: Keep Alive timeout in seconds, leave empty to disable password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: default: 5432 title: Port type: integer role: anyOf: - type: string - type: 'null' title: Role (case sensitive) rootcert: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Root certificate sslmode: $ref: '#/components/schemas/SslMode' default: prefer title: SSL Mode user: title: User type: string required: - host - user - dbname title: PostgreSQLAuroraConfig type: object RedshiftConfig: properties: adhoc_query_group: default: default section: config title: Query Group for Adhoc Queries type: string dbname: title: Database Name type: string host: maxLength: 128 title: Host type: string password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: default: 5432 title: Port type: integer role: anyOf: - type: string - type: 'null' title: Role (case sensitive) rootcert: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Root certificate scheduled_query_group: default: default section: config title: Query Group for Scheduled Queries type: string sslmode: $ref: '#/components/schemas/SslMode' default: prefer title: SSL Mode user: title: User type: string required: - host - user - dbname title: RedshiftConfig type: object TeradataConfig: properties: database: title: Database type: string host: maxLength: 128 title: Host type: string password: format: password title: Password type: string writeOnly: true port: anyOf: - type: integer - type: 'null' title: Port user: default: DATAFOLD title: User type: string required: - host - password - database title: TeradataConfig type: object SapHanaConfig: properties: host: maxLength: 128 title: Host type: string password: format: password title: Password type: string writeOnly: true port: default: 443 title: Port type: integer user: default: DATAFOLD title: User type: string required: - host - password title: SapHanaConfig type: object AwsAthenaConfig: properties: aws_access_key_id: title: Aws Access Key Id type: string aws_secret_access_key: format: password title: Aws Secret Access Key type: string writeOnly: true 
catalog: default: awsdatacatalog title: Catalog type: string database: default: default title: Database type: string region: title: Region type: string s3_staging_dir: format: uri minLength: 1 title: S3 Staging Dir type: string required: - aws_access_key_id - aws_secret_access_key - s3_staging_dir - region title: AwsAthenaConfig type: object SnowflakeConfig: properties: account: maxLength: 128 title: Account type: string authMethod: anyOf: - enum: - password - keypair type: string - type: 'null' title: Authmethod data_source_id: anyOf: - type: integer - type: 'null' title: Data Source Id default_db: default: '' examples: - MY_DB title: Default DB (case sensitive) type: string default_schema: default: PUBLIC examples: - PUBLIC section: config title: Default schema (case sensitive) type: string keyPairFile: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Key Pair file (private-key) metadata_database: default: SNOWFLAKE examples: - SNOWFLAKE section: config title: Database containing metadata (usually SNOWFLAKE) type: string oauth_dwh_client_id: anyOf: - type: string - type: 'null' title: Oauth Dwh Client Id oauth_dwh_client_secret: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Oauth Dwh Client Secret password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: anyOf: - type: integer - type: 'null' default: 443 title: Port region: anyOf: - type: string - type: 'null' section: config title: Region role: default: '' examples: - PUBLIC title: Role (case sensitive) type: string sql_variables: anyOf: - type: string - type: 'null' examples: - |- variable_1=10 variable_2=test section: config title: Session variables applied at every connection. widget: multiline user: default: DATAFOLD title: User type: string user_id: anyOf: - type: integer - type: 'null' title: User Id warehouse: default: '' examples: - COMPUTE_WH title: Warehouse (case sensitive) type: string required: - account title: SnowflakeConfig type: object DremioConfig: properties: certcheck: anyOf: - $ref: '#/components/schemas/CertCheck' - type: 'null' default: dremio-cloud title: Certificate check customcert: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Custom certificate host: maxLength: 128 title: Host type: string password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: default: 443 title: Port type: integer project_id: anyOf: - type: string - type: 'null' title: Project id role: anyOf: - type: string - type: 'null' title: Role (case sensitive) tls: default: false title: Encryption type: boolean token: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Token username: anyOf: - type: string - type: 'null' title: User ID (optional) view_temp_schema: anyOf: - type: string - type: 'null' title: Temporary schema for views required: - host title: DremioConfig type: object StarburstConfig: properties: host: maxLength: 128 title: Host type: string password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: default: 443 title: Port type: integer tls: default: true title: Encryption type: boolean token: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Token username: anyOf: - type: string - type: 'null' title: User ID (optional) required: - host title: StarburstConfig type: object NetezzaConfig: properties: database: maxLength: 128 title: Database type: string host: 
maxLength: 128 title: Host type: string password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: default: 5480 title: Port type: integer tls: default: true title: Encryption type: boolean username: anyOf: - type: string - type: 'null' title: User ID (optional) required: - host - database title: NetezzaConfig type: object AzureDataLakeConfig: properties: account_name: title: Account Name type: string client_id: anyOf: - maxLength: 1024 type: string - type: 'null' title: Client Id client_secret: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Client Secret materialize_max_rows: anyOf: - type: integer - type: 'null' title: Materialize Max Rows materialize_path: anyOf: - type: string - type: 'null' title: Materialize Path tenant_id: anyOf: - maxLength: 1024 type: string - type: 'null' title: Tenant Id required: - account_name - tenant_id - client_id title: AzureDataLakeConfig type: object GCSConfig: properties: bucket_name: title: Bucket Name type: string bucket_region: title: Bucket Region type: string jsonKeyFile: format: password section: basic title: JSON Key File type: string writeOnly: true materialize_max_rows: anyOf: - type: integer - type: 'null' title: Materialize Max Rows materialize_path: anyOf: - type: string - type: 'null' title: Materialize Path required: - bucket_name - jsonKeyFile - bucket_region title: GCSConfig type: object AWSS3Config: properties: bucket_name: title: Bucket Name type: string key_id: anyOf: - maxLength: 1024 type: string - type: 'null' title: Key Id materialize_max_rows: anyOf: - type: integer - type: 'null' title: Materialize Max Rows materialize_path: anyOf: - type: string - type: 'null' title: Materialize Path region: title: Region type: string secret: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Secret required: - bucket_name - key_id - region title: AWSS3Config type: object MicrosoftFabricConfig: properties: client_id: description: Microsoft Entra ID Application (Client) ID title: Application (Client) ID type: string client_secret: description: Microsoft Entra ID Application Client Secret format: password title: Client Secret type: string writeOnly: true dbname: title: Dbname type: string host: maxLength: 128 title: Host type: string session_script: anyOf: - type: string - type: 'null' description: >- The script to execute on connection; e.g. ALTER SESSION SET CONTAINER = ... 
title: Init script tenant_id: description: Microsoft Entra ID Tenant ID title: Tenant ID type: string required: - host - dbname - tenant_id - client_id - client_secret title: MicrosoftFabricConfig type: object VerticaConfig: properties: dbname: title: Database Name type: string host: maxLength: 128 title: Host type: string password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: default: 5433 title: Port type: integer role: anyOf: - type: string - type: 'null' title: Role (case sensitive) sslmode: $ref: '#/components/schemas/SslMode' default: prefer title: SSL Mode user: title: User type: string required: - host - user - dbname title: VerticaConfig type: object TrinoConfig: properties: dbname: title: Catalog Name type: string hive_timestamp_precision: anyOf: - enum: - 3 - 6 - 9 type: integer - type: 'null' description: 'Optional: Timestamp precision if using Hive connector' title: Hive Timestamp Precision host: maxLength: 128 title: Host type: string password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: default: 8080 title: Port type: integer ssl_verification: $ref: '#/components/schemas/SSLVerification' default: full title: SSL Verification tls: default: true title: Encryption type: boolean user: title: User type: string required: - host - user - dbname title: TrinoConfig type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object TestResultStep: properties: result: anyOf: - {} - type: 'null' title: Result status: $ref: '#/components/schemas/JobStatus' step: $ref: '#/components/schemas/ConfigurationCheckStep' required: - step - status title: TestResultStep type: object SslMode: description: >- SSL mode for database connections (used by PostgreSQL, Vertica, Redshift, etc.) enum: - prefer - require - verify-ca - verify-full title: SslMode type: string CertCheck: enum: - disable - dremio-cloud - customcert title: CertCheck type: string SSLVerification: enum: - full - none - ca title: SSLVerification type: string JobStatus: enum: - needs_confirmation - needs_authentication - waiting - processing - done - failed - cancelled title: JobStatus type: string ConfigurationCheckStep: enum: - connection - temp_schema - schema_download - lineage_download title: ConfigurationCheckStep type: string securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/data-diffs/get-a-human-readable-summary-of-a-datadiff-comparison.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get a human-readable summary of a DataDiff comparison > Retrieves a comprehensive, human-readable summary of a completed data diff. This endpoint provides the most useful information for understanding diff results: - Overall status and result (success/failure) - Human-readable feedback explaining the differences found - Key statistics (row counts, differences, match rates) - Configuration details (tables compared, primary keys used) - Error messages if the diff failed Use this after a diff completes to get actionable insights. For diffs still running, check status with get_datadiff first. 
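For a quick, concrete starting point, here is a minimal request sketch against the default server, assuming the API key is stored in the `DATAFOLD_API_KEY` environment variable and sent as `Authorization: Key <token>` per the `ApiKeyAuth` scheme; the diff ID is a placeholder:

```python
import os
import requests

DATADIFF_ID = 12345  # placeholder; use the ID of a completed data diff

resp = requests.get(
    f"https://app.datafold.com/api/v1/datadiffs/{DATADIFF_ID}/summary",
    headers={"Authorization": f"Key {os.environ['DATAFOLD_API_KEY']}"},
    timeout=30,
)
resp.raise_for_status()
summary = resp.json()

# Fields defined by the ApiDataDiffSummary schema below: status, result, feedback, diff_stats, ...
print(summary["status"], summary.get("result"))
print(summary.get("feedback"))
```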
## OpenAPI ````yaml openapi-public.json get /api/v1/datadiffs/{datadiff_id}/summary openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/datadiffs/{datadiff_id}/summary: get: tags: - Data diffs summary: Get a human-readable summary of a DataDiff comparison description: >- Retrieves a comprehensive, human-readable summary of a completed data diff. This endpoint provides the most useful information for understanding diff results: - Overall status and result (success/failure) - Human-readable feedback explaining the differences found - Key statistics (row counts, differences, match rates) - Configuration details (tables compared, primary keys used) - Error messages if the diff failed Use this after a diff completes to get actionable insights. For diffs still running, check status with get_datadiff first. operationId: get_datadiff_summary parameters: - in: path name: datadiff_id required: true schema: title: Data diff id type: integer responses: '200': content: application/json: schema: $ref: '#/components/schemas/ApiDataDiffSummary' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: ApiDataDiffSummary: description: Summary of a DataDiff comparison with human-readable feedback. 
properties: algorithm: anyOf: - type: string - type: 'null' title: Algorithm created_at: anyOf: - format: date-time type: string - type: 'null' title: Created At data_source1_id: title: Data Source1 Id type: integer data_source1_name: anyOf: - type: string - type: 'null' title: Data Source1 Name data_source2_id: title: Data Source2 Id type: integer data_source2_name: anyOf: - type: string - type: 'null' title: Data Source2 Name diff_stats: anyOf: - additionalProperties: true type: object - type: 'null' title: Diff Stats error: anyOf: - type: string - type: 'null' title: Error feedback: anyOf: - type: string - type: 'null' title: Feedback finished_at: anyOf: - format: date-time type: string - type: 'null' title: Finished At id: title: Id type: integer include_columns: anyOf: - items: type: string type: array - type: 'null' title: Include Columns pk_columns: anyOf: - items: type: string type: array - type: 'null' title: Pk Columns result: anyOf: - type: string - type: 'null' title: Result result_status: anyOf: - type: string - type: 'null' title: Result Status results_count: default: 0 title: Results Count type: integer sampling_ratio: anyOf: - type: number - type: 'null' title: Sampling Ratio status: $ref: '#/components/schemas/JobStatus' table1: anyOf: - items: type: string type: array - type: 'null' title: Table1 table2: anyOf: - items: type: string type: array - type: 'null' title: Table2 required: - id - status - data_source1_id - data_source2_id title: ApiDataDiffSummary type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object JobStatus: enum: - needs_confirmation - needs_authentication - waiting - processing - done - failed - cancelled title: JobStatus type: string ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/lineagev2/get-all-columns-for-a-specific-table.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get all columns for a specific table > List all columns in a dataset with metadata. Returns the complete schema of a table/view including column names, data types, usage statistics, and popularity scores. Useful for exploring table structure before diving into column-level lineage. ## OpenAPI ````yaml openapi-public.json get /api/v1/lineagev2/table/{table_id}/columns openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. 
::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/lineagev2/table/{table_id}/columns: get: tags: - lineagev2 summary: Get all columns for a specific table description: >- List all columns in a dataset with metadata. Returns the complete schema of a table/view including column names, data types, usage statistics, and popularity scores. Useful for exploring table structure before diving into column-level lineage. operationId: lineagev2_table_columns parameters: - in: path name: table_id required: true schema: title: Table Id type: string responses: '200': content: application/json: schema: $ref: '#/components/schemas/TableColumnsResponse' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: TableColumnsResponse: properties: columns: items: $ref: '#/components/schemas/ColumnInfo' title: Columns type: array required: - columns title: TableColumnsResponse type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ColumnInfo: properties: dataType: anyOf: - type: string - type: 'null' title: Datatype id: title: Id type: string name: title: Name type: string popularity: default: 0 title: Popularity type: number totalQueries30d: anyOf: - type: integer - type: 'null' title: Totalqueries30D required: - id - name title: ColumnInfo type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/bi/get-an-integration.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get an integration > Returns the integration for Mode/Tableau/Looker/HighTouch by its id. ## OpenAPI ````yaml get /api/v1/lineage/bi/{bi_datasource_id}/ openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/lineage/bi/{bi_datasource_id}/: get: tags: - BI summary: Get an integration description: Returns the integration for Mode/Tableau/Looker/HighTouch by its id. 
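        # Example request (editor's sketch, not part of the generated spec; the integration id 42 is a placeholder):
        #   GET https://app.datafold.com/api/v1/lineage/bi/42/
        #   Authorization: Key $DATAFOLD_API_KEY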
operationId: get_integration_api_v1_lineage_bi__bi_datasource_id___get parameters: - in: path name: bi_datasource_id required: true schema: title: BI integration id type: integer responses: '200': content: application/json: schema: {} description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/audit-logs/get-audit-logs.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get Audit Logs ## OpenAPI ````yaml get /api/v1/audit_logs openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. 
::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/audit_logs: get: tags: - Audit Logs summary: Get Audit Logs operationId: get_audit_logs_api_v1_audit_logs_get requestBody: content: application/json: schema: anyOf: - $ref: '#/components/schemas/ApiDownloadAuditLogs' - type: 'null' title: Data responses: '200': content: application/json: schema: $ref: '#/components/schemas/ApiGetAuditLogs' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: ApiDownloadAuditLogs: properties: end_date: anyOf: - format: date-time type: string - type: 'null' title: End Date start_date: anyOf: - format: date-time type: string - type: 'null' title: Start Date title: ApiDownloadAuditLogs type: object ApiGetAuditLogs: properties: logs: items: $ref: '#/components/schemas/AuditLogs' title: Logs type: array required: - logs title: ApiGetAuditLogs type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object AuditLogs: properties: action: anyOf: - type: string - type: 'null' title: Action client_ip: title: Client Ip type: string event_uuid: title: Event Uuid type: string is_support_user: anyOf: - type: boolean - type: 'null' title: Is Support User log_entry: anyOf: - type: string - type: 'null' title: Log Entry object_id: anyOf: - type: integer - type: 'null' title: Object Id object_type: anyOf: - type: string - type: 'null' title: Object Type payload: anyOf: - additionalProperties: true type: object - type: 'null' title: Payload referer: anyOf: - type: string - type: 'null' title: Referer request_type: anyOf: - type: string - type: 'null' title: Request Type source: anyOf: - type: string - type: 'null' title: Source status: anyOf: - type: string - type: 'null' title: Status timestamp: title: Timestamp type: string url: title: Url type: string user_agent: anyOf: - type: string - type: 'null' title: User Agent user_email: anyOf: - type: string - type: 'null' title: User Email user_id: anyOf: - type: integer - type: 'null' title: User Id required: - timestamp - event_uuid - client_ip - url title: AuditLogs type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/lineagev2/get-available-type-filters-for-search.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get available type filters for search > Returns available type filters for narrowing search results (e.g., type:table, type:column). ## OpenAPI ````yaml openapi-public.json get /api/v1/lineagev2/search/types openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). 
:::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/lineagev2/search/types: get: tags: - lineagev2 summary: Get available type filters for search description: >- Returns available type filters for narrowing search results (e.g., type:table, type:column). operationId: lineagev2_search_types parameters: - in: query name: prefix required: false schema: default: '' title: Prefix type: string responses: '200': content: application/json: schema: $ref: '#/components/schemas/TypeSuggestionsResponse' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: TypeSuggestionsResponse: properties: types: items: $ref: '#/components/schemas/TypeSuggestion' title: Types type: array required: - types title: TypeSuggestionsResponse type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object TypeSuggestion: properties: description: title: Description type: string example: title: Example type: string type: title: Type type: string required: - type - description - example title: TypeSuggestion type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/explore/get-column-downstreams.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get column downstreams > Retrieve a list of columns or tables which depend on the given column. ## OpenAPI ````yaml openapi-public.json get /api/v1/explore/db/{data_connection_id}/columns/{column_path}/downstreams openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/explore/db/{data_connection_id}/columns/{column_path}/downstreams: get: tags: - Explore summary: Get column downstreams description: Retrieve a list of columns or tables which depend on the given column. operationId: >- db_column_downstreams_api_v1_explore_db__data_connection_id__columns__column_path__downstreams_get parameters: - description: >- Unique ID for the Data Connection. 
Can be found in the Datafold app under Settings > Integrations > Data Connections. in: path name: data_connection_id required: true schema: description: >- Unique ID for the Data Connection. Can be found in the Datafold app under Settings > Integrations > Data Connections. minimum: 1 title: Data Connection ID type: integer - description: >- Path to the column, e.g. `db.schema.table.column`. The path is case sensitive. If components of the path contain periods, they must be quoted: `db.my_schema."www.mysite.com visits"."visit.id"`. in: path name: column_path required: true schema: description: >- Path to the column, e.g. `db.schema.table.column`. The path is case sensitive. If components of the path contain periods, they must be quoted: `db.my_schema."www.mysite.com visits"."visit.id"`. title: Table Column Path type: string - description: Maximum depth of the lineage to retrieve. in: query name: max_depth required: false schema: default: 10 description: Maximum depth of the lineage to retrieve. exclusiveMaximum: 100 minimum: 1 title: Max depth type: integer - description: Include Tables in the lineage calculation and in the output. in: query name: include_tabular_nodes required: false schema: default: true description: Include Tables in the lineage calculation and in the output. title: Include tabular nodes type: boolean responses: '200': content: application/json: schema: items: anyOf: - $ref: '#/components/schemas/Column' - $ref: '#/components/schemas/Table' title: >- Response Db Column Downstreams Api V1 Explore Db Data Connection Id Columns Column Path Downstreams Get type: array description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: Column: description: Database table column. properties: name: title: Name type: string table: $ref: '#/components/schemas/datafold__lineage__api__db__TableReference' type: const: Column default: Column title: Type type: string required: - name - table title: Column type: object Table: description: Database table. properties: columns: items: $ref: '#/components/schemas/ColumnReference' title: Columns type: array name: title: Name type: string path: items: type: string title: Table path type: array type: const: Table default: Table title: Type type: string required: - name - columns - path title: Table type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object datafold__lineage__api__db__TableReference: description: Database table reference. properties: name: title: Table name type: string path: items: type: string title: Table path type: array required: - name - path title: TableReference type: object ColumnReference: description: Database table column reference. 
properties: name: title: Column name type: string required: - name title: ColumnReference type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/lineagev2/get-column-level-lineage-field-level-data-flow.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get column-level lineage (field-level data flow) > Get the lineage graph for a specific column. Returns upstream source columns (where this column's data originates) and downstream dependent columns (where this column's data flows to). Provides fine-grained lineage tracking at the field level. Use this for precise impact analysis, data quality root cause analysis, and understanding transformations applied to specific fields. ## OpenAPI ````yaml openapi-public.json get /api/v1/lineagev2/column-lineage/{column_id} openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/lineagev2/column-lineage/{column_id}: get: tags: - lineagev2 summary: Get column-level lineage (field-level data flow) description: >- Get the lineage graph for a specific column. Returns upstream source columns (where this column's data originates) and downstream dependent columns (where this column's data flows to). Provides fine-grained lineage tracking at the field level. Use this for precise impact analysis, data quality root cause analysis, and understanding transformations applied to specific fields. 
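        # Example request (editor's sketch, not part of the generated spec; the column id is a placeholder path):
        #   GET https://app.datafold.com/api/v1/lineagev2/column-lineage/analytics.fact_orders.customer_id?direction=both
        #   Authorization: Key $DATAFOLD_API_KEY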
operationId: lineagev2_column_lineage parameters: - in: path name: column_id required: true schema: title: Column Id type: string - in: query name: direction required: false schema: default: both title: Direction type: string - in: query name: depth required: false schema: anyOf: - type: integer - type: 'null' title: Depth - in: query name: debug required: false schema: default: false title: Debug type: boolean responses: '200': content: application/json: schema: $ref: '#/components/schemas/ColumnLineageResponse' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: ColumnLineageResponse: properties: column: $ref: '#/components/schemas/ColumnNode' downstream: items: $ref: '#/components/schemas/ColumnNode' title: Downstream type: array edges: items: $ref: '#/components/schemas/LineageEdge' title: Edges type: array queries: default: [] items: $ref: '#/components/schemas/CypherQueryInfo' title: Queries type: array upstream: items: $ref: '#/components/schemas/ColumnNode' title: Upstream type: array required: - column - upstream - downstream - edges title: ColumnLineageResponse type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ColumnNode: properties: assetType: title: Assettype type: string definitionSql: anyOf: - type: string - type: 'null' title: Definitionsql depth: default: 0 title: Depth type: integer expression: anyOf: - type: string - type: 'null' title: Expression id: title: Id type: string isSource: anyOf: - type: boolean - type: 'null' title: Issource name: title: Name type: string popularity: default: 0 title: Popularity type: number semanticModel: anyOf: - type: string - type: 'null' title: Semanticmodel statementType: anyOf: - type: string - type: 'null' title: Statementtype tableId: title: Tableid type: string tableName: title: Tablename type: string totalQueries30d: anyOf: - type: integer - type: 'null' title: Totalqueries30D transformType: anyOf: - type: string - type: 'null' title: Transformtype workspace: anyOf: - type: string - type: 'null' title: Workspace required: - id - name - tableId - tableName - assetType title: ColumnNode type: object LineageEdge: properties: source: title: Source type: string target: title: Target type: string required: - source - target title: LineageEdge type: object CypherQueryInfo: properties: name: title: Name type: string params: additionalProperties: true title: Params type: object query: title: Query type: string required: - name - query - params title: CypherQueryInfo type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/lineagev2/get-column-level-lineage-for-a-dataset.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get column-level lineage for a dataset > Get column-level lineage for a dataset (table, PowerBI visual, tile, etc.). 
For PowerBI visuals/tiles: shows columns they USES and their DERIVED_FROM lineage. For regular tables: shows columns that BELONGS_TO the table and their DERIVED_FROM lineage. This endpoint is particularly useful for PowerBI assets that use columns from multiple tables. ## OpenAPI ````yaml openapi-public.json get /api/v1/lineagev2/dataset-column-lineage/{dataset_id} openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/lineagev2/dataset-column-lineage/{dataset_id}: get: tags: - lineagev2 summary: Get column-level lineage for a dataset description: >- Get column-level lineage for a dataset (table, PowerBI visual, tile, etc.). For PowerBI visuals/tiles: shows columns they USES and their DERIVED_FROM lineage. For regular tables: shows columns that BELONGS_TO the table and their DERIVED_FROM lineage. This endpoint is particularly useful for PowerBI assets that use columns from multiple tables. operationId: lineagev2_dataset_column_lineage parameters: - in: path name: dataset_id required: true schema: title: Dataset Id type: string - in: query name: direction required: false schema: default: upstream title: Direction type: string - in: query name: depth required: false schema: anyOf: - type: integer - type: 'null' title: Depth - in: query name: debug required: false schema: default: false title: Debug type: boolean responses: '200': content: application/json: schema: $ref: '#/components/schemas/DatasetColumnLineageResponse' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: DatasetColumnLineageResponse: properties: columns: items: $ref: '#/components/schemas/ColumnNodeExtended' title: Columns type: array dataset: $ref: '#/components/schemas/DatasetInfo' downstream: items: $ref: '#/components/schemas/ColumnNodeExtended' title: Downstream type: array edges: items: $ref: '#/components/schemas/LineageEdge' title: Edges type: array queries: default: [] items: $ref: '#/components/schemas/CypherQueryInfo' title: Queries type: array upstream: items: $ref: '#/components/schemas/ColumnNodeExtended' title: Upstream type: array required: - dataset - columns - upstream - downstream - edges title: DatasetColumnLineageResponse type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ColumnNodeExtended: properties: assetType: title: Assettype type: string definitionSql: anyOf: - type: string - type: 'null' title: Definitionsql depth: default: 0 title: Depth type: integer expression: anyOf: - type: string - type: 'null' title: Expression id: title: Id type: string isSelected: anyOf: - type: boolean - type: 'null' title: Isselected isSource: anyOf: - type: boolean - type: 'null' title: Issource name: title: Name type: string popularity: default: 0 
title: Popularity type: number semanticModel: anyOf: - type: string - type: 'null' title: Semanticmodel statementType: anyOf: - type: string - type: 'null' title: Statementtype tableId: title: Tableid type: string tableName: title: Tablename type: string totalQueries30d: anyOf: - type: integer - type: 'null' title: Totalqueries30D transformType: anyOf: - type: string - type: 'null' title: Transformtype workspace: anyOf: - type: string - type: 'null' title: Workspace required: - id - name - tableId - tableName - assetType title: ColumnNodeExtended type: object DatasetInfo: properties: assetType: title: Assettype type: string id: title: Id type: string name: title: Name type: string required: - id - name - assetType title: DatasetInfo type: object LineageEdge: properties: source: title: Source type: string target: title: Target type: string required: - source - target title: LineageEdge type: object CypherQueryInfo: properties: name: title: Name type: string params: additionalProperties: true title: Params type: object query: title: Query type: string required: - name - query - params title: CypherQueryInfo type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/lineagev2/get-column-lineage.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get Column Lineage > Get column-level lineage. Args: column_id: Full column identifier (format: database.schema.table.column or similar path) direction: Lineage direction - "upstream", "downstream", or "both" (default: "both") depth: Maximum traversal depth (default: configured system depth, typically 3-5 hops) Returns: ColumnLineageResponse containing: - column: The requested column with table context and metadata - upstream: List of source columns this column derives from - downstream: List of dependent columns derived from this column - edges: DERIVED_FROM relationships between all returned columns Example: - Get full column lineage: column_id="analytics.fact_orders.customer_id", direction="both" - Trace column origin: column_id="analytics.dim_customer.email", direction="upstream" - Find column usage: column_id="raw.users.user_id", direction="downstream", depth=3 Note: depth parameter is interpolated into Cypher query using f-string because Cypher does not support parameterized variable-length path patterns (*1..{depth}). Input is validated as int by FastAPI. ## OpenAPI ````yaml openapi-public.json get /api/internal/lineagev2/column-lineage/{column_id} openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. 
::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/internal/lineagev2/column-lineage/{column_id}: get: tags: - lineagev2 summary: Get Column Lineage description: >- Get column-level lineage. Args: column_id: Full column identifier (format: database.schema.table.column or similar path) direction: Lineage direction - "upstream", "downstream", or "both" (default: "both") depth: Maximum traversal depth (default: configured system depth, typically 3-5 hops) Returns: ColumnLineageResponse containing: - column: The requested column with table context and metadata - upstream: List of source columns this column derives from - downstream: List of dependent columns derived from this column - edges: DERIVED_FROM relationships between all returned columns Example: - Get full column lineage: column_id="analytics.fact_orders.customer_id", direction="both" - Trace column origin: column_id="analytics.dim_customer.email", direction="upstream" - Find column usage: column_id="raw.users.user_id", direction="downstream", depth=3 Note: depth parameter is interpolated into Cypher query using f-string because Cypher does not support parameterized variable-length path patterns (*1..{depth}). Input is validated as int by FastAPI. operationId: get_column_lineage_api_internal_lineagev2_column_lineage__column_id__get parameters: - in: path name: column_id required: true schema: title: Column Id type: string - in: query name: direction required: false schema: default: both title: Direction type: string - in: query name: depth required: false schema: anyOf: - type: integer - type: 'null' title: Depth - in: query name: debug required: false schema: default: false title: Debug type: boolean responses: '200': content: application/json: schema: $ref: '#/components/schemas/ColumnLineageResponse' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: ColumnLineageResponse: properties: column: $ref: '#/components/schemas/ColumnNode' downstream: items: $ref: '#/components/schemas/ColumnNode' title: Downstream type: array edges: items: $ref: '#/components/schemas/LineageEdge' title: Edges type: array queries: default: [] items: $ref: '#/components/schemas/CypherQueryInfo' title: Queries type: array upstream: items: $ref: '#/components/schemas/ColumnNode' title: Upstream type: array required: - column - upstream - downstream - edges title: ColumnLineageResponse type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ColumnNode: properties: assetType: title: Assettype type: string definitionSql: anyOf: - type: string - type: 'null' title: Definitionsql depth: default: 0 title: Depth type: integer expression: anyOf: - type: string - type: 'null' title: Expression id: title: Id type: string isSource: anyOf: - type: boolean - type: 'null' title: Issource name: title: Name type: string popularity: default: 0 title: Popularity type: number semanticModel: anyOf: - type: string - type: 'null' title: Semanticmodel statementType: anyOf: - type: string - type: 'null' title: Statementtype tableId: title: Tableid type: string tableName: title: Tablename type: string totalQueries30d: anyOf: - type: integer - type: 'null' title: Totalqueries30D transformType: anyOf: - type: string - type: 'null' title: Transformtype 
workspace: anyOf: - type: string - type: 'null' title: Workspace required: - id - name - tableId - tableName - assetType title: ColumnNode type: object LineageEdge: properties: source: title: Source type: string target: title: Target type: string required: - source - target title: LineageEdge type: object CypherQueryInfo: properties: name: title: Name type: string params: additionalProperties: true title: Params type: object query: title: Query type: string required: - name - query - params title: CypherQueryInfo type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/explore/get-column-upstreams.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get column upstreams > Retrieve a list of columns or tables which the given column depends on. ## OpenAPI ````yaml openapi-public.json get /api/v1/explore/db/{data_connection_id}/columns/{column_path}/upstreams openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/explore/db/{data_connection_id}/columns/{column_path}/upstreams: get: tags: - Explore summary: Get column upstreams description: Retrieve a list of columns or tables which the given column depends on. operationId: >- db_column_upstreams_api_v1_explore_db__data_connection_id__columns__column_path__upstreams_get parameters: - description: >- Unique ID for the Data Connection. Can be found in the Datafold app under Settings > Integrations > Data Connections. in: path name: data_connection_id required: true schema: description: >- Unique ID for the Data Connection. Can be found in the Datafold app under Settings > Integrations > Data Connections. minimum: 1 title: Data Connection ID type: integer - description: >- Path to the column, e.g. `db.schema.table.column`. The path is case sensitive. If components of the path contain periods, they must be quoted: `db.my_schema."www.mysite.com visits"."visit.id"`. in: path name: column_path required: true schema: description: >- Path to the column, e.g. `db.schema.table.column`. The path is case sensitive. If components of the path contain periods, they must be quoted: `db.my_schema."www.mysite.com visits"."visit.id"`. title: Table Column Path type: string - description: Maximum depth of the lineage to retrieve. in: query name: max_depth required: false schema: default: 10 description: Maximum depth of the lineage to retrieve. 
exclusiveMaximum: 100 minimum: 1 title: Max depth type: integer - description: Include Tables in the lineage calculation and in the output. in: query name: include_tabular_nodes required: false schema: default: true description: Include Tables in the lineage calculation and in the output. title: Include tabular nodes type: boolean responses: '200': content: application/json: schema: items: anyOf: - $ref: '#/components/schemas/Column' - $ref: '#/components/schemas/Table' title: >- Response Db Column Upstreams Api V1 Explore Db Data Connection Id Columns Column Path Upstreams Get type: array description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: Column: description: Database table column. properties: name: title: Name type: string table: $ref: '#/components/schemas/datafold__lineage__api__db__TableReference' type: const: Column default: Column title: Type type: string required: - name - table title: Column type: object Table: description: Database table. properties: columns: items: $ref: '#/components/schemas/ColumnReference' title: Columns type: array name: title: Name type: string path: items: type: string title: Table path type: array type: const: Table default: Table title: Type type: string required: - name - columns - path title: Table type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object datafold__lineage__api__db__TableReference: description: Database table reference. properties: name: title: Table name type: string path: items: type: string title: Table path type: array required: - name - path title: TableReference type: object ColumnReference: description: Database table column reference. properties: name: title: Column name type: string required: - name title: ColumnReference type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/lineagev2/get-config.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get Config > Get client-side configuration values. ## OpenAPI ````yaml openapi-public.json get /api/internal/lineagev2/config openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/internal/lineagev2/config: get: tags: - lineagev2 summary: Get Config description: Get client-side configuration values. 
operationId: get_config_api_internal_lineagev2_config_get responses: '200': content: application/json: schema: $ref: '#/components/schemas/ConfigResponse' description: Successful Response components: schemas: ConfigResponse: properties: lineage: additionalProperties: type: integer title: Lineage type: object required: - lineage title: ConfigResponse type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/data-sources/get-data-source-testing-results.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get data source testing results ## OpenAPI ````yaml get /api/v1/data_sources/test/{job_id} openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/data_sources/test/{job_id}: get: tags: - Data sources summary: Get data source testing results operationId: get_data_source_test_result_api_v1_data_sources_test__job_id__get parameters: - in: path name: job_id required: true schema: title: Data source testing task id type: integer responses: '200': content: application/json: schema: $ref: '#/components/schemas/AsyncDataSourceTestResults' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: AsyncDataSourceTestResults: properties: id: title: Id type: integer results: items: $ref: '#/components/schemas/TestResultStep' title: Results type: array status: $ref: '#/components/schemas/JobStatus' required: - id - status - results title: AsyncDataSourceTestResults type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object TestResultStep: properties: result: anyOf: - {} - type: 'null' title: Result status: $ref: '#/components/schemas/JobStatus' step: $ref: '#/components/schemas/ConfigurationCheckStep' required: - step - status title: TestResultStep type: object JobStatus: enum: - needs_confirmation - needs_authentication - waiting - processing - done - failed - cancelled title: JobStatus type: string ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object ConfigurationCheckStep: enum: - connection - temp_schema - schema_download - lineage_download title: ConfigurationCheckStep type: string securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: 
https://docs.datafold.com/api-reference/lineagev2/get-dataset-column-lineage.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get Dataset Column Lineage > Get column-level lineage for a dataset. ## OpenAPI ````yaml openapi-public.json get /api/internal/lineagev2/dataset-column-lineage/{dataset_id} openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/internal/lineagev2/dataset-column-lineage/{dataset_id}: get: tags: - lineagev2 summary: Get Dataset Column Lineage description: Get column-level lineage for a dataset. operationId: >- get_dataset_column_lineage_api_internal_lineagev2_dataset_column_lineage__dataset_id__get parameters: - in: path name: dataset_id required: true schema: title: Dataset Id type: string - in: query name: direction required: false schema: default: upstream title: Direction type: string - in: query name: depth required: false schema: anyOf: - type: integer - type: 'null' title: Depth - in: query name: debug required: false schema: default: false title: Debug type: boolean responses: '200': content: application/json: schema: $ref: '#/components/schemas/DatasetColumnLineageResponse' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: DatasetColumnLineageResponse: properties: columns: items: $ref: '#/components/schemas/ColumnNodeExtended' title: Columns type: array dataset: $ref: '#/components/schemas/DatasetInfo' downstream: items: $ref: '#/components/schemas/ColumnNodeExtended' title: Downstream type: array edges: items: $ref: '#/components/schemas/LineageEdge' title: Edges type: array queries: default: [] items: $ref: '#/components/schemas/CypherQueryInfo' title: Queries type: array upstream: items: $ref: '#/components/schemas/ColumnNodeExtended' title: Upstream type: array required: - dataset - columns - upstream - downstream - edges title: DatasetColumnLineageResponse type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ColumnNodeExtended: properties: assetType: title: Assettype type: string definitionSql: anyOf: - type: string - type: 'null' title: Definitionsql depth: default: 0 title: Depth type: integer expression: anyOf: - type: string - type: 'null' title: Expression id: title: Id type: string isSelected: anyOf: - type: boolean - type: 'null' title: Isselected isSource: anyOf: - type: boolean - type: 'null' title: Issource name: title: Name type: string popularity: default: 0 title: Popularity type: number semanticModel: anyOf: - type: string - type: 'null' title: Semanticmodel statementType: anyOf: - type: string - type: 'null' title: Statementtype tableId: title: 
Tableid type: string tableName: title: Tablename type: string totalQueries30d: anyOf: - type: integer - type: 'null' title: Totalqueries30D transformType: anyOf: - type: string - type: 'null' title: Transformtype workspace: anyOf: - type: string - type: 'null' title: Workspace required: - id - name - tableId - tableName - assetType title: ColumnNodeExtended type: object DatasetInfo: properties: assetType: title: Assettype type: string id: title: Id type: string name: title: Name type: string required: - id - name - assetType title: DatasetInfo type: object LineageEdge: properties: source: title: Source type: string target: title: Target type: string required: - source - target title: LineageEdge type: object CypherQueryInfo: properties: name: title: Name type: string params: additionalProperties: true title: Params type: object query: title: Query type: string required: - name - query - params title: CypherQueryInfo type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/lineagev2/get-lineage-configuration-settings.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get lineage configuration settings > Returns configuration values used by the lineage system. ## OpenAPI ````yaml openapi-public.json get /api/v1/lineagev2/config openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/lineagev2/config: get: tags: - lineagev2 summary: Get lineage configuration settings description: Returns configuration values used by the lineage system. operationId: lineagev2_config responses: '200': content: application/json: schema: $ref: '#/components/schemas/ConfigResponse' description: Successful Response components: schemas: ConfigResponse: properties: lineage: additionalProperties: type: integer title: Lineage type: object required: - lineage title: ConfigResponse type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/lineagev2/get-lineage-for-a-specific-query.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get lineage for a specific query > Returns tables and columns used by a query with lineage relationships. 
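A minimal request sketch for this endpoint (assuming `DATAFOLD_API_KEY` is set in the environment and that the `Authorization` header takes the form `Key <your API key>`; the fingerprint value below is a placeholder):

```python
# Hedged sketch: fetch lineage for one query by its fingerprint.
import os
import requests

BASE_URL = "https://app.datafold.com"
fingerprint = "abc123"  # placeholder; use a real query fingerprint

resp = requests.get(
    f"{BASE_URL}/api/v1/lineagev2/query/{fingerprint}/lineage",
    headers={"Authorization": f"Key {os.environ['DATAFOLD_API_KEY']}"},
)
resp.raise_for_status()
lineage = resp.json()
print("Tables read:", [t["name"] for t in lineage["tablesRead"]])
print("Output columns:", lineage["outputColumns"])
```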
## OpenAPI ````yaml openapi-public.json get /api/v1/lineagev2/query/{fingerprint}/lineage openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/lineagev2/query/{fingerprint}/lineage: get: tags: - lineagev2 summary: Get lineage for a specific query description: Returns tables and columns used by a query with lineage relationships. operationId: lineagev2_query_lineage parameters: - in: path name: fingerprint required: true schema: title: Fingerprint type: string responses: '200': content: application/json: schema: $ref: '#/components/schemas/QueryLineageResponse' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: QueryLineageResponse: properties: columnLineage: items: additionalProperties: type: string type: object title: Columnlineage type: array outputColumns: items: additionalProperties: type: string type: object title: Outputcolumns type: array query: $ref: '#/components/schemas/QueryInfo' tablesRead: items: $ref: >- #/components/schemas/datafold__api__internal__lineagev2__api__TableReference title: Tablesread type: array required: - query - tablesRead - outputColumns - columnLineage title: QueryLineageResponse type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object QueryInfo: properties: avgDurationMs: anyOf: - type: number - type: 'null' title: Avgdurationms executionCount: anyOf: - type: integer - type: 'null' title: Executioncount fingerprint: title: Fingerprint type: string lastExecuted: anyOf: - type: string - type: 'null' title: Lastexecuted normalizedSql: anyOf: - type: string - type: 'null' title: Normalizedsql popularity: anyOf: - type: number - type: 'null' title: Popularity sqlPreview: anyOf: - type: string - type: 'null' title: Sqlpreview statementType: anyOf: - type: string - type: 'null' title: Statementtype uniqueUsers: anyOf: - type: integer - type: 'null' title: Uniqueusers required: - fingerprint title: QueryInfo type: object datafold__api__internal__lineagev2__api__TableReference: properties: assetType: title: Assettype type: string id: title: Id type: string name: title: Name type: string rowCount: anyOf: - type: integer - type: 'null' title: Rowcount required: - id - name - assetType title: TableReference type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: 
https://docs.datafold.com/api-reference/lineagev2/get-lineage-graph-statistics-and-health-metrics.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get lineage graph statistics and health metrics > Get overall statistics about the lineage graph. Returns counts of all major entities in the lineage graph including datasets, columns, relationships, queries, and source files. Useful for understanding the scope and health of the lineage data. Use this to get a quick overview before exploring specific lineage paths. ## OpenAPI ````yaml openapi-public.json get /api/v1/lineagev2/stats openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/lineagev2/stats: get: tags: - lineagev2 summary: Get lineage graph statistics and health metrics description: >- Get overall statistics about the lineage graph. Returns counts of all major entities in the lineage graph including datasets, columns, relationships, queries, and source files. Useful for understanding the scope and health of the lineage data. Use this to get a quick overview before exploring specific lineage paths. operationId: lineagev2_stats responses: '200': content: application/json: schema: $ref: '#/components/schemas/StatsResponse' description: Successful Response components: schemas: StatsResponse: properties: columns: title: Columns type: integer datasets: title: Datasets type: integer queries: title: Queries type: integer relationships: title: Relationships type: integer sourceFiles: title: Sourcefiles type: integer required: - datasets - columns - relationships - queries - sourceFiles title: StatsResponse type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/monitors/get-monitor-run.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get Monitor Run ## OpenAPI ````yaml openapi-public.json get /api/v1/monitors/{id}/runs/{run_id} openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. 
::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/monitors/{id}/runs/{run_id}: get: tags: - Monitors summary: Get Monitor Run operationId: get_monitor_run_api_v1_monitors__id__runs__run_id__get parameters: - description: The unique identifier of the run to retrieve. in: path name: run_id required: true schema: description: The unique identifier of the run to retrieve. title: Run Id type: integer - description: The unique identifier of the monitor associated with the run. in: path name: id required: true schema: description: The unique identifier of the monitor associated with the run. title: Id type: integer responses: '200': content: application/json: schema: $ref: '#/components/schemas/ApiPublicMonitorRunResultOut' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: ApiPublicMonitorRunResultOut: properties: diff_id: anyOf: - type: integer - type: 'null' description: Unique identifier for the associated datadiff. title: Diff Id error: anyOf: - type: string - type: 'null' description: Error message if the run encountered an error. title: Error monitor_id: description: Unique identifier for the associated monitor. title: Monitor Id type: integer run_id: description: Unique identifier for the monitor run result. title: Run Id type: integer started_at: anyOf: - format: date-time type: string - type: 'null' description: Timestamp when the monitor run started. title: Started At state: $ref: '#/components/schemas/MonitorRunState' description: Current state of the monitor run result. warnings: description: List of warning messages generated during the run. items: type: string title: Warnings type: array required: - run_id - monitor_id - state - warnings title: ApiPublicMonitorRunResultOut type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object MonitorRunState: enum: - ok - alert - error - learning - checking - created - skipped - cancelled title: MonitorRunState type: string ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/monitors/get-monitor.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get Monitor ## OpenAPI ````yaml openapi-public.json get /api/v1/monitors/{id} openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. 
::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/monitors/{id}: get: tags: - Monitors summary: Get Monitor operationId: get_monitor_api_v1_monitors__id__get parameters: - description: The unique identifier of the monitor. in: path name: id required: true schema: description: The unique identifier of the monitor. title: Id type: integer responses: '200': content: application/json: schema: $ref: '#/components/schemas/ApiPublicGetMonitorOutFull' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: ApiPublicGetMonitorOutFull: properties: alert: anyOf: - discriminator: mapping: absolute: '#/components/schemas/AbsoluteThreshold' automatic: '#/components/schemas/AnomalyDetectionThreshold' diff: >- #/components/schemas/datafold__api__v1__monitors__DiffAlertCondition percentage: '#/components/schemas/PercentageThreshold' propertyName: type oneOf: - $ref: >- #/components/schemas/datafold__api__v1__monitors__DiffAlertCondition - $ref: '#/components/schemas/AnomalyDetectionThreshold' - $ref: '#/components/schemas/AbsoluteThreshold' - $ref: '#/components/schemas/PercentageThreshold' - type: 'null' description: Condition for triggering alerts based on the data diff. created_at: description: Timestamp when the monitor was created. format: date-time title: Created At type: string dataset: description: Dataset configuration for the monitor. items: $ref: '#/components/schemas/MonitorDataset' title: Dataset type: array description: anyOf: - type: string - type: 'null' description: The description of the monitor. title: Description enabled: description: Indicates whether the monitor is enabled. title: Enabled type: boolean id: description: Unique identifier for the monitor. title: Id type: integer last_alert: anyOf: - format: date-time type: string - type: 'null' description: Timestamp of the last alert. title: Last Alert last_run: anyOf: - format: date-time type: string - type: 'null' description: Timestamp of the last monitor run. title: Last Run modified_at: description: Timestamp when the monitor was last modified. format: date-time title: Modified At type: string monitor_type: anyOf: - enum: - diff - metric - schema - test type: string - type: 'null' description: Type of the monitor. title: Monitor Type name: anyOf: - type: string - type: 'null' description: Name of the monitor. title: Name notifications: description: Notification configuration for the monitor. items: discriminator: mapping: email: '#/components/schemas/EmailNotification' pagerduty: '#/components/schemas/PagerDutyNotification' slack: '#/components/schemas/SlackNotification' teams: '#/components/schemas/TeamsNotification' webhook: '#/components/schemas/WebhookNotification' propertyName: type oneOf: - $ref: '#/components/schemas/EmailNotification' - $ref: '#/components/schemas/PagerDutyNotification' - $ref: '#/components/schemas/WebhookNotification' - $ref: '#/components/schemas/SlackNotification' - $ref: '#/components/schemas/TeamsNotification' title: Notifications type: array schedule: anyOf: - $ref: '#/components/schemas/IntervalSchedule' - $ref: '#/components/schemas/CronSchedule' - $ref: '#/components/schemas/NoneSchedule' description: The schedule at which the monitor runs. state: anyOf: - $ref: '#/components/schemas/MonitorRunState' - type: 'null' description: Current state of the monitor run. 
tags: anyOf: - items: type: string type: array - type: 'null' description: Tags associated with the monitor. title: Tags required: - id - name - monitor_type - created_at - modified_at - enabled - schedule title: ApiPublicGetMonitorOutFull type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object datafold__api__v1__monitors__DiffAlertCondition: properties: different_rows_count: anyOf: - type: integer - type: 'null' description: >- Threshold for the number of different rows allowed between the datasets. title: Different Rows Count different_rows_percent: anyOf: - type: integer - type: 'null' description: >- Threshold for the percentage of different rows allowed between the datasets. title: Different Rows Percent type: const: diff title: Type type: string required: - type title: Diff Conditions type: object AnomalyDetectionThreshold: properties: sensitivity: description: Sensitivity level for anomaly detection, ranging from 0 to 100. maximum: 100 minimum: 0 title: Sensitivity type: integer type: const: automatic title: Type type: string required: - type - sensitivity title: Anomaly Detection type: object AbsoluteThreshold: properties: max: anyOf: - type: number - type: 'null' description: Maximum value for the absolute threshold. title: Max min: anyOf: - type: number - type: 'null' description: Minimum value for the absolute threshold. title: Min type: const: absolute title: Type type: string required: - type title: Absolute type: object PercentageThreshold: properties: decrease: anyOf: - type: number - type: integer - type: 'null' description: Threshold for allowable percentage decrease. title: Decrease increase: anyOf: - type: number - type: integer - type: 'null' description: Threshold for allowable percentage increase. title: Increase type: const: percentage title: Type type: string required: - type title: Percentage type: object MonitorDataset: properties: column: anyOf: - type: string - type: 'null' description: The column of the table. title: Column connection_id: description: The identifier for the data source configuration. title: Connection Id type: integer filter: anyOf: - type: string - type: 'null' description: Filter condition being evaluated. title: Filter metric: anyOf: - type: string - type: 'null' description: The column metric configuration. title: Metric query: anyOf: - type: string - type: 'null' description: The SQL query being evaluated. title: Query table: anyOf: - type: string - type: 'null' description: The name of the table. title: Table required: - connection_id title: MonitorDataset type: object EmailNotification: properties: features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features recipients: description: A list of email addresses to receive the notification. items: type: string title: Recipients type: array type: const: email default: email title: Type type: string required: - recipients title: Email type: object PagerDutyNotification: properties: features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features integration: description: The identifier for the integration. 
title: Integration type: integer type: const: pagerduty default: pagerduty title: Type type: string required: - integration title: PagerDuty type: object WebhookNotification: properties: features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features integration: description: The identifier for the integration. title: Integration type: integer type: const: webhook default: webhook title: Type type: string required: - integration title: Webhook type: object SlackNotification: properties: channel: description: The channel through which the notification will be sent. title: Channel type: string features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features integration: description: The identifier for the integration. title: Integration type: integer mentions: description: A list of mentions to include in the notification. items: type: string title: Mentions type: array type: const: slack default: slack title: Type type: string required: - integration - channel title: Slack type: object TeamsNotification: properties: channel: description: The channel through which the notification will be sent. title: Channel type: string features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features integration: description: The identifier for the integration. title: Integration type: integer mentions: description: A list of mentions names to include in the notification. items: type: string title: Mentions type: array type: const: teams default: teams title: Type type: string required: - integration - channel title: Teams type: object IntervalSchedule: properties: interval: anyOf: - $ref: '#/components/schemas/HourIntervalSchedule' - $ref: '#/components/schemas/DayIntervalSchedule' description: Specifies the scheduling interval. required: - interval title: Interval type: object CronSchedule: properties: cron: description: The cron expression that defines the schedule. title: Cron type: string type: const: crontab default: crontab title: Type type: string required: - cron title: Cron type: object NoneSchedule: properties: type: const: none default: none title: Type type: string title: None type: object MonitorRunState: enum: - ok - alert - error - learning - checking - created - skipped - cancelled title: MonitorRunState type: string ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object DestinationFeatures: enum: - attach_csv - notify_first_triggered_only - disable_recovery_notifications - notify_every_run title: DestinationFeatures type: string HourIntervalSchedule: properties: every: const: hour title: Every type: string type: const: hourly default: hourly title: Type type: string required: - every title: Hour type: object DayIntervalSchedule: properties: every: const: day title: Every type: string hour: anyOf: - type: integer - type: 'null' description: The hour at which the monitor should trigger. 
(0 - 23) title: Hour type: const: daily default: daily title: Type type: string utc_at: anyOf: - format: time type: string - type: 'null' description: The UTC time at which the monitor should trigger. title: Utc At required: - every title: Day type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/lineagev2/get-queries-that-read-from-a-table.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get queries that read from a table > Returns queries that read from this table, ordered by execution count. ## OpenAPI ````yaml openapi-public.json get /api/v1/lineagev2/table/{table_id}/queries openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/lineagev2/table/{table_id}/queries: get: tags: - lineagev2 summary: Get queries that read from a table description: Returns queries that read from this table, ordered by execution count. operationId: lineagev2_table_queries parameters: - in: path name: table_id required: true schema: title: Table Id type: string - in: query name: limit required: false schema: default: 20 title: Limit type: integer responses: '200': content: application/json: schema: $ref: '#/components/schemas/QueriesResponse' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: QueriesResponse: properties: queries: items: $ref: '#/components/schemas/QueryInfo' title: Queries type: array required: - queries title: QueriesResponse type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object QueryInfo: properties: avgDurationMs: anyOf: - type: number - type: 'null' title: Avgdurationms executionCount: anyOf: - type: integer - type: 'null' title: Executioncount fingerprint: title: Fingerprint type: string lastExecuted: anyOf: - type: string - type: 'null' title: Lastexecuted normalizedSql: anyOf: - type: string - type: 'null' title: Normalizedsql popularity: anyOf: - type: number - type: 'null' title: Popularity sqlPreview: anyOf: - type: string - type: 'null' title: Sqlpreview statementType: anyOf: - type: string - type: 'null' title: Statementtype uniqueUsers: anyOf: - type: integer - type: 'null' title: Uniqueusers required: - fingerprint title: QueryInfo type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use 
the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/lineagev2/get-queries.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get Queries > Get top queries by execution count. ## OpenAPI ````yaml openapi-public.json get /api/internal/lineagev2/queries openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/internal/lineagev2/queries: get: tags: - lineagev2 summary: Get Queries description: Get top queries by execution count. operationId: get_queries_api_internal_lineagev2_queries_get parameters: - in: query name: limit required: false schema: default: 100 title: Limit type: integer - in: query name: statement_type required: false schema: anyOf: - type: string - type: 'null' title: Statement Type responses: '200': content: application/json: schema: $ref: '#/components/schemas/QueriesResponse' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: QueriesResponse: properties: queries: items: $ref: '#/components/schemas/QueryInfo' title: Queries type: array required: - queries title: QueriesResponse type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object QueryInfo: properties: avgDurationMs: anyOf: - type: number - type: 'null' title: Avgdurationms executionCount: anyOf: - type: integer - type: 'null' title: Executioncount fingerprint: title: Fingerprint type: string lastExecuted: anyOf: - type: string - type: 'null' title: Lastexecuted normalizedSql: anyOf: - type: string - type: 'null' title: Normalizedsql popularity: anyOf: - type: number - type: 'null' title: Popularity sqlPreview: anyOf: - type: string - type: 'null' title: Sqlpreview statementType: anyOf: - type: string - type: 'null' title: Statementtype uniqueUsers: anyOf: - type: integer - type: 'null' title: Uniqueusers required: - fingerprint title: QueryInfo type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/lineagev2/get-query-lineage-endpoint.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. 
# Get Query Lineage Endpoint > Get tables and columns used by a query. ## OpenAPI ````yaml openapi-public.json get /api/internal/lineagev2/query/{fingerprint}/lineage openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/internal/lineagev2/query/{fingerprint}/lineage: get: tags: - lineagev2 summary: Get Query Lineage Endpoint description: Get tables and columns used by a query. operationId: >- get_query_lineage_endpoint_api_internal_lineagev2_query__fingerprint__lineage_get parameters: - in: path name: fingerprint required: true schema: title: Fingerprint type: string responses: '200': content: application/json: schema: $ref: '#/components/schemas/QueryLineageResponse' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: QueryLineageResponse: properties: columnLineage: items: additionalProperties: type: string type: object title: Columnlineage type: array outputColumns: items: additionalProperties: type: string type: object title: Outputcolumns type: array query: $ref: '#/components/schemas/QueryInfo' tablesRead: items: $ref: >- #/components/schemas/datafold__api__internal__lineagev2__api__TableReference title: Tablesread type: array required: - query - tablesRead - outputColumns - columnLineage title: QueryLineageResponse type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object QueryInfo: properties: avgDurationMs: anyOf: - type: number - type: 'null' title: Avgdurationms executionCount: anyOf: - type: integer - type: 'null' title: Executioncount fingerprint: title: Fingerprint type: string lastExecuted: anyOf: - type: string - type: 'null' title: Lastexecuted normalizedSql: anyOf: - type: string - type: 'null' title: Normalizedsql popularity: anyOf: - type: number - type: 'null' title: Popularity sqlPreview: anyOf: - type: string - type: 'null' title: Sqlpreview statementType: anyOf: - type: string - type: 'null' title: Statementtype uniqueUsers: anyOf: - type: integer - type: 'null' title: Uniqueusers required: - fingerprint title: QueryInfo type: object datafold__api__internal__lineagev2__api__TableReference: properties: assetType: title: Assettype type: string id: title: Id type: string name: title: Name type: string rowCount: anyOf: - type: integer - type: 'null' title: Rowcount required: - id - name - assetType title: TableReference type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` 
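Combined with the "Get Queries" endpoint above, a rough end-to-end sketch might look like the following (assuming `DATAFOLD_API_KEY` is set and the `Authorization` header takes the form `Key <your API key>`; these are internal endpoints, so treat the flow as illustrative rather than guaranteed stable):

```python
# Hedged sketch: pick the top query by execution count, then fetch its lineage.
import os
import requests

BASE_URL = "https://app.datafold.com"
HEADERS = {"Authorization": f"Key {os.environ['DATAFOLD_API_KEY']}"}

# 1. Top queries by execution count (see "Get Queries" above).
queries = requests.get(
    f"{BASE_URL}/api/internal/lineagev2/queries",
    headers=HEADERS,
    params={"limit": 1},
).json()["queries"]

# 2. Tables and columns used by that query.
if queries:
    fingerprint = queries[0]["fingerprint"]
    lineage = requests.get(
        f"{BASE_URL}/api/internal/lineagev2/query/{fingerprint}/lineage",
        headers=HEADERS,
    ).json()
    print("Tables read:", [t["name"] for t in lineage["tablesRead"]])
```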
--- # Source: https://docs.datafold.com/api-reference/lineagev2/get-search-types-endpoint.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get Search Types Endpoint > Get available type filters for search autocomplete. ## OpenAPI ````yaml openapi-public.json get /api/internal/lineagev2/search/types openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/internal/lineagev2/search/types: get: tags: - lineagev2 summary: Get Search Types Endpoint description: Get available type filters for search autocomplete. operationId: get_search_types_endpoint_api_internal_lineagev2_search_types_get parameters: - in: query name: prefix required: false schema: default: '' title: Prefix type: string responses: '200': content: application/json: schema: $ref: '#/components/schemas/TypeSuggestionsResponse' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: TypeSuggestionsResponse: properties: types: items: $ref: '#/components/schemas/TypeSuggestion' title: Types type: array required: - types title: TypeSuggestionsResponse type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object TypeSuggestion: properties: description: title: Description type: string example: title: Example type: string type: title: Type type: string required: - type - description - example title: TypeSuggestion type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/lineagev2/get-stats.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get Stats > Get graph statistics. Returns: StatsResponse containing: - datasets: Total number of tables and views in the graph - columns: Total number of columns tracked - relationships: Total number of lineage edges (DEPENDS_ON + DERIVED_FROM) - queries: Total number of SELECT queries analyzed - sourceFiles: Total number of source SQL/dbt files processed Example response: { "datasets": 1250, "columns": 15680, "relationships": 8932, "queries": 4521, "sourceFiles": 892 } Use this to assess lineage coverage and data quality. 
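Before walking specific lineage paths, a quick coverage check is often useful. A minimal sketch (assuming `DATAFOLD_API_KEY` is set and the `Authorization` header takes the form `Key <your API key>`):

```python
# Hedged sketch: print high-level lineage graph statistics.
import os
import requests

resp = requests.get(
    "https://app.datafold.com/api/internal/lineagev2/stats",
    headers={"Authorization": f"Key {os.environ['DATAFOLD_API_KEY']}"},
)
resp.raise_for_status()
stats = resp.json()
print(
    f"{stats['datasets']} datasets, {stats['columns']} columns, "
    f"{stats['relationships']} lineage edges, {stats['queries']} queries"
)
```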
## OpenAPI ````yaml openapi-public.json get /api/internal/lineagev2/stats openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/internal/lineagev2/stats: get: tags: - lineagev2 summary: Get Stats description: |- Get graph statistics. Returns: StatsResponse containing: - datasets: Total number of tables and views in the graph - columns: Total number of columns tracked - relationships: Total number of lineage edges (DEPENDS_ON + DERIVED_FROM) - queries: Total number of SELECT queries analyzed - sourceFiles: Total number of source SQL/dbt files processed Example response: { "datasets": 1250, "columns": 15680, "relationships": 8932, "queries": 4521, "sourceFiles": 892 } Use this to assess lineage coverage and data quality. operationId: get_stats_api_internal_lineagev2_stats_get responses: '200': content: application/json: schema: $ref: '#/components/schemas/StatsResponse' description: Successful Response components: schemas: StatsResponse: properties: columns: title: Columns type: integer datasets: title: Datasets type: integer queries: title: Queries type: integer relationships: title: Relationships type: integer sourceFiles: title: Sourcefiles type: integer required: - datasets - columns - relationships - queries - sourceFiles title: StatsResponse type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/lineagev2/get-table-columns.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get Table Columns > Get all columns for a table. Args: table_id: Full table identifier (format: database.schema.table or similar path) Returns: TableColumnsResponse containing: - columns: List of all columns in the table with: - id: Unique column identifier - name: Column name - dataType: Column data type (if available) - totalQueries30d: Number of queries using this column in last 30 days - popularity: Relative popularity score (0-100) based on query usage Example: - List table schema: table_id="analytics.fact_orders" - Returns all columns like order_id, customer_id, amount, created_at with their metadata Use this to understand table structure and identify important columns before exploring column-level lineage. ## OpenAPI ````yaml openapi-public.json get /api/internal/lineagev2/table/{table_id}/columns openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). 
:::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/internal/lineagev2/table/{table_id}/columns: get: tags: - lineagev2 summary: Get Table Columns description: >- Get all columns for a table. Args: table_id: Full table identifier (format: database.schema.table or similar path) Returns: TableColumnsResponse containing: - columns: List of all columns in the table with: - id: Unique column identifier - name: Column name - dataType: Column data type (if available) - totalQueries30d: Number of queries using this column in last 30 days - popularity: Relative popularity score (0-100) based on query usage Example: - List table schema: table_id="analytics.fact_orders" - Returns all columns like order_id, customer_id, amount, created_at with their metadata Use this to understand table structure and identify important columns before exploring column-level lineage. operationId: get_table_columns_api_internal_lineagev2_table__table_id__columns_get parameters: - in: path name: table_id required: true schema: title: Table Id type: string responses: '200': content: application/json: schema: $ref: '#/components/schemas/TableColumnsResponse' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: TableColumnsResponse: properties: columns: items: $ref: '#/components/schemas/ColumnInfo' title: Columns type: array required: - columns title: TableColumnsResponse type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ColumnInfo: properties: dataType: anyOf: - type: string - type: 'null' title: Datatype id: title: Id type: string name: title: Name type: string popularity: default: 0 title: Popularity type: number totalQueries30d: anyOf: - type: integer - type: 'null' title: Totalqueries30D required: - id - name title: ColumnInfo type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/explore/get-table-downstreams.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get table downstreams > Retrieve a list of tables which depend on the given table. ## OpenAPI ````yaml openapi-public.json get /api/v1/explore/db/{data_connection_id}/tables/{table_path}/downstreams openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). 
:::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/explore/db/{data_connection_id}/tables/{table_path}/downstreams: get: tags: - Explore summary: Get table downstreams description: Retrieve a list of tables which depend on the given table. operationId: >- db_table_downstreams_api_v1_explore_db__data_connection_id__tables__table_path__downstreams_get parameters: - description: >- Unique ID for the Data Connection. Can be found in the Datafold app under Settings > Integrations > Data Connections. in: path name: data_connection_id required: true schema: description: >- Unique ID for the Data Connection. Can be found in the Datafold app under Settings > Integrations > Data Connections. minimum: 1 title: Data Connection ID type: integer - description: >- Path to the table, e.g. `db.schema.table`. The path is case sensitive. If components of the path contain periods, they must be quoted: `db."my.schema"."www.mysite.com visits"`. in: path name: table_path required: true schema: description: >- Path to the table, e.g. `db.schema.table`. The path is case sensitive. If components of the path contain periods, they must be quoted: `db."my.schema"."www.mysite.com visits"`. title: Table Path type: string - description: Maximum depth of the lineage to retrieve. in: query name: max_depth required: false schema: default: 10 description: Maximum depth of the lineage to retrieve. exclusiveMaximum: 100 minimum: 1 title: Max depth type: integer responses: '200': content: application/json: schema: items: $ref: '#/components/schemas/Table' title: >- Response Db Table Downstreams Api V1 Explore Db Data Connection Id Tables Table Path Downstreams Get type: array description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: Table: description: Database table. properties: columns: items: $ref: '#/components/schemas/ColumnReference' title: Columns type: array name: title: Name type: string path: items: type: string title: Table path type: array type: const: Table default: Table title: Type type: string required: - name - columns - path title: Table type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ColumnReference: description: Database table column reference. properties: name: title: Column name type: string required: - name title: ColumnReference type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/lineagev2/get-table-level-lineage-upstream-and-downstream-dependencies.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. 
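Before moving on, here is a minimal sketch of calling the table downstreams endpoint documented above (`GET /api/v1/explore/db/{data_connection_id}/tables/{table_path}/downstreams`). The connection ID and table path are hypothetical placeholders; the path is case sensitive, and components containing periods must be quoted as described above.

```python
import os
import requests
from urllib.parse import quote

data_connection_id = 1          # hypothetical Data Connection ID (Settings > Integrations > Data Connections)
table_path = "db.schema.table"  # hypothetical table path

resp = requests.get(
    f"https://app.datafold.com/api/v1/explore/db/{data_connection_id}"
    f"/tables/{quote(table_path, safe='')}/downstreams",
    headers={"Authorization": f"Key {os.environ['DATAFOLD_API_KEY']}"},
    params={"max_depth": 5},  # optional; defaults to 10
)
resp.raise_for_status()

# Each item follows the Table schema: name, columns, path
for table in resp.json():
    print(".".join(table["path"]), "-", len(table["columns"]), "columns")
```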
# Get table-level lineage (upstream and downstream dependencies) > Get the lineage graph for a specific dataset (table or view). Returns upstream sources (tables this dataset depends on) and downstream consumers (tables that depend on this dataset), along with dependency edges. Supports configurable traversal depth and direction. Use this to understand data flow and impact analysis at the table level. ## OpenAPI ````yaml openapi-public.json get /api/v1/lineagev2/table-lineage/{table_id} openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/lineagev2/table-lineage/{table_id}: get: tags: - lineagev2 summary: Get table-level lineage (upstream and downstream dependencies) description: >- Get the lineage graph for a specific dataset (table or view). Returns upstream sources (tables this dataset depends on) and downstream consumers (tables that depend on this dataset), along with dependency edges. Supports configurable traversal depth and direction. Use this to understand data flow and impact analysis at the table level. operationId: lineagev2_table_lineage parameters: - in: path name: table_id required: true schema: title: Table Id type: string - in: query name: direction required: false schema: default: both title: Direction type: string - in: query name: depth required: false schema: anyOf: - type: integer - type: 'null' title: Depth - in: query name: debug required: false schema: default: false title: Debug type: boolean responses: '200': content: application/json: schema: $ref: '#/components/schemas/TableLineageResponse' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: TableLineageResponse: properties: dataset: $ref: '#/components/schemas/DatasetNode' downstream: items: $ref: '#/components/schemas/DatasetNode' title: Downstream type: array edges: items: $ref: '#/components/schemas/LineageEdge' title: Edges type: array queries: default: [] items: $ref: '#/components/schemas/CypherQueryInfo' title: Queries type: array upstream: items: $ref: '#/components/schemas/DatasetNode' title: Upstream type: array required: - dataset - upstream - downstream - edges title: TableLineageResponse type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object DatasetNode: properties: assetType: title: Assettype type: string columnCount: anyOf: - type: integer - type: 'null' title: Columncount dashboard: anyOf: - type: string - type: 'null' title: Dashboard definitionSql: anyOf: - type: string - type: 'null' title: Definitionsql depth: default: 0 title: Depth type: integer id: title: Id type: string isSource: anyOf: - type: boolean - type: 'null' title: Issource name: title: Name type: string page: anyOf: - type: string - type: 'null' title: Page 
popularity: default: 0 title: Popularity type: number report: anyOf: - type: string - type: 'null' title: Report rowCount: anyOf: - type: integer - type: 'null' title: Rowcount semanticModel: anyOf: - type: string - type: 'null' title: Semanticmodel statementType: anyOf: - type: string - type: 'null' title: Statementtype totalQueries30d: anyOf: - type: integer - type: 'null' title: Totalqueries30D visualType: anyOf: - type: string - type: 'null' title: Visualtype workspace: anyOf: - type: string - type: 'null' title: Workspace required: - id - name - assetType title: DatasetNode type: object LineageEdge: properties: source: title: Source type: string target: title: Target type: string required: - source - target title: LineageEdge type: object CypherQueryInfo: properties: name: title: Name type: string params: additionalProperties: true title: Params type: object query: title: Query type: string required: - name - query - params title: CypherQueryInfo type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/lineagev2/get-table-lineage.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get Table Lineage > Get upstream/downstream table lineage. Args: table_id: Full table identifier (format: database.schema.table or similar path) direction: Lineage direction - "upstream", "downstream", or "both" (default: "both") depth: Maximum traversal depth (default: configured system depth, typically 3-5 hops) Returns: TableLineageResponse containing: - dataset: The requested table/view with metadata - upstream: List of source tables this dataset depends on - downstream: List of dependent tables that use this dataset - edges: Dependency relationships between all returned datasets Example: - Get full lineage: table_id="analytics.fact_orders", direction="both" - Get only sources: table_id="analytics.fact_orders", direction="upstream", depth=2 - Get only consumers: table_id="raw.customers", direction="downstream" Note: depth parameter is interpolated into Cypher query using f-string because Cypher does not support parameterized variable-length path patterns (*1..{depth}). Input is validated as int by FastAPI. ## OpenAPI ````yaml openapi-public.json get /api/internal/lineagev2/table-lineage/{table_id} openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. 
::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/internal/lineagev2/table-lineage/{table_id}: get: tags: - lineagev2 summary: Get Table Lineage description: >- Get upstream/downstream table lineage. Args: table_id: Full table identifier (format: database.schema.table or similar path) direction: Lineage direction - "upstream", "downstream", or "both" (default: "both") depth: Maximum traversal depth (default: configured system depth, typically 3-5 hops) Returns: TableLineageResponse containing: - dataset: The requested table/view with metadata - upstream: List of source tables this dataset depends on - downstream: List of dependent tables that use this dataset - edges: Dependency relationships between all returned datasets Example: - Get full lineage: table_id="analytics.fact_orders", direction="both" - Get only sources: table_id="analytics.fact_orders", direction="upstream", depth=2 - Get only consumers: table_id="raw.customers", direction="downstream" Note: depth parameter is interpolated into Cypher query using f-string because Cypher does not support parameterized variable-length path patterns (*1..{depth}). Input is validated as int by FastAPI. operationId: get_table_lineage_api_internal_lineagev2_table_lineage__table_id__get parameters: - in: path name: table_id required: true schema: title: Table Id type: string - in: query name: direction required: false schema: default: both title: Direction type: string - in: query name: depth required: false schema: anyOf: - type: integer - type: 'null' title: Depth - in: query name: debug required: false schema: default: false title: Debug type: boolean responses: '200': content: application/json: schema: $ref: '#/components/schemas/TableLineageResponse' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: TableLineageResponse: properties: dataset: $ref: '#/components/schemas/DatasetNode' downstream: items: $ref: '#/components/schemas/DatasetNode' title: Downstream type: array edges: items: $ref: '#/components/schemas/LineageEdge' title: Edges type: array queries: default: [] items: $ref: '#/components/schemas/CypherQueryInfo' title: Queries type: array upstream: items: $ref: '#/components/schemas/DatasetNode' title: Upstream type: array required: - dataset - upstream - downstream - edges title: TableLineageResponse type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object DatasetNode: properties: assetType: title: Assettype type: string columnCount: anyOf: - type: integer - type: 'null' title: Columncount dashboard: anyOf: - type: string - type: 'null' title: Dashboard definitionSql: anyOf: - type: string - type: 'null' title: Definitionsql depth: default: 0 title: Depth type: integer id: title: Id type: string isSource: anyOf: - type: boolean - type: 'null' title: Issource name: title: Name type: string page: anyOf: - type: string - type: 'null' title: Page popularity: default: 0 title: Popularity type: number report: anyOf: - type: string - type: 'null' title: Report rowCount: anyOf: - type: integer - type: 'null' title: Rowcount semanticModel: anyOf: - type: string - type: 'null' title: Semanticmodel statementType: anyOf: - type: string - type: 'null' title: Statementtype totalQueries30d: anyOf: - type: integer - 
type: 'null' title: Totalqueries30D visualType: anyOf: - type: string - type: 'null' title: Visualtype workspace: anyOf: - type: string - type: 'null' title: Workspace required: - id - name - assetType title: DatasetNode type: object LineageEdge: properties: source: title: Source type: string target: title: Target type: string required: - source - target title: LineageEdge type: object CypherQueryInfo: properties: name: title: Name type: string params: additionalProperties: true title: Params type: object query: title: Query type: string required: - name - query - params title: CypherQueryInfo type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/lineagev2/get-table-queries.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get Table Queries > Get queries that read from this table. ## OpenAPI ````yaml openapi-public.json get /api/internal/lineagev2/table/{table_id}/queries openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/internal/lineagev2/table/{table_id}/queries: get: tags: - lineagev2 summary: Get Table Queries description: Get queries that read from this table. 
operationId: get_table_queries_api_internal_lineagev2_table__table_id__queries_get parameters: - in: path name: table_id required: true schema: title: Table Id type: string - in: query name: limit required: false schema: default: 20 title: Limit type: integer responses: '200': content: application/json: schema: $ref: '#/components/schemas/QueriesResponse' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: QueriesResponse: properties: queries: items: $ref: '#/components/schemas/QueryInfo' title: Queries type: array required: - queries title: QueriesResponse type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object QueryInfo: properties: avgDurationMs: anyOf: - type: number - type: 'null' title: Avgdurationms executionCount: anyOf: - type: integer - type: 'null' title: Executioncount fingerprint: title: Fingerprint type: string lastExecuted: anyOf: - type: string - type: 'null' title: Lastexecuted normalizedSql: anyOf: - type: string - type: 'null' title: Normalizedsql popularity: anyOf: - type: number - type: 'null' title: Popularity sqlPreview: anyOf: - type: string - type: 'null' title: Sqlpreview statementType: anyOf: - type: string - type: 'null' title: Statementtype uniqueUsers: anyOf: - type: integer - type: 'null' title: Uniqueusers required: - fingerprint title: QueryInfo type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/explore/get-table-upstreams.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get table upstreams > Retrieve a list of tables which the given table depends on. ## OpenAPI ````yaml openapi-public.json get /api/v1/explore/db/{data_connection_id}/tables/{table_path}/upstreams openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/explore/db/{data_connection_id}/tables/{table_path}/upstreams: get: tags: - Explore summary: Get table upstreams description: Retrieve a list of tables which the given table depends on. operationId: >- db_table_upstreams_api_v1_explore_db__data_connection_id__tables__table_path__upstreams_get parameters: - description: >- Unique ID for the Data Connection. Can be found in the Datafold app under Settings > Integrations > Data Connections. 
in: path name: data_connection_id required: true schema: description: >- Unique ID for the Data Connection. Can be found in the Datafold app under Settings > Integrations > Data Connections. minimum: 1 title: Data Connection ID type: integer - description: >- Path to the table, e.g. `db.schema.table`. The path is case sensitive. If components of the path contain periods, they must be quoted: `db."my.schema"."www.mysite.com visits"`. in: path name: table_path required: true schema: description: >- Path to the table, e.g. `db.schema.table`. The path is case sensitive. If components of the path contain periods, they must be quoted: `db."my.schema"."www.mysite.com visits"`. title: Table Path type: string - description: Maximum depth of the lineage to retrieve. in: query name: max_depth required: false schema: default: 10 description: Maximum depth of the lineage to retrieve. exclusiveMaximum: 100 minimum: 1 title: Max depth type: integer responses: '200': content: application/json: schema: items: $ref: '#/components/schemas/Table' title: >- Response Db Table Upstreams Api V1 Explore Db Data Connection Id Tables Table Path Upstreams Get type: array description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: Table: description: Database table. properties: columns: items: $ref: '#/components/schemas/ColumnReference' title: Columns type: array name: title: Name type: string path: items: type: string title: Table path type: array type: const: Table default: Table title: Type type: string required: - name - columns - path title: Table type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ColumnReference: description: Database table column reference. properties: name: title: Column name type: string required: - name title: ColumnReference type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/lineagev2/get-top-queries-by-execution-count.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get top queries by execution count > Returns the most frequently executed queries with metadata. ## OpenAPI ````yaml openapi-public.json get /api/v1/lineagev2/queries openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. 
::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/lineagev2/queries: get: tags: - lineagev2 summary: Get top queries by execution count description: Returns the most frequently executed queries with metadata. operationId: lineagev2_queries parameters: - in: query name: limit required: false schema: default: 100 title: Limit type: integer - in: query name: statement_type required: false schema: anyOf: - type: string - type: 'null' title: Statement Type responses: '200': content: application/json: schema: $ref: '#/components/schemas/QueriesResponse' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: QueriesResponse: properties: queries: items: $ref: '#/components/schemas/QueryInfo' title: Queries type: array required: - queries title: QueriesResponse type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object QueryInfo: properties: avgDurationMs: anyOf: - type: number - type: 'null' title: Avgdurationms executionCount: anyOf: - type: integer - type: 'null' title: Executioncount fingerprint: title: Fingerprint type: string lastExecuted: anyOf: - type: string - type: 'null' title: Lastexecuted normalizedSql: anyOf: - type: string - type: 'null' title: Normalizedsql popularity: anyOf: - type: number - type: 'null' title: Popularity sqlPreview: anyOf: - type: string - type: 'null' title: Sqlpreview statementType: anyOf: - type: string - type: 'null' title: Statementtype uniqueUsers: anyOf: - type: integer - type: 'null' title: Uniqueusers required: - fingerprint title: QueryInfo type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/dma/get-translation-projects.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get translation projects > Get all translation projects for an organization. This is used for DMA v1 and v2, since it's TranslationProject is a SQLAlchemy model. Version is used to track if it's a DMA v1 or v2 project. ## OpenAPI ````yaml openapi-public.json get /api/v1/dma/projects openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. 
::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/dma/projects: get: tags: - DMA summary: Get translation projects description: >- Get all translation projects for an organization. This is used for DMA v1 and v2, since it's TranslationProject is a SQLAlchemy model. Version is used to track if it's a DMA v1 or v2 project. operationId: list_translation_projects responses: '200': content: application/json: schema: $ref: '#/components/schemas/ApiGetTranslationProjectsResponse' description: Successful Response components: schemas: ApiGetTranslationProjectsResponse: properties: projects: items: $ref: '#/components/schemas/ApiTranslationProjectMeta' title: Projects type: array required: - projects title: ApiGetTranslationProjectsResponse type: object ApiTranslationProjectMeta: description: Translation project metadata. Used for DMA v1 and v2. properties: from_data_source_id: anyOf: - type: integer - type: 'null' title: From Data Source Id id: title: Id type: integer name: title: Name type: string org_id: title: Org Id type: integer repo_name: anyOf: - type: string - type: 'null' title: Repo Name temp_data_source_id: anyOf: - type: integer - type: 'null' title: Temp Data Source Id to_data_source_id: anyOf: - type: integer - type: 'null' title: To Data Source Id version: title: Version type: integer required: - id - org_id - version - from_data_source_id - to_data_source_id - name - repo_name - temp_data_source_id title: ApiTranslationProjectMeta type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/dma_v2/get-translation-summaries-for-all-transforms-in-a-project.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Get translation summaries for all transforms in a project > Get translation summaries for all transforms in a project. Returns a list of transform summaries including transform group metadata, validation status, and execution results. Use this to monitor translation progress and identify failed transforms. ## OpenAPI ````yaml openapi-public.json get /api/v1/dma/v2/projects/{project_id}/transforms openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/dma/v2/projects/{project_id}/transforms: get: tags: - DMA_V2 summary: Get translation summaries for all transforms in a project description: >- Get translation summaries for all transforms in a project. Returns a list of transform summaries including transform group metadata, validation status, and execution results. Use this to monitor translation progress and identify failed transforms. 
operationId: list_transform_summaries parameters: - in: path name: project_id required: true schema: title: Translation project id type: integer responses: '200': content: application/json: schema: $ref: '#/components/schemas/ApiListTransformsResponse' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: ApiListTransformsResponse: properties: transform_summaries: items: $ref: '#/components/schemas/TransformSummary' title: Transform Summaries type: array required: - transform_summaries title: ApiListTransformsResponse type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object TransformSummary: properties: asset_count: title: Asset Count type: integer iterations: title: Iterations type: integer source: title: Source type: string status: title: Status type: string transform_chain: $ref: '#/components/schemas/TransformChain' transform_group: title: Transform Group type: string uuid: title: Uuid type: string validations: additionalProperties: type: string title: Validations type: object required: - transform_group - iterations - uuid - status - validations - asset_count - source - transform_chain title: TransformSummary type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object TransformChain: properties: chain: items: $ref: '#/components/schemas/TransformCollection' title: Chain type: array root_transform_group: $ref: '#/components/schemas/TransformGroup' required: - root_transform_group - chain title: TransformChain type: object TransformCollection: properties: parent_transform_group: $ref: '#/components/schemas/TransformGroup' transforms: items: $ref: '#/components/schemas/Transform' title: Transforms type: array required: - transforms - parent_transform_group title: TransformCollection type: object TransformGroup: description: >- A TransformGroup defines what operations and database objects are transformed together. There can be multiple TransformGroups pointing to the same set of objects+operations to translate. We do this to allow chained transforms, and to be able to keep track of what is transformed and what is not. properties: computed_pk: default: '--invalid-pk-you-shouldn''t-be-seeing-this-ever--' title: Computed Pk type: string debug: additionalProperties: true title: Debug type: object side: $ref: '#/components/schemas/SideEnum' tags: items: type: string title: Tags type: array uniqueItems: true uuid: format: uuid4 title: Uuid type: string required: - side title: TransformGroup type: object Transform: description: >- Represents a transformation. Transformation can be a translation, splitting, perf optimization, refactoring, etc. Translation is linked to inputs and outputs with InputOf and OutputOf edges. Inputs and outputs are: - operations and database objects that transform acts on - versions of input and output operations (that references FileFragments) - output operations have templated code - looks like hyperedges would be helpful to link (src, target, transform but we can't have this. - we can tag source transform group with "done" tag if transform is Done according to acceptance criteria. We can also add other user-level tags. 
The target transform group will be marked as "done" only if it's the final transform. This is denormalization, kind of + workflow tracking. - Every iteration produces a new Transform structure. If we start off not from 0, but from another Transform, we add "BuildsOnTopOf" / "DerivedFrom" / "ChildOf" edge. - For a "parentless" transform we always create a new TransformGroup, even if it's a refactor transform in the same database. We put "draft" tag on the TransformGroup so that we don't mess up reporting. When transform is Done, we remove the "draft" tag. For transform with a parent, we reuse the same TransformGroup. If the new transform succeeds, we mark it as the main one. - Validation tracking: we have to validate multiple artifacts and potentially multiple types of artifacts. We create a full set of TransformValidationOfDataset along with Transform, in a single transaction. properties: computed_pk: default: '--invalid-pk-you-shouldn''t-be-seeing-this-ever--' title: Computed Pk type: string debug: additionalProperties: true description: Debug information title: Debug type: object failure_summary: anyOf: - $ref: '#/components/schemas/FailureSummary' - type: 'null' deprecated: true description: >- DEPRECATED: Use TransformGroupSummary artifact instead. Summaries are per transform group attempt, not per transform iteration. lifecycle_state: $ref: '#/components/schemas/TransformLifecycleState' side: $ref: '#/components/schemas/SideEnum' tags: items: type: string title: Tags type: array uniqueItems: true transform_kind: $ref: '#/components/schemas/TransformKind' uuid: format: uuid4 title: Uuid type: string validation_results: items: $ref: '#/components/schemas/ValidationEntry' title: Validation Results type: array required: - side - transform_kind - lifecycle_state - validation_results title: Transform type: object SideEnum: enum: - source - target title: SideEnum type: string FailureSummary: description: >- Structured failure summary with separate problem, error, and solution sections. properties: error_message: title: Error Message type: string location: anyOf: - type: string - type: 'null' title: Location problem: title: Problem type: string reason: $ref: '#/components/schemas/FailureReason' solution: title: Solution type: string required: - problem - error_message - solution - reason title: FailureSummary type: object TransformLifecycleState: enum: - created - running - done - error title: TransformLifecycleState type: string TransformKind: enum: - squash - bundle title: TransformKind type: string ValidationEntry: description: Represents something that was validated, and how it was validated. 
properties: created_at: format: date-time title: Created At type: string description: description: Description of the performed validation title: Description type: string finished_at: anyOf: - format: date-time type: string - type: 'null' title: Finished At last_updated_at: format: date-time title: Last Updated At type: string notes: description: Optional notes/comments items: type: string title: Notes type: array related_assets: description: >- Assets involved in the validation (mapping between source and destination DWH objects, may be the same for in-db refactorings) items: maxItems: 2 minItems: 2 prefixItems: - $ref: >- #/components/schemas/Gfk_Union_Table__View__SqlSequence__StoredProcedure__UserDefinedFunction__BuiltinFunction__UnresolvedSqlObject__ - $ref: >- #/components/schemas/Gfk_Union_Table__View__SqlSequence__StoredProcedure__UserDefinedFunction__BuiltinFunction__UnresolvedSqlObject__ type: array title: Related Assets type: array requested_validation_kind: $ref: '#/components/schemas/ValidationResultKind' description: Kind of validation performed by this gate result: anyOf: - discriminator: mapping: code_execution: '#/components/schemas/GroupExecutionResult' diff: '#/components/schemas/DiffValidationResult' error: '#/components/schemas/ErrorResult' manual_review: '#/components/schemas/CodeReviewResult' pk_inference: '#/components/schemas/PkInferenceResult' test_case_generation: '#/components/schemas/TestCaseGenerationResult' propertyName: kind oneOf: - $ref: '#/components/schemas/DiffValidationResult' - $ref: '#/components/schemas/CodeReviewResult' - $ref: '#/components/schemas/GroupExecutionResult' - $ref: '#/components/schemas/ErrorResult' - $ref: '#/components/schemas/TestCaseGenerationResult' - $ref: '#/components/schemas/PkInferenceResult' - type: 'null' description: >- Validation result. May be different than requested validation kind (in case of a fatal error for example. title: Result status: $ref: '#/components/schemas/ValidationStatus' description: Current validation status required: - description - status - requested_validation_kind - result - related_assets title: ValidationEntry type: object FailureReason: description: Reasons why an agent failed to complete its task. enum: - max_iterations - tool_error - resignation title: FailureReason type: string ValidationResultKind: enum: - manual_review - diff - code_execution - compilation - test_case_generation - pk_inference - error title: ValidationResultKind type: string DiffValidationResult: description: >- Diff validation result (produced data-sets are expected to match with reference datasets) properties: datadiff_id: description: Identifier of started data-diff title: Datadiff Id type: integer diff_result: anyOf: - $ref: '#/components/schemas/DataDiffResultValues' - type: 'null' description: Data-diff result (null if diff is currently running) has_schema_type_differences: default: false description: >- Whether there are schema differences we care about (e.g., timestamp timezone mismatch) title: Has Schema Type Differences type: boolean human_readable_feedback: anyOf: - type: string - type: 'null' description: >- A human-readable representation of the diff results (typically given as feedback to the human-in-the-loop or LLM agent. 
title: Human Readable Feedback incremental_iteration: anyOf: - type: integer - type: 'null' deprecated: true description: The iteration number of the incremental operation, if applicable title: Incremental Iteration is_zero_row_diff: default: false description: >- Whether the diff was performed on datasets with zero rows on both sides title: Is Zero Row Diff type: boolean kind: const: diff default: diff title: Kind type: string reference_dataset: $ref: '#/components/schemas/Gfk_DfTable_' description: >- Reference dataset: the reference dataset. Typically, this would be the asset in the source DWH. In the data-diff, this is the left-hand-side table. total_iterations: anyOf: - type: integer - type: 'null' deprecated: true description: >- The total number of iterations of the incremental operation, if applicable title: Total Iterations validated_dataset: $ref: '#/components/schemas/Gfk_DfTable_' description: >- Validated dataset: the dataset produced by the transformed code, compared against the reference dataset. Typically, this would be the asset in the destination DWH (or in the source DWH in case of a refactoring). In the data-diff, this is the right-hand-side table. required: - reference_dataset - validated_dataset - datadiff_id - diff_result - human_readable_feedback title: DiffValidationResult type: object CodeReviewResult: description: >- Code review: a user or agent reviewed the transformed code and approved / rejected with or without comments. properties: feedback: description: Optional feedback left by user / agent items: anyOf: - $ref: '#/components/schemas/GeneralReviewComment' - $ref: '#/components/schemas/CodeChunkComment' title: Feedback type: array kind: const: manual_review default: manual_review title: Kind type: string review_status: $ref: '#/components/schemas/ReviewStatus' description: Review outcome reviewed_by: anyOf: - $ref: '#/components/schemas/User' - $ref: '#/components/schemas/AIAgent' description: Who submitted the review title: Reviewed By required: - reviewed_by - review_status title: CodeReviewResult type: object GroupExecutionResult: description: >- Group execution result (all transformed steps are expected to execute successfully) properties: execution_group: $ref: '#/components/schemas/Gfk_ExecutionGroup_' description: Executed group of operations execution_result_set: anyOf: - $ref: '#/components/schemas/Gfk_ExecutionResultSet_' - type: 'null' description: >- The execution result set (only set if end-to-end execution was successful) failed_execution_steps: description: Steps that failed within the group, and associated error metadata items: $ref: '#/components/schemas/ExecutionErrorInfo' title: Failed Execution Steps type: array kind: const: code_execution default: code_execution title: Kind type: string required: - execution_group - failed_execution_steps - execution_result_set title: GroupExecutionResult type: object ErrorResult: description: >- Generic error result, to be used in cases where the validation errored/crashed. 
properties: kind: const: error default: error title: Kind type: string message: description: Error message title: Message type: string stack_trace: anyOf: - type: string - type: 'null' description: Optional crash stack trace title: Stack Trace required: - message title: ErrorResult type: object TestCaseGenerationResult: description: TestCaseGenerationResult properties: kind: const: test_case_generation default: test_case_generation title: Kind type: string test_case_generation_error: anyOf: - type: string - type: 'null' title: Test Case Generation Error test_cases: description: Test cases items: $ref: '#/components/schemas/TestCase' title: Test Cases type: array required: - test_cases title: TestCaseGenerationResult type: object PkInferenceResult: description: Primary key inference validation result properties: candidates_found: default: 0 description: Number of PK candidates found during inference title: Candidates Found type: integer dataset: $ref: '#/components/schemas/Gfk_Union_Table__View__DfTable__' description: Dataset for which PK was inferred error_message: anyOf: - type: string - type: 'null' title: Error Message inference_settings: anyOf: - $ref: '#/components/schemas/PkInferenceSettings' - type: 'null' description: Settings used to infer dataset PK inference_success: description: Whether PK inference succeeded title: Inference Success type: boolean inferred_pk_columns: anyOf: - items: type: string type: array - maxItems: 1 minItems: 1 prefixItems: - type: string type: array - type: 'null' title: Inferred Pk Columns kind: const: pk_inference default: pk_inference title: Kind type: string uniqueness_ratio: anyOf: - type: number - type: 'null' title: Uniqueness Ratio required: - dataset - inference_success title: PkInferenceResult type: object ValidationStatus: enum: - pending - running - failed - success title: ValidationStatus type: string DataDiffResultValues: enum: - error - bad-pks - different - missing-pks - identical - empty title: DataDiffResultValues type: string Gfk_DfTable_: maxItems: 2 minItems: 2 prefixItems: - title: Kind type: string - title: Fk type: string type: array GeneralReviewComment: properties: comment: title: Comment type: string comment_kind: const: global default: global title: Comment Kind type: string required: - comment title: GeneralReviewComment type: object CodeChunkComment: properties: comment: title: Comment type: string comment_kind: const: code_chunk default: code_chunk title: Comment Kind type: string end_line_no: title: End Line No type: integer start_line_no: title: Start Line No type: integer required: - start_line_no - end_line_no - comment title: CodeChunkComment type: object ReviewStatus: enum: - signed_off - adjustments_needed title: ReviewStatus type: string User: properties: agent_kind: const: human default: human title: Agent Kind type: string user_id: title: User Id type: integer required: - user_id title: User type: object AIAgent: properties: agent_description: title: Agent Description type: string agent_kind: const: ai_agent default: ai_agent title: Agent Kind type: string required: - agent_description title: AIAgent type: object Gfk_ExecutionGroup_: maxItems: 2 minItems: 2 prefixItems: - title: Kind type: string - title: Fk type: string type: array Gfk_ExecutionResultSet_: maxItems: 2 minItems: 2 prefixItems: - title: Kind type: string - title: Fk type: string type: array ExecutionErrorInfo: properties: compiled_sql: anyOf: - type: string - type: 'null' description: >- Compiled SQL (with Jinja resolved) if available, useful for 
debugging syntax errors title: Compiled Sql error_message: description: Error received when trying to execute the step title: Error Message type: string execution_step: $ref: '#/components/schemas/Gfk_SqlExecutionStep_' description: Execution step that failed to execute required: - execution_step - error_message title: ExecutionErrorInfo type: object TestCase: description: >- TestCase for validating a number of transformed SQL operations. TestCase provides full data environment required to execute transformed SQL operations in the Target system, and captured "reference" state of the system after the code gets executed. Testcase is generated based on list of operations in the Source system that we want to test, and a mapping between objects in the Source and Target systems. The mapping is nessesary for us to know where to copy the data, and to formulate the test in terms of objects in the Target system, so it's ready to use. `before` and `after` represent (relevant) state of the system before and after the code gets executed. They are maps between: - "canonical" object ref in the system where the test case is to be executed. E.g. if it's a test built for Target system, then those names reference Target system objects. - value is where the actual data is stored. NOTE: - DfTables in `before` are in the Target system so that we can readily execute transformed code with zero friction. - DfTables in `after` may be in the Source or Target system, depending on `config`. We usually don't care much, since we have cross-db diffs. In some cases, though, we copy the data over to use in-db diffs. Sometimes in Live mode we can't connect to the Source system at all, so we are forced to use Target system from the start. This node is linked to: - Self -InputOf-> ExecutionGroup (which execution group this tests) properties: computed_pk: default: '--invalid-pk-you-shouldn''t-be-seeing-this-ever--' title: Computed Pk type: string config: discriminator: mapping: live_full_rebuild: '#/components/schemas/LiveFullRebuildConfig' live_incremental: '#/components/schemas/LiveIncrementalConfig' sandboxed_full_rebuild: '#/components/schemas/SandboxedFullRebuildConfig' sandboxed_incremental: '#/components/schemas/SandboxedIncrementalConfig' propertyName: kind oneOf: - $ref: '#/components/schemas/SandboxedFullRebuildConfig' - $ref: '#/components/schemas/SandboxedIncrementalConfig' - $ref: '#/components/schemas/LiveFullRebuildConfig' - $ref: '#/components/schemas/LiveIncrementalConfig' title: Config debug: additionalProperties: true description: Debug information title: Debug type: object expanded_config: anyOf: - discriminator: mapping: live_full_rebuild: '#/components/schemas/LiveFullRebuildConfig' live_incremental: '#/components/schemas/LiveIncrementalConfig' sandboxed_full_rebuild: '#/components/schemas/SandboxedFullRebuildConfig' sandboxed_incremental: '#/components/schemas/SandboxedIncrementalConfig' propertyName: kind oneOf: - $ref: '#/components/schemas/SandboxedFullRebuildConfig' - $ref: '#/components/schemas/SandboxedIncrementalConfig' - $ref: '#/components/schemas/LiveFullRebuildConfig' - $ref: '#/components/schemas/LiveIncrementalConfig' - type: 'null' title: Expanded Config expected_outputs: additionalProperties: $ref: '#/components/schemas/Gfk_DfTable_' propertyNames: $ref: '#/components/schemas/Gfk_Union_Table__View__' title: Expected Outputs type: object inputs: additionalProperties: $ref: '#/components/schemas/Gfk_DfTable_' propertyNames: $ref: '#/components/schemas/Gfk_Union_Table__View__' title: Inputs 
type: object is_complete: default: true title: Is Complete type: boolean order: default: 0 title: Order type: integer side: $ref: '#/components/schemas/SideEnum' default: target tags: items: type: string title: Tags type: array uniqueItems: true uuid: format: uuid4 title: Uuid type: string required: - inputs - expected_outputs - config title: TestCase type: object Gfk_Union_Table__View__DfTable__: maxItems: 2 minItems: 2 prefixItems: - title: Kind type: string - title: Fk type: string type: array PkInferenceSettings: properties: allow_null_pks: default: false title: Allow Null Pks type: boolean excluded_columns: anyOf: - items: type: string type: array uniqueItems: true - type: 'null' title: Excluded Columns title: PkInferenceSettings type: object Gfk_SqlExecutionStep_: maxItems: 2 minItems: 2 prefixItems: - title: Kind type: string - title: Fk type: string type: array SandboxedFullRebuildConfig: description: >- Our usual when we can run operations at the source. - Collect input data at the source, - Execute operations, - Copy the data to the destination (we can optionally copy the outputs to use for in-db diffs instead of cross-db diffs), properties: collect_at_destination: $ref: '#/components/schemas/CollectAtDestinationOptions' default: inputs kind: const: sandboxed_full_rebuild default: sandboxed_full_rebuild title: Kind type: string max_synth_scenarios: default: 0 title: Max Synth Scenarios type: integer source: anyOf: - $ref: '#/components/schemas/Freeze' - $ref: '#/components/schemas/TakeSample' - $ref: '#/components/schemas/Synthesize' - $ref: '#/components/schemas/GroupDownsample' - $ref: '#/components/schemas/PointToData' title: Source required: - source title: SandboxedFullRebuildConfig type: object SandboxedIncrementalConfig: description: >- Use to test incremental logic, when existing tables are updated with new data. At t1 we collect outputs in the source data. At t2 we collect inputs, so that they have additional data (hopefully). properties: collect_at_destination: $ref: '#/components/schemas/CollectAtDestinationOptions' default: inputs kind: const: sandboxed_incremental default: sandboxed_incremental title: Kind type: string max_synth_scenarios: default: 0 title: Max Synth Scenarios type: integer source_t1: anyOf: - $ref: '#/components/schemas/Freeze' - $ref: '#/components/schemas/PointToData' title: Source T1 source_t2: anyOf: - $ref: '#/components/schemas/Freeze' - $ref: '#/components/schemas/PointToData' title: Source T2 required: - source_t1 - source_t2 title: SandboxedIncrementalConfig type: object LiveFullRebuildConfig: description: |- Live mode where we can't run anything at the source, and instead we are comparing our results with live prod data, expecting some drift. We support few modes here: - we can freeze source data and copy it to the destination ourselves, - we can point to live prod data & do copy of that, - we can skip data collection at source entrirely, and rely on data captured at destination for both inputs and outputs, if the customer copied the data over already. One fundamental problem with Live mode is that it's not usually enough to capture data as-is and call it a day. If we do it, then data "before" and "after" execution will be the same, and no-op translation (that does literally nothing)will be accepted by validator just fine. So we usually have to mangle the data somehow, e.g. "truncate" the tables that are supposed to be outputs, or chip off some portion of the data (e.g. for the last couple of days). 
properties: collect_at_destination: $ref: '#/components/schemas/CollectAtDestinationOptions' default: inputs destination_after: discriminator: mapping: apply_sql_filter: '#/components/schemas/ApplySqlFilter' copy_data: '#/components/schemas/CopyData' freeze: '#/components/schemas/Freeze' multiplex_on_mutability: '#/components/schemas/MultiplexOnMutability' point_to_data: '#/components/schemas/PointToData' propertyName: kind oneOf: - $ref: '#/components/schemas/Freeze' - $ref: '#/components/schemas/PointToData' - $ref: '#/components/schemas/MultiplexOnMutability' - $ref: '#/components/schemas/ApplySqlFilter' - $ref: '#/components/schemas/CopyData' title: Destination After destination_before: discriminator: mapping: apply_sql_filter: '#/components/schemas/ApplySqlFilter' copy_data: '#/components/schemas/CopyData' freeze: '#/components/schemas/Freeze' multiplex_on_mutability: '#/components/schemas/MultiplexOnMutability' point_to_data: '#/components/schemas/PointToData' propertyName: kind oneOf: - $ref: '#/components/schemas/Freeze' - $ref: '#/components/schemas/PointToData' - $ref: '#/components/schemas/MultiplexOnMutability' - $ref: '#/components/schemas/ApplySqlFilter' - $ref: '#/components/schemas/CopyData' title: Destination Before kind: const: live_full_rebuild default: live_full_rebuild title: Kind type: string max_synth_scenarios: default: 0 title: Max Synth Scenarios type: integer source_after: anyOf: - discriminator: mapping: apply_sql_filter: '#/components/schemas/ApplySqlFilter' freeze: '#/components/schemas/Freeze' multiplex_on_mutability: '#/components/schemas/MultiplexOnMutability' point_to_data: '#/components/schemas/PointToData' propertyName: kind oneOf: - $ref: '#/components/schemas/Freeze' - $ref: '#/components/schemas/PointToData' - $ref: '#/components/schemas/MultiplexOnMutability' - $ref: '#/components/schemas/ApplySqlFilter' - type: 'null' title: Source After source_before: anyOf: - discriminator: mapping: apply_sql_filter: '#/components/schemas/ApplySqlFilter' freeze: '#/components/schemas/Freeze' multiplex_on_mutability: '#/components/schemas/MultiplexOnMutability' point_to_data: '#/components/schemas/PointToData' propertyName: kind oneOf: - $ref: '#/components/schemas/Freeze' - $ref: '#/components/schemas/PointToData' - $ref: '#/components/schemas/MultiplexOnMutability' - $ref: '#/components/schemas/ApplySqlFilter' - type: 'null' title: Source Before required: - source_before - source_after - destination_before - destination_after title: LiveFullRebuildConfig type: object LiveIncrementalConfig: description: >- Use to test incremental logic, when existing tables are updated with new data. At t1 we collect outputs in the source data. At t2 we collect inputs, so that they have additional data (hopefully). 
properties: collect_at_destination: $ref: '#/components/schemas/CollectAtDestinationOptions' default: inputs destination_t1: anyOf: - $ref: '#/components/schemas/Freeze' - $ref: '#/components/schemas/PointToData' - $ref: '#/components/schemas/CopyData' title: Destination T1 destination_t2: anyOf: - $ref: '#/components/schemas/Freeze' - $ref: '#/components/schemas/PointToData' - $ref: '#/components/schemas/CopyData' title: Destination T2 kind: const: live_incremental default: live_incremental title: Kind type: string max_synth_scenarios: default: 0 title: Max Synth Scenarios type: integer source_t1: anyOf: - $ref: '#/components/schemas/Freeze' - $ref: '#/components/schemas/PointToData' - type: 'null' title: Source T1 source_t2: anyOf: - $ref: '#/components/schemas/Freeze' - $ref: '#/components/schemas/PointToData' - type: 'null' title: Source T2 required: - source_t1 - source_t2 - destination_t1 - destination_t2 title: LiveIncrementalConfig type: object CollectAtDestinationOptions: description: >- To verify transforms at the destination, we require Inputs to be there. We can optionally collect also "Asset" datasets. If we do it, then we can use in-db diffs instead of cross-db diffs. Additionally we can copy / collect everything that's referenced, just in case. We don't care about non-asset outputs, but in presense of bugs we might want to copy everything that's possibly related to have a full set of data. We'll be more safe if we discover later that we got some mutability wrong, etc. How exactly collection is done (is it freeze, copy, etc.) is determined by testcase config. enum: - inputs - inputs_and_asset_results - inputs_and_all_results title: CollectAtDestinationOptions type: string Freeze: additionalProperties: false description: Create an immutable copy of a table/view. properties: end_time: anyOf: - format: date-time type: string - type: 'null' title: End Time kind: const: freeze default: freeze title: Kind type: string source_path_remap_label: anyOf: - type: string - type: 'null' title: Source Path Remap Label start_time: anyOf: - format: date-time type: string - type: 'null' title: Start Time title: Freeze type: object TakeSample: additionalProperties: false description: >- Uncorrelated sampling, ranges from "pick whatever N rows" to "pick N rows pseudo-randomly". properties: kind: const: take_sample default: take_sample title: Kind type: string method: $ref: '#/components/schemas/SimpleSamplingMethod' target_sample_size: title: Target Sample Size type: integer required: - target_sample_size - method title: TakeSample type: object Synthesize: additionalProperties: false description: Generate a set of data with LLM. It's specific to an execution group. properties: fail_on_incomplete_data: title: Fail On Incomplete Data type: boolean kind: const: synthesize default: synthesize title: Kind type: string max_iterations_per_table: title: Max Iterations Per Table type: integer maximum_rows: title: Maximum Rows type: integer minimum_rows: title: Minimum Rows type: integer scenario: anyOf: - $ref: '#/components/schemas/Scenario' - type: 'null' required: - minimum_rows - maximum_rows - max_iterations_per_table - fail_on_incomplete_data title: Synthesize type: object GroupDownsample: additionalProperties: false description: >- Take a sample out of data produced by any sampler of execution group scope type. We are using this to generate ad-hoc data to test incrementals. 
properties: base: $ref: '#/components/schemas/Synthesize' kind: const: group_downsample default: group_downsample title: Kind type: string ratio: title: Ratio type: number required: - base - ratio title: GroupDownsample type: object PointToData: additionalProperties: false description: |- Just point to existing tables / views. They could be live prod data, or data pre-frozen by us or customer. TODO: what about pointing to dftable if it's same db? properties: kind: const: point_to_data default: point_to_data title: Kind type: string source_path_remap_label: anyOf: - type: string - type: 'null' title: Source Path Remap Label title: PointToData type: object MultiplexOnMutability: additionalProperties: false description: |- Collection of unholy ad-hoc data mangling strategies, mostly for Live Full Rebuild mode. V1 is here since I expect this to be disposable. Alternative approach could be providing name of properties: kind: const: multiplex_on_mutability default: multiplex_on_mutability title: Kind type: string read_only: anyOf: - $ref: '#/components/schemas/Freeze' - $ref: '#/components/schemas/PointToData' - $ref: '#/components/schemas/CreateEmptyDataset' - $ref: '#/components/schemas/ApplySqlFilter' title: Read Only read_write: anyOf: - $ref: '#/components/schemas/Freeze' - $ref: '#/components/schemas/PointToData' - $ref: '#/components/schemas/CreateEmptyDataset' - $ref: '#/components/schemas/ApplySqlFilter' title: Read Write write_only: anyOf: - $ref: '#/components/schemas/Freeze' - $ref: '#/components/schemas/PointToData' - $ref: '#/components/schemas/CreateEmptyDataset' - $ref: '#/components/schemas/ApplySqlFilter' title: Write Only required: - read_only - write_only - read_write title: MultiplexOnMutability type: object ApplySqlFilter: additionalProperties: false description: Filter data with a SQL query. properties: base: anyOf: - $ref: '#/components/schemas/Freeze' - $ref: '#/components/schemas/PointToData' title: Base kind: const: apply_sql_filter default: apply_sql_filter title: Kind type: string label: title: Label type: string where_clause: title: Where Clause type: string required: - label - base - where_clause title: ApplySqlFilter type: object CopyData: additionalProperties: false description: >- Copy data from one database to another. This might not belong here, actually, and may be need to moved to testcases. properties: kind: const: copy_data default: copy_data title: Kind type: string title: CopyData type: object SimpleSamplingMethod: enum: - limit - system - pseudo_random title: SimpleSamplingMethod type: string Scenario: description: >- Describes what aspect of a query should be tested. Not a Node - just a structured data type embedded in TestCase and Synthesize. Used for type safety and caching (frozen model enables equality comparison). ARCHITECTURE NOTE: Currently the full Scenario is embedded in Synthesize, making it part of DataCollectionConfigNode's cache key. This works because all Scenario fields currently affect data generation. FUTURE: If we add fields that don't affect data (e.g., parameters or vars) but would be executed against the same data, different query branches, we could: 1. Add a `data_cache_key` property that hashes only data-affecting fields of the scenario 2. 
Modify caching in get_or_create_options_node() to use the data_cache_key property properties: prompt: title: Prompt type: string source: default: llm enum: - llm - project_setting - synthetic title: Source type: string required: - prompt title: Scenario type: object CreateEmptyDataset: additionalProperties: false description: >- Create an empty dataset with same schema as the source dataset but with no data. properties: kind: const: create_empty_dataset default: create_empty_dataset title: Kind type: string title: CreateEmptyDataset type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/deployment-testing/getting-started.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Getting Started with CI/CD Testing > Learn how to set up CI/CD testing with Datafold by integrating your data connections, code repositories, and CI pipeline for automated testing. **TEAM CLOUD** Interested in adding Datafold Team Cloud to your CI pipeline? [Let's talk](https://calendly.com/d/zkz-63b-23q/see-a-demo?email=clay%20analytics%40datafold.com\&first_name=Clay\&last_name=Moeller\&a1=\&month=2024-07)! ## Getting Started with Deployment Testing To get started, first set up your [data connection](https://docs.datafold.com/integrations/databases) to ensure that Datafold can access and monitor your data sources. Next, integrate Datafold with your version control system by following the instructions for [code repositories](https://docs.datafold.com/integrations/code-repositories). This allows Datafold to track and test changes in your data pipelines. Add Datafold to your continuous integration (CI) pipeline to enable automated deployment testing. You can do this through our universal [Fully-Automated](../deployment-testing/getting-started/universal/fully-automated), [No-Code](../deployment-testing/getting-started/universal/no-code), [API](../deployment-testing/getting-started/universal/api), or [dbt](../integrations/orchestrators) integrations. Optionally, you can [connect data apps](https://docs.datafold.com/integrations/bi_data_apps) to extend your testing and monitoring to data applications like BI tools. --- # Source: https://docs.datafold.com/integrations/code-repositories/github.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # GitHub **PREREQUISITES** * Datafold Admin role * Your GitHub account must be a member of the GitHub organization where the Datafold app is to be installed * Approval of your request to add the Datafold app to your repo must be granted by a GitHub repo admin or GitHub organization owner. To set up a new integration, click the repository field and select the **Install GitHub app** button. From here, GitHub will redirect you to login to your account and choose which organization you would like to connect. After choosing the right organization, you may choose to allow access to all repositories or specific ones. Once complete, you will be redirected back to Datafold, where you can select the appropriate repository for connection. **TIP** If you lack permission to add the Datafold app, request approval from a GitHub admin. 
After installation, click **Refresh** to display the newly added repositories in the dropdown list. To complete the setup, click **Save**! **INFO** VPC deployments are an Enterprise feature. Please email [sales@datafold.com](mailto:sales@datafold.com) to enable your account. ## GitHub integration for VPC / single-tenant Datafold deployments ### Create a GitHub application VPC clients of Datafold need to create their own GitHub app, rather than use the shared Datafold GitHub application. Start by navigating to **Settings** → **Global Settings**. To begin the setup process, enter the domain that was registered for the VPC deployment in [AWS](/datafold-deployment/dedicated-cloud/aws) or [GCP](/datafold-deployment/dedicated-cloud/gcp). Then, enter the name of the GitHub organization where you'd like to install the application. When filled, click **Create GitHub App**. This will redirect the admin to GitHub, where they may need to authenticate. **The GitHub user must be an admin of the GitHub organization.** After authentication, you should be directed to enter a description for the GitHub App. After entering the description, click **Create GitHub app**. Once the application is created, you should be returned to the Datafold settings screen. The **Create GitHub App** button will have disappeared, and the details for the GitHub App will be visible. ### Making the GitHub application public If you have a private GitHub instance with multiple organizations and want to use the Datafold app across all of them, you'll need to make the app public on your private server. You can do so in GitHub by following these steps: 1. Navigate to the GitHub organization where the app was created. 2. Click **Settings**. 3. Go to **Developer Settings** → **GitHub Apps**. 4. Select the **Datafold app**. 5. Click **Advanced**, then **Make public**. The app will be public **only on your private GitHub server**, ensuring it can be accessed across all your organizations. ### Configure GitHub in Datafold If the settings screen shows all of these details, you've successfully created a GitHub App! Now that the app is created, install it using the [GitHub integration setup](/integrations/code-repositories/github). --- # Source: https://docs.datafold.com/integrations/code-repositories/gitlab.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # GitLab To get the [project access token](https://docs.gitlab.com/ee/user/project/settings/project%5Faccess%5Ftokens.html), navigate to your GitLab project settings and create a new token. **TIP** Project access tokens are preferred over personal tokens for security. When configuring your token, select the **Maintainer** role and select the **api** scope. **Project Name** is your GitLab project URL after `gitlab.com/`. For example, if your GitLab project URL is `https://gitlab.com/datafold/dbt/`, your Project Name is `datafold/dbt/`. Finally, navigate back to Datafold and enter the **Project Token** and the name of your **Project** before hitting **Save**: If you want to change the GitLab URL, you can do so after setting up the integration. To do so, navigate to **Settings**, then **Org Settings**: --- # Source: https://docs.datafold.com/integrations/databases/google-cloud-storage.md # Google Cloud Storage (GCS) **Steps to complete:** 1. [Create a Service Account](/integrations/databases/google-cloud-storage#create-a-service-account) 2.
[Give the Service Account Storage Object Admin access](/integrations/databases/google-cloud-storage#service-account-access-and-permissions) 3. [Generate a Service Account JSON key](/integrations/databases/google-cloud-storage#generate-a-service-account-key) 4. [Configure your data connection in Datafold](/integrations/databases/google-cloud-storage#configure-in-datafold) ## Create a Service Account To connect Datafold to your Google Cloud Storage bucket, you will need to create a *service account* for Datafold to use. * Navigate to the [Google Cloud Console](https://console.cloud.google.com/), click on the drop-down to the left of the search bar, and select the project you want to connect to. * *Note: If you do not see your project, you may need to switch accounts.* * Click on the hamburger menu in the upper left, then select **IAM & Admin** followed by **Service Accounts**. * Create a service account named `Datafold`. ## Service Account Access and Permissions The Datafold service account requires the following roles and permissions: * **Storage Object Admin** for read and write access on all the datasets in the project. ## Generate a Service Account Key Next, go back to the **IAM & Admin** page to generate a key for Datafold. * Click on the **Service Accounts** page. * Click on the **Datafold** service account. * Click on the **Keys** tab. * Click on **Add Key** and select **Create new key**. * Select **JSON** and click **Create**. We recommend using the JSON formatted key. After creating the key, it will be saved on your local machine. ## Configure in Datafold | Field Name | Description | | --------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------- | | Connection name | A name given to the data connection within Datafold | | Bucket Name | The name of the bucket you want to connect to. | | Bucket region | The region of the bucket you want to connect to. | | JSON Key File | The key file generated in the [Generate a Service Account JSON key](/integrations/databases/google-cloud-storage#generate-a-service-account-key) step | | Directory for writing diff results | Optional. The directory in the bucket where diff results will be written. Service account should have write access to this directory. | | Default maximum number of rows to include in diff results | Optional. The maximum number of rows that a file with materialized results will contain. | Click **Create**. Your data connection is ready! --- # Source: https://docs.datafold.com/security/single-sign-on/google-oauth.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Google OAuth **NOTE** Google SSO is available for both SaaS and VPC installations of Datafold. ## Datafold SaaS For Datafold SaaS the setup only involves enabling Google SSO integration. If Google SSO is already enabled for your organization you will see it in the **Settings** → **Integrations** → **SSO**. If this is not the case, create a new Google SSO Integration by clicking on the **Add new integration** button. Enable the **Allow Google logins in organization** switch and click **Save**. That's it! If you are not using Datafold SaaS, please see below. 
## Create OAuth Client ID To begin, navigate to the [Google admin console](https://console.cloud.google.com/apis/credentials?authuser=1%5C\&folder=%5C) for your organization, click **Create Credentials**, and select **OAuth Client ID**. **TIP** To configure OAuth, you may need to first configure your consent screen. We recommend selecting **Internal** to keep access limited to users in your Google workspace and organization. ### Configure OAuth * **Application type**: "Web application" * **Authorized JavaScript origins**: `https://` * **Authorized redirect URIs**: `https:///oauth/google` Finally, click **Create**. You will see a set of credentials that you will copy over to your Datafold Global Settings. ## Configure Google OAuth in Datafold To finish the configuration, create a new Google SSO integration in Datafold by navigating to **Settings** → **Integrations** → **SSO** → **Add new integration** → **Google**. * Enable the **Google OAuth** switch. * Enter the **domain** or URL of your OAuth Client ID in the respective field. * Paste the **Client Secret** into the respective field. * Enable the **Allow Google logins in Organization** switch. * Finally, click **Save**. --- # Source: https://docs.datafold.com/security/single-sign-on/saml/examples/google.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Google ## Google as a SAML Identity Provider Enable SAML in your Google Workspace. Check [Set up your own custom SAML app](https://support.google.com/a/answer/6087519?hl=en) for more details. **CAUTION** You need to be a **super-admin** in the Google Workspace to configure a SAML application. * Go to `Google`, click on **Download Metadata** in the left sidebar and **copy** the XML. * Select **Email** as the Name ID format. * Select **Basic Information > Primary email** as the Name ID. * Go to `Datafold` and create a new SSO integration. Navigate to **Settings** → **Integrations** → **Add new integration** → **SAML**. * Copy the read-only field **Service Provider ACS URL**, go to `Google` and paste it into **ACS URL**. * Copy the read-only field **Service Provider Entity ID**, go to `Google` and paste it into **Entity ID**. * Paste the **copied** XML into `Datafold`'s **Identity Provider Metadata XML** field. * Click **Save** to create the integration. * (Optional step) Configure the attribute mapping as follows: * **First Name** → `first_name` * **Last Name** → `last_name` --- # Source: https://docs.datafold.com/security/single-sign-on/saml/group-provisioning.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. > Automatically sync group membership with your SAML Identity Provider (IdP). # Group Provisioning ## 1. Create desired groups in the IdP ## 2. Assign the desired users to groups Assign the relevant users to groups reflecting their roles and permissions. ## 3. Configure the SAML SSO provider Configure your SAML SSO provider to include a `groups` attribute. This attribute should list all the groups you want to sync, for example: ```Bash theme={null} datafold_admin datafold_read_write ``` ## 4.
Map IdP groups to Datafold groups The `datafold_admin` group, created in the IdP through [step 1](#1-create-desired-groups-in-the-idp), will be automatically synced. Users in this IdP group will also be members of the corresponding group in Datafold. **Note:** Manual Datafold user group memberships will be overridden upon the user's next login to Datafold. Therefore, group memberships should be managed exclusively within the IdP once the `groups` attribute is configured. ## Example configuration Here's how you might configure three groups to map to the three default Datafold groups, `admin`, `default` and `viewonly`: --- # Source: https://docs.datafold.com/deployment-testing/best-practices/handling-data-drift.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Handling Data Drift > Ensuring Datafold in CI executes apples-to-apples comparison between staging and production environments. **Note** This section of the docs is only relevant if the data used as inputs during the PR build are inconsistent with the data used as inputs during the last production build. Please contact [support@datafold.com](mailto:support@datafold.com) if you'd like to learn more. ## What is data drift in CI? Datafold is used in CI to illuminate the impact of a pull request's proposed code change by comparing two versions of the data and identifying differences. **Data drift in CI** happens when those data differences occur due to *changes in upstream data sources*—not because of proposed code changes. Data drift in CI adds "noise" to your CI testing analysis, making it tricky to tell if data differences are due to new code, or changes in the source data. Unless both versions rely on the same snapshot of upstream data, data drift can compromise your ability to see the true effect of the code changes. **Tip** dbt users should implement Slim CI in [dbt Core](https://www.datafold.com/blog/taking-your-dbt-ci-pipeline-to-the-next-level) or [dbt Cloud](https://www.datafold.com/blog/slim-ci-the-cost-effective-solution-for-successful-deployments-in-dbt-cloud) to prevent most instances of data drift. Slim CI reduces build time and eliminates most instances of data drift because the CI build depends on upstreams in production due to state deferral. However, Slim CI will not *completely* eliminate data drift in CI, specifically in cases where the model being modified in the PR depends on a source. In those cases, we recommend [**building twice in CI**](/deployment-testing/best-practices/handling-data-drift#build-twice-in-ci). ## Why prevent data drift in CI? By eliminating data drift entirely, you can be confident that any differences detected in CI are driven only by your code, not unexpected data changes. You can think of this as similar to a scientific experiment, where the control versus treatment groups ideally exist in identical baseline conditions, with the treatment as the only variable which would cause differential outcomes. In practice, many organizations do not completely eliminate data drift, and still derive value from automatic data diffing and analysis conducted by Datafold in CI, in spite of minor noise that does exist. 
## Handling data drift We recommend two options for removing data drift to the greatest extent possible: * [Build twice in CI](#build-twice-in-ci) * [Build CI data from clone of prod sources](#build-ci-data-from-clone-of-prod-sources) In both of these approaches, Datafold compares transformations of identical upstream data, so that any detected differences will be due to the code changes alone, ensuring an accurate comparison with no false positives. By building two versions of the data in CI, you can ensure an "apples-to-apples" comparison that depends on the same version of upstream data. When deciding between the two, choose the one that best matches your workflow: | Workflow | Approach | Why | | ----------------------------------------------------- | ----------------------------- | --------------------------------------------------------------------------------------------- | | Data changes frequently in production | Build twice in CI | Isolates PR impact without waiting on recent production updates, using a consistent snapshot. | | Production has complex orchestration or multiple jobs | Build CI data from prod clone | Allows a stable comparison by freezing upstream data from a fixed production state. | | Performance and speed are critical | Build CI data from prod clone | Limits CI build to a single snapshot, reducing the processing load on the pipeline. | | Simplified orchestration with minimal dependencies | Build twice in CI | Reduces the need to manage production snapshots by running both environments in CI. | ### Build twice in CI This method involves two CI builds: one representing PR data, and another representing production data, both based on an identical snapshot of upstream data. 1. Create a fixed snapshot of the upstream data that both builds will use. 2. The CI pipeline executes two builds: one using the PR branch of code, and another using the base branch of code. 3. Datafold compares these two data environments, both created in CI, and detects differences. If performance is a concern, you can use a reduced or filtered upstream data set to speed up the CI process while still providing rich insight into the data. This method assumes the production build doesn’t involve multiple jobs that process different sets of models at different times. ### Build CI data from clone of prod sources This method involves comparing a CI build based on a snapshot of the upstream source data *from the time of the last production build* to the production version of transformed data. 1. Update orchestration to create and store a snapshot of the upstream source data at the time of the production transformation job. 2. The CI pipeline executes a data transformation build using the PR branch of code, with the snapshotted upstream data as the upstream source. 3. Datafold compares the CI data environment with production data and detects differences. --- # Source: https://docs.datafold.com/integrations/bi-data-apps/hightouch.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Hightouch > Navigate to Settings > Integrations > Data Apps and add a Hightouch Integration. 
## Create a Hightouch Integration Complete the configuration by specifying the following fields: | Field Name | Description | | ----------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | Integration name | An identifier used in Datafold to identify this Data App configuration. | | Workspace URL | Then, grab your workspace URL, by navigating to **Settings** → **Workspace** tab → **Workspace slug** or by finding the workspace name in the search bar ([https://app.hightouch.io/](https://app.hightouch.io/) \). | | API Key | Log into your [Hightouch account](https://app.hightouch.com/login) and navigate to **Settings** → **API keys** tab → **Add API key** to generate a new, unique API key. Your API key will appear only once, so please copy and save it to your password manager for further use. | | Data connection mapping | When the correct credentials are entered we will begin to populate data connections in Hightouch (on the left side) that will need to be mapped to data connections configured in Datafold (on the right side). See image below. | When completed, click **Submit**. It may take some time to sync all the Hightouch entities to Datafold and for Data Explorer to populate. When completed, your Hightouch models and sync will appear in Data Explorer as search results. **TIP** [Tracking Jobs](/integrations/bi-data-apps/tracking-jobs) explains how to find out when your data app integration is ready. --- # Source: https://docs.datafold.com/data-diff/how-datafold-diffs-data.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # How Datafold Diffs Data > Data diffs allow you to perform value-level comparisons between any two datasets within the same database, across different databases, or even between files. The basic inputs required to run a diff are the data connections, names/paths of the datasets to be compared, and the primary key (one or more columns that uniquely identify rows in the datasets). ## What types of data can data diffs compare? Diffs can compare data in tables, views, SQL queries (in relational databases and data lakes), and even files (e.g. CSV, Excel, Parquet, etc.). Datafold facilitates data diffing by supporting a wide range of basic data types across major database systems like Snowflake, Databricks, BigQuery, Redshift, PostgreSQL, and many more. ## Creating data diffs Diffs can be created in several ways: * Interactively through the Datafold app * Programmatically via our [REST API](/api-reference/data-diffs/create-a-data-diff) * As part of a Continuous Integration (CI) workflow for [Deployment Testing](/deployment-testing/how-it-works) ## How in-database diffing works When diffing data within the same physical database or data lake namespace, diffs compare data by executing various SQL queries in the target database. It uses several `JOIN`-type queries and various aggregate queries to provide detailed insights into differences at the row, value, and column levels, and to calculate differences in metrics and distributions. 
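To make the mechanics concrete, here is a minimal, hypothetical sketch of the kind of `JOIN`-type comparison such a diff relies on: join the two datasets on the primary key, then classify each row as present on only one side or as differing in a compared column. This is illustrative only, not the SQL Datafold actually generates; the table and column names (`orders_prod`, `orders_dev`, `order_id`, `amount`) are placeholders.

```python theme={null}
# Illustrative sketch: build a FULL OUTER JOIN query that classifies rows as
# exclusive to one side or as value mismatches. Not Datafold's actual SQL;
# all table and column names are hypothetical placeholders.

def build_diff_query(table_a: str, table_b: str, pk: str, column: str) -> str:
    """Return a SQL string comparing `column` between two tables joined on `pk`."""
    return f"""
    SELECT
        COALESCE(a.{pk}, b.{pk}) AS {pk},
        CASE
            WHEN a.{pk} IS NULL THEN 'only_in_{table_b}'
            WHEN b.{pk} IS NULL THEN 'only_in_{table_a}'
            WHEN a.{column} IS DISTINCT FROM b.{column} THEN 'value_mismatch'
            ELSE 'match'
        END AS diff_status
    FROM {table_a} AS a
    FULL OUTER JOIN {table_b} AS b
        ON a.{pk} = b.{pk}
    """


if __name__ == "__main__":
    print(build_diff_query("orders_prod", "orders_dev", "order_id", "amount"))
```

Aggregating `diff_status` (for example, a `GROUP BY` with counts) then yields the row-level and column-level difference summaries described above.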
## How cross-database diffing works Datasets from both data connections are co-located in a centralized database to execute comparisons and identify specific rows, columns, and values with differences. To perform diffs at massive scale and increased speed, users can apply sampling, filtering, and column selection. --- # Source: https://docs.datafold.com/deployment-testing/how-it-works.md # Source: https://docs.datafold.com/data-explorer/how-it-works.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # How It Works > The UI visually maps workflows and tracks column-level or tabular lineages, helping users understand the impact of upstream changes. Our **Data Explorer** offers a comprehensive overview of your data assets, including [Lineage](/data-explorer/lineage) and [Profiles](/data-explorer/profile). You can filter data assets by Data Connections, Tags, Data Owners, and Asset Types (e.g., tables, columns, and BI-created assets such as views, reports, and syncs). You can also search directly to find specific data assets for lineage analysis. After selecting a table or data asset, the UI will display a **graph of table-level lineage** by default. You can toggle between **Upstream** and **Downstream** perspectives and customize the lineage view by adjusting the **Max Depth** parameter to your preference. --- # Source: https://docs.datafold.com/deployment-testing/configuration/model-specific-ci/including-excluding-columns.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Including/Excluding Columns > Specify columns to include or exclude from the data diff using `include_columns` and `exclude_columns`. ```Bash theme={null} models: - name: users meta: datafold: datadiff: include_columns: - user_id - created_at - name exclude_columns: - full_name ``` --- # Source: https://docs.datafold.com/api-reference/introduction.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Introduction Our REST API allows you to interact with Datafold programmatically. To use it, you'll need an API key. Follow the instructions below to get started. ## Create an API Key Open the Datafold app, visit Settings > Account, and select **Create API Key**. Store your API key somewhere safe. If you lose it, you'll need to generate a new one. Create an API key ## Use your API Key When making requests to the Datafold API, you'll need to include the API key as a header in your HTTP request for authentication. The header should be named `Authorization`, and the value should be in the format: ``` Authorization: Key {API_KEY} ``` For example, if you're using cURL: ```bash theme={null} curl https://app.datafold.com/api/v1/... -H "Authorization: Key {API_KEY}" ``` ## Datafold SDK Rather than hit our REST API endpoints directly, we offer a convenient Python SDK for common development and deployment testing workflows. You can find more information about our SDK [here](/api-reference/datafold-sdk). ## Need help? If you have any questions about how to use our REST API, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com). 
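The same `Authorization` header works from any HTTP client, not just cURL. As a minimal sketch (assuming the `requests` package is installed and `DATAFOLD_API_KEY` is set in your environment), here is how a script might list the data sources visible to the authenticated user via `GET /api/v1/data_sources`:

```python theme={null}
# Minimal sketch of calling the Datafold REST API with the `requests` library.
# Assumes the API key is stored in the DATAFOLD_API_KEY environment variable
# and that your organization uses the default https://app.datafold.com server.
import os

import requests

BASE_URL = "https://app.datafold.com"
headers = {"Authorization": f"Key {os.environ['DATAFOLD_API_KEY']}"}

# List data sources accessible to the authenticated user
# (GET /api/v1/data_sources, documented later in this reference).
response = requests.get(f"{BASE_URL}/api/v1/data_sources", headers=headers)
response.raise_for_status()

for source in response.json():
    print(source["id"], source["name"], source["type"])
```

For common development and deployment testing workflows, consider using the [Datafold SDK](/api-reference/datafold-sdk) mentioned above instead of calling the endpoints directly.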
--- # Source: https://docs.datafold.com/data-explorer/lineage.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Lineage > Datafold offers a column-level and tabular lineage view. ## Column-level lineage Datafold's column-level lineage helps users trace and document the history, transformations, dependencies, and both downstream and upstream processes of a specific data column within an organization's data assets. This feature allows you to pinpoint the origins of data validation issues and comprehensively identify downstream data processes and applications. To view column-level lineage, click on the **Columns** dropdown menu of the selected asset. ### Highlight path between assets To highlight the column path between assets, click the specific column. Reset the view by clicking the **Exit the selected path** button. ## Tabular lineage Datafold also offers a tabular lineage view. You can sort lineage information by depth, asset type, identifier, and owner. Click on the **Actions** button for further options: ### Focus lineage on current node Drill down onto the data node or column of interest. ### Show SQL query Access the SQL query associated with the selected column to understand how the data was queried from the source: ### Show usage details Access detailed information about the column's read, write, and cumulative read (the sum of read count including read count of downstream columns) for the previous 7 days: ## Search and filters Datafold offers powerful search and filtering capabilities to help users quickly locate specific data assets and isolate data connections of interest. In both the graphical and tabular lineage views, you can filter by tables or columns within tables, allowing you to go as granular as needed. ### Table filtering Simply enter the table's name in the search bar to filter and display all relevant information associated with that table. ### Column filtering To focus specifically on columns, you can search using a combination of keywords. For instance, searching "column table" will display columns associated with a table, while a query like "column dim customer" narrows the search to columns within the "dim customer" table. ## Settings You can configure the settings for Lineage under Settings > Data Connections > Advanced Settings: ### Schema indexing schedule Customize the frequency and timing of when to update the indexes on database schemas. The schedule is defined through a cron tab expression. ### Table inclusion/exclusion You can filter to include and/or exclude specific tables to be shown in Lineage. When the inclusion list is set, only the tables specified in this list will be visible in the lineage and search results. When the inclusion list is not set, all tables will be visible by default, except for those explicitly specified in the exclusion list. ### Lineage update schedule Customize the frequency and timing of when to scan the query history of your data warehouse to build and update the data lineage. The schedule is defined through a cron tab expression. ## FAQ Datafold computes column-level lineage by: 1. Ingesting, parsing and analyzing SQL logs from your databases and data warehouses. This allows Datafold to infer dependencies between SQL statements, including those that create, modify, and read data. 2. Augmenting the metadata graph with data from various sources. 
This includes metadata from orchestration tools (e.g., dbt), BI tools, and user-provided documentation. Currently, the schema of the Datafold GraphQL API, which we use to expose lineage information, is not yet stable and is considered to be in beta. Therefore, we do not include this API in our public documentation. If you would like to programmatically access lineage information, you can explore our GitHub repository with a few examples: [datafold/datafold-api-examples](https://github.com/datafold/datafold-api-examples). Simply clone the repository and follow the instructions provided in the `README.md` file. --- # Source: https://docs.datafold.com/api-reference/bi/list-all-integrations.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # List all integrations > Return all integrations for Mode/Tableau/Looker ## OpenAPI ````yaml get /api/v1/lineage/bi/ openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/lineage/bi/: get: tags: - BI summary: List all integrations description: Return all integrations for Mode/Tableau/Looker operationId: get_all_integrations_api_v1_lineage_bi__get responses: '200': content: application/json: schema: {} description: Successful Response components: securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/ci/list-ci-runs.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # List CI runs ## OpenAPI ````yaml get /api/v1/ci/{ci_config_id}/runs openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. 
::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/ci/{ci_config_id}/runs: get: tags: - CI summary: List CI runs operationId: get_ci_api_v1_ci__ci_config_id__runs_get parameters: - in: path name: ci_config_id required: true schema: title: CI config id type: integer - in: query name: pr_sha required: false schema: anyOf: - type: string - type: 'null' title: Pr Sha - in: query name: pr_num required: false schema: anyOf: - type: string - type: 'null' title: Pr Num - in: query name: limit required: false schema: default: 100 title: Limit type: integer - in: query name: offset required: false schema: default: 0 title: Offset type: integer responses: '200': content: application/json: schema: items: $ref: '#/components/schemas/ApiCiRun' title: Response Get Ci Api V1 Ci Ci Config Id Runs Get type: array description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: ApiCiRun: properties: base_branch: title: Base Branch type: string base_sha: title: Base Sha type: string id: title: Id type: integer pr_branch: title: Pr Branch type: string pr_num: title: Pr Num type: string pr_sha: title: Pr Sha type: string source: title: Source type: string status: title: Status type: string required: - id - base_branch - base_sha - pr_branch - pr_sha - pr_num - status - source title: ApiCiRun type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/data-diffs/list-data-diffs.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # List data diffs > All fields support multiple items, using just comma delimiter Date fields also support ranges using the following syntax: - ``DATETIME`` = after DATETIME - ``DATETIME`` = between DATETIME and DATETIME + 1 MINUTE - ``DATE`` = start of that DATE until DATE + 1 DAY - ``DATETIME1<- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. 
::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/datadiffs: get: tags: - Data diffs summary: List data diffs description: |- All fields support multiple items, using just comma delimiter Date fields also support ranges using the following syntax: - ``DATETIME`` = after DATETIME - ``DATETIME`` = between DATETIME and DATETIME + 1 MINUTE - ``DATE`` = start of that DATE until DATE + 1 DAY - ``DATETIME1<' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/data-sources/list-data-source-types.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # List data source types ## OpenAPI ````yaml get /api/v1/data_sources/types openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/data_sources/types: get: tags: - Data sources summary: List data source types operationId: get_data_source_types_api_v1_data_sources_types_get responses: '200': content: application/json: schema: items: $ref: '#/components/schemas/ApiDataSourceType' title: Response Get Data Source Types Api V1 Data Sources Types Get type: array description: Successful Response components: schemas: ApiDataSourceType: properties: configuration_schema: additionalProperties: true title: Configuration Schema type: object features: items: type: string title: Features type: array name: title: Name type: string type: title: Type type: string required: - name - type - configuration_schema - features title: ApiDataSourceType type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/data-sources/list-data-sources.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # List data sources > Retrieves all data sources accessible to the authenticated user. Returns active data sources (not deleted, hidden, or draft) that the user has permission to access. For non-admin users, only data sources belonging to their assigned groups are returned. ## OpenAPI ````yaml get /api/v1/data_sources openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. 
This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/data_sources: get: tags: - Data sources summary: List data sources description: >- Retrieves all data sources accessible to the authenticated user. Returns active data sources (not deleted, hidden, or draft) that the user has permission to access. For non-admin users, only data sources belonging to their assigned groups are returned. operationId: list_data_sources responses: '200': content: application/json: schema: items: discriminator: mapping: athena: '#/components/schemas/ApiDataSourceAwsAthena' aws_s3: '#/components/schemas/ApiDataSourceS3' azure_synapse: '#/components/schemas/ApiDataSourceAzureSynapse' bigquery: '#/components/schemas/ApiDataSourceBigQuery' databricks: '#/components/schemas/ApiDataSourceDatabricks' dremio: '#/components/schemas/ApiDataSourceDremio' duckdb: '#/components/schemas/ApiDataSourceDuckDB' files_azure_datalake: '#/components/schemas/ApiDataSourceAzureDataLake' google_cloud_storage: '#/components/schemas/ApiDataSourceGCS' mariadb: '#/components/schemas/ApiDataSourceMariaDB' microsoft_fabric: '#/components/schemas/ApiDataSourceMicrosoftFabric' mongodb: '#/components/schemas/ApiDataSourceMongoDB' mssql: '#/components/schemas/ApiDataSourceMSSQL' mysql: '#/components/schemas/ApiDataSourceMySQL' netezza: '#/components/schemas/ApiDataSourceNetezza' oracle: '#/components/schemas/ApiDataSourceOracle' pg: '#/components/schemas/ApiDataSourcePostgres' postgres_aurora: '#/components/schemas/ApiDataSourcePostgresAurora' postgres_aws_rds: '#/components/schemas/ApiDataSourcePostgresRds' redshift: '#/components/schemas/ApiDataSourceRedshift' sap_hana: '#/components/schemas/ApiDataSourceSapHana' snowflake: '#/components/schemas/ApiDataSourceSnowflake' starburst: '#/components/schemas/ApiDataSourceStarburst' teradata: '#/components/schemas/ApiDataSourceTeradata' trino: '#/components/schemas/ApiDataSourceTrino' vertica: '#/components/schemas/ApiDataSourceVertica' propertyName: type oneOf: - $ref: '#/components/schemas/ApiDataSourceBigQuery' - $ref: '#/components/schemas/ApiDataSourceDatabricks' - $ref: '#/components/schemas/ApiDataSourceDuckDB' - $ref: '#/components/schemas/ApiDataSourceMongoDB' - $ref: '#/components/schemas/ApiDataSourceMySQL' - $ref: '#/components/schemas/ApiDataSourceMariaDB' - $ref: '#/components/schemas/ApiDataSourceMSSQL' - $ref: '#/components/schemas/ApiDataSourceOracle' - $ref: '#/components/schemas/ApiDataSourcePostgres' - $ref: '#/components/schemas/ApiDataSourcePostgresAurora' - $ref: '#/components/schemas/ApiDataSourcePostgresRds' - $ref: '#/components/schemas/ApiDataSourceRedshift' - $ref: '#/components/schemas/ApiDataSourceTeradata' - $ref: '#/components/schemas/ApiDataSourceSapHana' - $ref: '#/components/schemas/ApiDataSourceAwsAthena' - $ref: '#/components/schemas/ApiDataSourceSnowflake' - $ref: '#/components/schemas/ApiDataSourceDremio' - $ref: '#/components/schemas/ApiDataSourceStarburst' - $ref: '#/components/schemas/ApiDataSourceNetezza' - $ref: '#/components/schemas/ApiDataSourceAzureDataLake' - $ref: '#/components/schemas/ApiDataSourceGCS' - $ref: '#/components/schemas/ApiDataSourceS3' - $ref: '#/components/schemas/ApiDataSourceAzureSynapse' - $ref: '#/components/schemas/ApiDataSourceMicrosoftFabric' - $ref: '#/components/schemas/ApiDataSourceVertica' - $ref: '#/components/schemas/ApiDataSourceTrino' title: Response 
List Data Sources type: array description: Successful Response components: schemas: ApiDataSourceBigQuery: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/BigQueryConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: bigquery title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceBigQuery type: object ApiDataSourceDatabricks: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: 
string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/DatabricksConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: databricks title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceDatabricks type: object ApiDataSourceDuckDB: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/DuckDBConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: duckdb title: Type type: string view_only: anyOf: - 
type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceDuckDB type: object ApiDataSourceMongoDB: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/MongoDBConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: mongodb title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceMongoDB type: object ApiDataSourceMySQL: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' 
lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/MySQLConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: mysql title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceMySQL type: object ApiDataSourceMariaDB: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/MariaDBConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: mariadb title: Type type: string 
view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceMariaDB type: object ApiDataSourceMSSQL: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/MSSQLConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: mssql title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceMSSQL type: object ApiDataSourceOracle: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - 
type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/OracleConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: oracle title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceOracle type: object ApiDataSourcePostgres: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/PostgreSQLConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: pg title: 
Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourcePostgres type: object ApiDataSourcePostgresAurora: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/PostgreSQLAuroraConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: postgres_aurora title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourcePostgresAurora type: object ApiDataSourcePostgresRds: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: 
anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/PostgreSQLAuroraConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: postgres_aws_rds title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourcePostgresRds type: object ApiDataSourceRedshift: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/RedshiftConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source 
temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: redshift title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceRedshift type: object ApiDataSourceTeradata: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/TeradataConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: teradata title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceTeradata type: object ApiDataSourceSapHana: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: 
boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/SapHanaConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: sap_hana title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceSapHana type: object ApiDataSourceAwsAthena: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/AwsAthenaConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - 
type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: athena title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceAwsAthena type: object ApiDataSourceSnowflake: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/SnowflakeConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: snowflake title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceSnowflake type: object ApiDataSourceDremio: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 
'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/DremioConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: dremio title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceDremio type: object ApiDataSourceStarburst: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/StarburstConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 
'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: starburst title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceStarburst type: object ApiDataSourceNetezza: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/NetezzaConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: netezza title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceNetezza type: object ApiDataSourceAzureDataLake: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: 
Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/AzureDataLakeConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: files_azure_datalake title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceAzureDataLake type: object ApiDataSourceGCS: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/GCSConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema 
Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: google_cloud_storage title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceGCS type: object ApiDataSourceS3: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/AWSS3Config' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: aws_s3 title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceS3 type: object ApiDataSourceAzureSynapse: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: 
boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/MSSQLConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: azure_synapse title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceAzureSynapse type: object ApiDataSourceMicrosoftFabric: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/MicrosoftFabricConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule 
schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: microsoft_fabric title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceMicrosoftFabric type: object ApiDataSourceVertica: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/VerticaConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: vertica title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceVertica type: object ApiDataSourceTrino: properties: catalog_exclude_list: anyOf: - type: string - type: 'null' title: Catalog Exclude List catalog_include_list: anyOf: - type: string - type: 'null' title: Catalog Include List created_from: anyOf: - type: string - type: 'null' title: Created From data_retention_days: anyOf: - type: integer - type: 'null' title: Data Retention Days disable_profiling: anyOf: - type: boolean - type: 'null' default: false title: Disable Profiling disable_schema_indexing: anyOf: - type: boolean - type: 'null' default: false title: Disable Schema Indexing float_tolerance: anyOf: - type: number - type: 'null' default: 0 title: Float Tolerance groups: anyOf: - 
additionalProperties: type: boolean type: object - type: 'null' title: Groups hidden: anyOf: - type: boolean - type: 'null' default: false title: Hidden id: anyOf: - type: integer - type: 'null' title: Id is_paused: anyOf: - type: boolean - type: 'null' default: false title: Is Paused last_test: anyOf: - $ref: '#/components/schemas/ApiDataSourceTestStatus' - type: 'null' lineage_schedule: anyOf: - type: string - type: 'null' title: Lineage Schedule max_allowed_connections: anyOf: - type: integer - type: 'null' title: Max Allowed Connections name: title: Name type: string oauth_dwh_active: anyOf: - type: boolean - type: 'null' title: Oauth Dwh Active options: anyOf: - $ref: '#/components/schemas/TrinoConfig' - type: 'null' profile_exclude_list: anyOf: - type: string - type: 'null' title: Profile Exclude List profile_include_list: anyOf: - type: string - type: 'null' title: Profile Include List profile_schedule: anyOf: - type: string - type: 'null' title: Profile Schedule queue_name: anyOf: - type: string - type: 'null' title: Queue Name scheduled_queue_name: anyOf: - type: string - type: 'null' title: Scheduled Queue Name schema_indexing_schedule: anyOf: - type: string - type: 'null' title: Schema Indexing Schedule schema_max_age_s: anyOf: - type: integer - type: 'null' title: Schema Max Age S secret_id: anyOf: - type: integer - type: 'null' title: Secret Id source: anyOf: - type: string - type: 'null' title: Source temp_schema: anyOf: - type: string - type: 'null' title: Temp Schema type: const: trino title: Type type: string view_only: anyOf: - type: boolean - type: 'null' default: false title: View Only required: - name - type title: ApiDataSourceTrino type: object ApiDataSourceTestStatus: properties: results: items: $ref: '#/components/schemas/TestResultStep' title: Results type: array tested_at: format: date-time title: Tested At type: string required: - tested_at - results title: ApiDataSourceTestStatus type: object BigQueryConfig: properties: extraProjectsToIndex: anyOf: - type: string - type: 'null' examples: - |- project1 project2 section: config title: List of extra projects to index (one per line) widget: multiline jsonKeyFile: format: password section: basic title: JSON Key File type: string writeOnly: true jsonOAuthKeyFile: anyOf: - format: password type: string writeOnly: true - type: 'null' section: basic title: JSON OAuth Key File location: default: US examples: - US section: basic title: Processing Location type: string projectId: section: basic title: Project ID type: string totalMBytesProcessedLimit: anyOf: - type: integer - type: 'null' section: config title: Scanned Data Limit (MB) useStandardSql: default: true section: config title: Use Standard SQL type: boolean userDefinedFunctionResourceUri: anyOf: - type: string - type: 'null' examples: - gs://bucket/date_utils.js section: config title: UDF Source URIs required: - projectId - jsonKeyFile title: BigQueryConfig type: object DatabricksConfig: properties: database: anyOf: - type: string - type: 'null' title: Database host: maxLength: 128 title: Host type: string http_password: format: password title: Access Token type: string writeOnly: true http_path: default: '' title: HTTP Path type: string oauth_dwh_client_id: anyOf: - type: string - type: 'null' title: Oauth Dwh Client Id oauth_dwh_client_secret: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Oauth Dwh Client Secret required: - host - http_password title: DatabricksConfig type: object DuckDBConfig: properties: {} title: DuckDBConfig 
type: object MongoDBConfig: properties: auth_source: anyOf: - type: string - type: 'null' default: admin title: Auth Source connect_timeout_ms: default: 60000 title: Connect Timeout Ms type: integer database: title: Database type: string host: maxLength: 128 title: Host type: string password: format: password title: Password type: string writeOnly: true port: default: 27017 title: Port type: integer server_selection_timeout_ms: default: 60000 title: Server Selection Timeout Ms type: integer socket_timeout_ms: default: 300000 title: Socket Timeout Ms type: integer username: title: Username type: string required: - database - username - password - host title: MongoDBConfig type: object MySQLConfig: properties: db: title: Database name type: string host: maxLength: 128 title: Host type: string password: format: password title: Password type: string writeOnly: true port: default: 3306 title: Port type: integer user: title: User type: string required: - host - user - password - db title: MySQLConfig type: object MariaDBConfig: description: |- Configuration for MariaDB connections. MariaDB is MySQL-compatible, so we reuse the MySQL configuration. Default port is 3306, same as MySQL. properties: db: title: Database name type: string host: maxLength: 128 title: Host type: string password: format: password title: Password type: string writeOnly: true port: default: 3306 title: Port type: integer user: title: User type: string required: - host - user - password - db title: MariaDBConfig type: object MSSQLConfig: properties: dbname: anyOf: - type: string - type: 'null' title: Dbname host: maxLength: 128 title: Host type: string password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: default: 1433 title: Port type: integer require_encryption: default: true title: Require Encryption type: boolean session_script: anyOf: - type: string - type: 'null' description: >- The script to execute on connection; e.g. ALTER SESSION SET CONTAINER = ... title: Init script trust_server_certificate: default: false title: Trust Server Certificate type: boolean user: default: DATAFOLD title: User type: string required: - host title: MSSQLConfig type: object OracleConfig: properties: database: anyOf: - type: string - type: 'null' title: Database database_type: anyOf: - enum: - service - sid type: string - type: 'null' title: Database Type ewallet_password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: EWallet password ewallet_pem_file: anyOf: - format: password type: string writeOnly: true - type: 'null' title: EWallet PEM ewallet_pkcs12_file: anyOf: - format: password type: string writeOnly: true - type: 'null' title: EWallet PKCS12 ewallet_type: anyOf: - enum: - x509 - pkcs12 type: string - type: 'null' title: Ewallet Type host: maxLength: 128 title: Host type: string password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: anyOf: - type: integer - type: 'null' title: Port session_script: anyOf: - type: string - type: 'null' description: >- The script to execute on connection; e.g. ALTER SESSION SET CONTAINER = ... title: Init script ssl: default: false title: Ssl type: boolean ssl_server_dn: anyOf: - type: string - type: 'null' description: 'e.g. 
C=US,O=example,CN=db.example.com; default: CN=' title: Server's SSL DN user: default: DATAFOLD title: User type: string required: - host title: OracleConfig type: object PostgreSQLConfig: properties: dbname: title: Database Name type: string host: maxLength: 128 title: Host type: string password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: default: 5432 title: Port type: integer role: anyOf: - type: string - type: 'null' title: Role (case sensitive) rootcert: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Root certificate sslmode: $ref: '#/components/schemas/SslMode' default: prefer title: SSL Mode user: title: User type: string required: - host - user - dbname title: PostgreSQLConfig type: object PostgreSQLAuroraConfig: properties: aws_access_key_id: anyOf: - type: string - type: 'null' title: AWS Access Key aws_cloudwatch_log_group: anyOf: - type: string - type: 'null' title: Cloudwatch Postgres Log Group aws_region: anyOf: - type: string - type: 'null' title: AWS Region aws_secret_access_key: anyOf: - format: password type: string writeOnly: true - type: 'null' title: AWS Secret dbname: title: Database Name type: string host: maxLength: 128 title: Host type: string keep_alive: anyOf: - type: integer - type: 'null' title: Keep Alive timeout in seconds, leave empty to disable password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: default: 5432 title: Port type: integer role: anyOf: - type: string - type: 'null' title: Role (case sensitive) rootcert: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Root certificate sslmode: $ref: '#/components/schemas/SslMode' default: prefer title: SSL Mode user: title: User type: string required: - host - user - dbname title: PostgreSQLAuroraConfig type: object RedshiftConfig: properties: adhoc_query_group: default: default section: config title: Query Group for Adhoc Queries type: string dbname: title: Database Name type: string host: maxLength: 128 title: Host type: string password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: default: 5432 title: Port type: integer role: anyOf: - type: string - type: 'null' title: Role (case sensitive) rootcert: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Root certificate scheduled_query_group: default: default section: config title: Query Group for Scheduled Queries type: string sslmode: $ref: '#/components/schemas/SslMode' default: prefer title: SSL Mode user: title: User type: string required: - host - user - dbname title: RedshiftConfig type: object TeradataConfig: properties: database: title: Database type: string host: maxLength: 128 title: Host type: string password: format: password title: Password type: string writeOnly: true port: anyOf: - type: integer - type: 'null' title: Port user: default: DATAFOLD title: User type: string required: - host - password - database title: TeradataConfig type: object SapHanaConfig: properties: host: maxLength: 128 title: Host type: string password: format: password title: Password type: string writeOnly: true port: default: 443 title: Port type: integer user: default: DATAFOLD title: User type: string required: - host - password title: SapHanaConfig type: object AwsAthenaConfig: properties: aws_access_key_id: title: Aws Access Key Id type: string aws_secret_access_key: format: password title: Aws Secret Access Key type: string writeOnly: true 
catalog: default: awsdatacatalog title: Catalog type: string database: default: default title: Database type: string region: title: Region type: string s3_staging_dir: format: uri minLength: 1 title: S3 Staging Dir type: string required: - aws_access_key_id - aws_secret_access_key - s3_staging_dir - region title: AwsAthenaConfig type: object SnowflakeConfig: properties: account: maxLength: 128 title: Account type: string authMethod: anyOf: - enum: - password - keypair type: string - type: 'null' title: Authmethod data_source_id: anyOf: - type: integer - type: 'null' title: Data Source Id default_db: default: '' examples: - MY_DB title: Default DB (case sensitive) type: string default_schema: default: PUBLIC examples: - PUBLIC section: config title: Default schema (case sensitive) type: string keyPairFile: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Key Pair file (private-key) metadata_database: default: SNOWFLAKE examples: - SNOWFLAKE section: config title: Database containing metadata (usually SNOWFLAKE) type: string oauth_dwh_client_id: anyOf: - type: string - type: 'null' title: Oauth Dwh Client Id oauth_dwh_client_secret: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Oauth Dwh Client Secret password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: anyOf: - type: integer - type: 'null' default: 443 title: Port region: anyOf: - type: string - type: 'null' section: config title: Region role: default: '' examples: - PUBLIC title: Role (case sensitive) type: string sql_variables: anyOf: - type: string - type: 'null' examples: - |- variable_1=10 variable_2=test section: config title: Session variables applied at every connection. widget: multiline user: default: DATAFOLD title: User type: string user_id: anyOf: - type: integer - type: 'null' title: User Id warehouse: default: '' examples: - COMPUTE_WH title: Warehouse (case sensitive) type: string required: - account title: SnowflakeConfig type: object DremioConfig: properties: certcheck: anyOf: - $ref: '#/components/schemas/CertCheck' - type: 'null' default: dremio-cloud title: Certificate check customcert: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Custom certificate host: maxLength: 128 title: Host type: string password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: default: 443 title: Port type: integer project_id: anyOf: - type: string - type: 'null' title: Project id role: anyOf: - type: string - type: 'null' title: Role (case sensitive) tls: default: false title: Encryption type: boolean token: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Token username: anyOf: - type: string - type: 'null' title: User ID (optional) view_temp_schema: anyOf: - type: string - type: 'null' title: Temporary schema for views required: - host title: DremioConfig type: object StarburstConfig: properties: host: maxLength: 128 title: Host type: string password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: default: 443 title: Port type: integer tls: default: true title: Encryption type: boolean token: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Token username: anyOf: - type: string - type: 'null' title: User ID (optional) required: - host title: StarburstConfig type: object NetezzaConfig: properties: database: maxLength: 128 title: Database type: string host: 
maxLength: 128 title: Host type: string password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: default: 5480 title: Port type: integer tls: default: true title: Encryption type: boolean username: anyOf: - type: string - type: 'null' title: User ID (optional) required: - host - database title: NetezzaConfig type: object AzureDataLakeConfig: properties: account_name: title: Account Name type: string client_id: anyOf: - maxLength: 1024 type: string - type: 'null' title: Client Id client_secret: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Client Secret materialize_max_rows: anyOf: - type: integer - type: 'null' title: Materialize Max Rows materialize_path: anyOf: - type: string - type: 'null' title: Materialize Path tenant_id: anyOf: - maxLength: 1024 type: string - type: 'null' title: Tenant Id required: - account_name - tenant_id - client_id title: AzureDataLakeConfig type: object GCSConfig: properties: bucket_name: title: Bucket Name type: string bucket_region: title: Bucket Region type: string jsonKeyFile: format: password section: basic title: JSON Key File type: string writeOnly: true materialize_max_rows: anyOf: - type: integer - type: 'null' title: Materialize Max Rows materialize_path: anyOf: - type: string - type: 'null' title: Materialize Path required: - bucket_name - jsonKeyFile - bucket_region title: GCSConfig type: object AWSS3Config: properties: bucket_name: title: Bucket Name type: string key_id: anyOf: - maxLength: 1024 type: string - type: 'null' title: Key Id materialize_max_rows: anyOf: - type: integer - type: 'null' title: Materialize Max Rows materialize_path: anyOf: - type: string - type: 'null' title: Materialize Path region: title: Region type: string secret: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Secret required: - bucket_name - key_id - region title: AWSS3Config type: object MicrosoftFabricConfig: properties: client_id: description: Microsoft Entra ID Application (Client) ID title: Application (Client) ID type: string client_secret: description: Microsoft Entra ID Application Client Secret format: password title: Client Secret type: string writeOnly: true dbname: title: Dbname type: string host: maxLength: 128 title: Host type: string session_script: anyOf: - type: string - type: 'null' description: >- The script to execute on connection; e.g. ALTER SESSION SET CONTAINER = ... 
title: Init script tenant_id: description: Microsoft Entra ID Tenant ID title: Tenant ID type: string required: - host - dbname - tenant_id - client_id - client_secret title: MicrosoftFabricConfig type: object VerticaConfig: properties: dbname: title: Database Name type: string host: maxLength: 128 title: Host type: string password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: default: 5433 title: Port type: integer role: anyOf: - type: string - type: 'null' title: Role (case sensitive) sslmode: $ref: '#/components/schemas/SslMode' default: prefer title: SSL Mode user: title: User type: string required: - host - user - dbname title: VerticaConfig type: object TrinoConfig: properties: dbname: title: Catalog Name type: string hive_timestamp_precision: anyOf: - enum: - 3 - 6 - 9 type: integer - type: 'null' description: 'Optional: Timestamp precision if using Hive connector' title: Hive Timestamp Precision host: maxLength: 128 title: Host type: string password: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Password port: default: 8080 title: Port type: integer ssl_verification: $ref: '#/components/schemas/SSLVerification' default: full title: SSL Verification tls: default: true title: Encryption type: boolean user: title: User type: string required: - host - user - dbname title: TrinoConfig type: object TestResultStep: properties: result: anyOf: - {} - type: 'null' title: Result status: $ref: '#/components/schemas/JobStatus' step: $ref: '#/components/schemas/ConfigurationCheckStep' required: - step - status title: TestResultStep type: object SslMode: description: >- SSL mode for database connections (used by PostgreSQL, Vertica, Redshift, etc.) enum: - prefer - require - verify-ca - verify-full title: SslMode type: string CertCheck: enum: - disable - dremio-cloud - customcert title: CertCheck type: string SSLVerification: enum: - full - none - ca title: SSLVerification type: string JobStatus: enum: - needs_confirmation - needs_authentication - waiting - processing - done - failed - cancelled title: JobStatus type: string ConfigurationCheckStep: enum: - connection - temp_schema - schema_download - lineage_download title: ConfigurationCheckStep type: string securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/monitors/list-monitor-runs.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # List Monitor Runs ## OpenAPI ````yaml openapi-public.json get /api/v1/monitors/{id}/runs openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. 
::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/monitors/{id}/runs: get: tags: - Monitors summary: List Monitor Runs operationId: list_monitor_runs_api_v1_monitors__id__runs_get parameters: - description: The unique identifier of the monitor. in: path name: id required: true schema: description: The unique identifier of the monitor. title: Id type: integer - description: The page number to retrieve. in: query name: page required: false schema: default: 1 description: The page number to retrieve. title: Page type: integer - description: The number of items to retrieve per page. in: query name: page_size required: false schema: default: 10 description: The number of items to retrieve per page. title: Page Size type: integer - description: Include runs with a timestamp >= this value. in: query name: start_time required: false schema: anyOf: - format: date-time type: string - type: 'null' description: Include runs with a timestamp >= this value. title: Start Time - description: Include runs with a timestamp <= this value. in: query name: end_time required: false schema: anyOf: - format: date-time type: string - type: 'null' description: Include runs with a timestamp <= this value. title: End Time responses: '200': content: application/json: schema: $ref: '#/components/schemas/ApiPublicListMonitorRunsOut' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: ApiPublicListMonitorRunsOut: properties: count: description: Total number of monitor runs. title: Count type: integer page: description: Current page number in the paginated result. title: Page type: integer page_size: description: Number of runs per page. title: Page Size type: integer runs: description: List of monitor runs. items: $ref: '#/components/schemas/ApiPublicMonitorRunOut' title: Runs type: array total_pages: description: Total number of pages available. title: Total Pages type: integer required: - count - runs - page - page_size - total_pages title: ApiPublicListMonitorRunsOut type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ApiPublicMonitorRunOut: properties: diff_id: anyOf: - type: integer - type: 'null' description: Unique identifier for the associated datadiff. title: Diff Id monitor_id: description: Unique identifier for the associated monitor. title: Monitor Id type: integer run_id: description: Unique identifier for the monitor run. title: Run Id type: integer started_at: anyOf: - format: date-time type: string - type: 'null' description: Timestamp when the monitor run started. title: Started At state: anyOf: - $ref: '#/components/schemas/MonitorRunState' - type: 'null' description: Current state of the monitor run. 
required: - run_id - monitor_id title: ApiPublicMonitorRunOut type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object MonitorRunState: enum: - ok - alert - error - learning - checking - created - skipped - cancelled title: MonitorRunState type: string securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/monitors/list-monitors.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # List Monitors ## OpenAPI ````yaml openapi-public.json get /api/v1/monitors openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/monitors: get: tags: - Monitors summary: List Monitors operationId: list_monitors_api_v1_monitors_get parameters: - description: The page number to retrieve. in: query name: page required: false schema: default: 1 description: The page number to retrieve. title: Page type: integer - description: The number of items to retrieve per page. in: query name: page_size required: false schema: default: 20 description: The number of items to retrieve per page. title: Page Size type: integer - description: Field to order the monitors by. in: query name: order_by required: false schema: anyOf: - $ref: '#/components/schemas/SortableFields' - type: 'null' description: Field to order the monitors by. title: Order By - description: Specify the order direction for the monitors. in: query name: sort_order required: false schema: default: desc description: Specify the order direction for the monitors. enum: - asc - desc title: Sort Order type: string - description: Comma-separated tags to filter monitors by. in: query name: tags required: false schema: description: Comma-separated tags to filter monitors by. title: Tags type: string responses: '200': content: application/json: schema: $ref: '#/components/schemas/ApiPublicListMonitorsOut' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: SortableFields: enum: - id - name - last_triggered - last_run - created_by_id title: SortableFields type: string ApiPublicListMonitorsOut: properties: count: description: Total number of monitors. title: Count type: integer monitors: description: List of monitor details. items: $ref: '#/components/schemas/ApiPublicGetMonitorOut' title: Monitors type: array page: description: Current page number in the paginated result. title: Page type: integer page_size: description: Number of monitors per page. 
title: Page Size type: integer total_pages: description: Total number of pages available. title: Total Pages type: integer required: - count - monitors - page - page_size - total_pages title: ApiPublicListMonitorsOut type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ApiPublicGetMonitorOut: properties: created_at: description: Timestamp when the monitor was created. format: date-time title: Created At type: string enabled: description: Indicates whether the monitor is enabled. title: Enabled type: boolean id: description: Unique identifier for the monitor. title: Id type: integer last_alert: anyOf: - format: date-time type: string - type: 'null' description: Timestamp of the last alert. title: Last Alert last_run: anyOf: - format: date-time type: string - type: 'null' description: Timestamp of the last monitor run. title: Last Run modified_at: description: Timestamp when the monitor was last modified. format: date-time title: Modified At type: string monitor_type: anyOf: - enum: - diff - metric - schema - test type: string - type: 'null' description: Type of the monitor. title: Monitor Type name: anyOf: - type: string - type: 'null' description: Name of the monitor. title: Name state: anyOf: - $ref: '#/components/schemas/MonitorRunState' - type: 'null' description: Current state of the monitor run. tags: anyOf: - items: type: string type: array - type: 'null' description: Tags associated with the monitor. title: Tags required: - id - name - monitor_type - created_at - modified_at - enabled title: ApiPublicGetMonitorOut type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object MonitorRunState: enum: - ok - alert - error - learning - checking - created - skipped - cancelled title: MonitorRunState type: string securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/integrations/bi-data-apps/looker.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Looker ## Create a code repositories integration [Create a code repositories integration](/integrations/code-repositories) that connects Datafold to your Looker repository. ## Create a Looker integration Navigate to Settings > Integrations > Data Apps and add a Looker integration. Add New Integration Looker Integration Form Complete the configuration by specifying the following fields: | Field Name | Description | | ----------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ | | Integration name | An identifier used in Datafold to identify this Data App configuration. 
| | Project Repository | Select the same repository as used in your Looker project. | | API Host URL | The Looker [API Host URL](https://cloud.google.com/looker/docs/admin-panel-platform-api#api%5Fhost%5Furl). It has the following format: https\://\.cloud.looker.com:\. The port defaults are 19999 (legacy) and 443 (new), see the [Looker Docs](https://cloud.google.com/looker/docs/api-getting-started#looker%5Fapi%5Fpath%5Fand%5Fport) for hints. Examples: Legacy ([https://datafold.cloud.looker.com:19999](https://datafold.cloud.looker.com:19999)), New ([https://datafold.cloud.looker.com:443](https://datafold.cloud.looker.com:443)) | | Client ID | Follow [these steps](https://cloud.google.com/looker/docs/api-auth#authentication%5Fwith%5Fan%5Fsdk) to generate Client ID and Client Secret. These are always user specific. We recommend using a group email for continuity. See [Looker User Minimum Access Policy](/integrations/bi-data-apps/looker#looker-user-minimum-access-policy) for the required permissions. | | Client Secret | See Client ID. | | Data connection mapping | When the correct credentials are entered we will begin to populate data connections in Looker (on the left side) that will need to be mapped to data connections configured in Datafold (on the right side). See image below. | Looker Configuration When completed, click **Submit**. It may take some time to sync all the Looker entities to Datafold and for Data Explorer to populate. When completed, your Looker assets will appear in Data Explorer as search results. Looker Sync Results **TIP** [Tracking Jobs](/integrations/bi-data-apps/tracking-jobs) explains how to find out when your data app integration is ready. ## Looker user: minimum access policy The user linked to the API credentials needs the predefined Developer role, or you can create a custom role with these permissions: * `access_data` * `download_without_limit` * `explore` * `login_special_email` * `manage_spaces` * `see_drill_overlay` * `see_lookml` * `see_lookml_dashboards` * `see_looks` * `see_pdts` * `see_sql` * `see_user_dashboards` * `send_to_integration` ## Database/schema connection context ### Database specification Using the Fully Qualified Names in your Looker view files is not always possible. If a view references a table as`my_schema.my_table`, Datafold might have difficulty finding which database this table actually is in. There are multiple ways to guide Datafold to make a correct choice, as summarized in the table below. **INFO** Priority #1 takes precedence over #2, and so forth. 

| # | Source, if defined | Example |
| - | ------------------ | ------- |
| 1 | datafold\_force\_database **User Attribute** in Looker | looker\_db |
| 2 | **Fully Qualified Names** in your Looker view files | my\_db.my\_schema.my\_table |
| 3 | datafold\_default\_database **User Attribute** in Looker | another\_looker\_db |
| 4 | **Database** specified in Looker, at Database connection settings *(We can only read these if Datafold connects to Looker via an admin user, which is probably suboptimal.)* | my\_db |
| 5 | **Database** specified in Datafold, at [Database Connection settings](/integrations/databases/) | my\_db |

### Supported custom Looker user attributes

| User Attribute | Impact |
| -------------- | ------ |
| datafold\_force\_database | Database to use in all cases, even if a fully qualified path in LookML refers to another database. |
| datafold\_default\_database | Database to use if the Looker view does not explicitly specify a database. |
| datafold\_default\_schema | Schema to use if the Looker view does not explicitly specify a schema (which equals a dataset for BigQuery). |
| datafold\_default\_host | *(BigQuery only)* Default project name. |

**INFO** Make sure attributes are:

* Explicitly defined for the user in question (not just falling back to a default);
* Not marked as hidden.

## Integration limitations

Datafold lets you connect to Looker and extend our capabilities to your Looker Views, Explores, Looks, and Dashboards. But this is a new feature, so there are some things we don't support yet:

* **PDT/Derived Tables**: Datafold only works with the tables that come from your data connections, not with the [tables](https://cloud.google.com/looker/docs/derived-tables#important%5Fconsiderations%5Ffor%5Fimplementing%5Fpersisted%5Ftables) that Looker builds from your SQL queries.
* **Merge Queries**: Datafold supports the Queries and Looks that make up your Dashboards, but [Merge Queries](https://cloud.google.com/looker/docs/merged-results) are not one of them. For some use cases you can achieve the same result by joining the underlying views with an explore.
* **Usage metrics and popularity**: Datafold shows your Looker objects, such as dashboards, looks, and fields, but not how heavily they are used.

We are improving our Looker integration and adding more features soon. We welcome your feedback and suggestions.

---

# Source: https://docs.datafold.com/integrations/databases/mariadb.md

# MariaDB

**INFO** Column-level Lineage is not currently supported for MariaDB.

**Steps to complete:**

1. [Run SQL script for permissions and create schema for Datafold](/integrations/databases/mariadb#run-sql-script-and-create-schema-for-datafold)
2. [Configure your data connection in Datafold](/integrations/databases/mariadb#configure-in-datafold)

### Run SQL script and create schema for Datafold

To connect to MariaDB, create a user with read-only access to all tables you wish to diff.
Include read and write access to a Datafold-specific dataset:

```sql theme={null}
-- Create a temporary dataset for Datafold to utilize
CREATE DATABASE IF NOT EXISTS datafold_tmp;

-- Create a Datafold user
CREATE USER 'datafold_user'@'%' IDENTIFIED BY 'SOMESECUREPASSWORD';

-- Grant read access to diff tables in YourSchema
GRANT SELECT ON `YourSchema`.* TO 'datafold_user'@'%';

-- Grant access to all tables in a datafold_tmp database
GRANT ALL ON `datafold_tmp`.* TO 'datafold_user'@'%';

-- Apply the changes
FLUSH PRIVILEGES;
```

Datafold utilizes a temporary dataset, named `datafold_tmp` in the above script, to materialize scratch work and keep data processing in your warehouse.

### Configure in Datafold

| Field Name | Description |
| ---------- | ----------- |
| Connection name | A name given to the data connection within Datafold |
| Host | The hostname for your MariaDB instance |
| Port | MariaDB connection port; default value is 3306 |
| Username | The user created in our SQL script, named datafold\_user |
| Password | The password created in our SQL script |
| Database | The name of the MariaDB database (schema) you want to connect to, e.g. YourSchema |
| Dataset for temporary tables | The datafold\_tmp database created in our SQL script |

Click **Create**. Your data connection is ready!

---

# Source: https://docs.datafold.com/data-monitoring/monitors/metric-monitors.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.datafold.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Metric Monitors

> Metric monitors detect anomalies in your data using ML-based algorithms or manual thresholds, supporting standard and custom metrics for tables or columns.

**INFO** Please contact [support@datafold.com](mailto:support@datafold.com) if you'd like to enable this feature for your organization.

Metric monitors allow you to perform anomaly detection—either automatically using our ML-based algorithm or by setting manual thresholds—on the following metric types:

1. Standard metrics (e.g. row count, freshness, and cardinality)
2. Custom metrics (e.g. sales volume per region)

## Create a Metric monitor

There are two ways to create a Metric Monitor:

1. Open the **Monitors** page, select **Create new monitor**, and then choose **Metric**.
2. Clone an existing Metric monitor by clicking **Actions** and then **Clone**. This will pre-fill the form with the existing monitor configuration.

## Set up your monitor

Select your data connection, then choose the type of metric you'd like: **Table**, **Column**, or **Custom**.

If you select table or column, you have the option to add a SQL filter to refine your dataset. For example, you could implement a 7-day rolling time window with the following: `timestamp >= dateadd(day, -7, current_timestamp)`. Please ensure the SQL is compatible with your selected data connection.

## Metric types

### Table metrics

| Metric | Definition | Additional Notes |
| ------ | ---------- | ---------------- |
| Freshness | Time since table was last updated | Measured in minutes. Derived from INFORMATION\_SCHEMA. Only supported for Snowflake, BigQuery, and Databricks. |
| Row Count | Total number of rows | |

### Column metrics

| Metric | Definition | Supported Column Types | Additional Notes |
| ------ | ---------- | ---------------------- | ---------------- |
| Cardinality | Number of distinct values | All types | |
| Uniqueness | Proportion of distinct values | All types | Proportion between 0 and 1 |
| Minimum | Lowest numeric value | Numeric columns | |
| Maximum | Highest numeric value | Numeric columns | |
| Average | Mean value | Numeric columns | |
| Median | Median value (50th percentile) | Numeric columns | |
| Sum | Sum of all values | Numeric columns | |
| Standard Deviation | Measure of data spread | Numeric columns | |
| Fill Rate | Proportion of non-null values | All types | Proportion between 0 and 1 |

### Custom metrics

Our custom metric framework is extremely flexible and supports several approaches to defining metrics. Depending on the approach you choose, your query should return some combination of the following columns:

* **Metric value (required)**: a numeric column containing your *metric values*
* **Timestamp (optional)**: a date/time column containing *timestamps* corresponding to your metric values
* **Group (optional)**: a string column containing *groups/dimensions* for your metric

**INFO** The names and order of your columns don't matter. Datafold will automatically infer their meaning based on data type.

The following questions will help you decide which approach is best for you:

1. **Do you want to group your metric by the value of a column in your query?** For example, if your metric is *sales volume per day*, rather than looking at a single metric that encompasses all sales globally, it might be more informative to group by country. In this case, Datafold will automatically compute sales volume separately for each country to assist with root cause analysis when there's an unexpected change.

2. **Will your query return a single metric value (per group, if relevant) on every monitor run, or an entire time series?** We generally recommend starting with the simpler approach of providing a single metric value (per group) per monitor run. However, if you've already defined a time series elsewhere (e.g. in your BI tool) and simply want to copy/paste that query into Datafold, then you may prefer the latter approach.

**INFO** Datafold will only log a single data point per timestamp per group, which means you should only send data for a particular time period once that period is complete.

3. **If your metric returns a single value per monitor run, will you provide your own timestamps or use the timestamps of monitor runs?** If your query returns a single value per run, we generally recommend letting Datafold provide timestamps based on monitor runs unless you have a compelling reason to provide your own. For example, if your metric always lags by one day, you could explicitly associate yesterday's date with each observation.

As you're writing your query, Datafold will let you know if the result set doesn't match one of the accepted patterns. If you have questions, please contact us and we'll be happy to help.

## Configure anomaly detection

Enable anomaly detection to get the most out of metric monitors. You have several options:

* **Automatic**: our automated anomaly detection uses machine learning to flag metric values that are out of the ordinary. Dial the sensitivity up or down depending on how many alerts you'd like to receive.
* **Manual**: specific thresholds beyond which you'd like the monitor to trigger an alert. **Fixed Values** are specific minimum and/or maximum values, while **Percent Change** measures the magnitude of change from one observation to the next.

## Add a schedule

You can choose to run your monitor daily, hourly, or even input a cron expression for more complex scheduling.

## Add notifications

Send notifications via Slack or email when your monitor exceeds a threshold (automatic or manual).

## Need help?

If you have any questions about how to use Metric monitors, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com).

---

# Source: https://docs.datafold.com/security/single-sign-on/saml/examples/microsoft-entra-id-configuration.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.datafold.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Microsoft Entra ID

## Azure AD / Entra ID as a SAML Identity Provider

You can create an **Enterprise Application** and use that to configure access to Datafold.

Click on **New application** and **Create your own application**. **Copy** the **App Federation Metadata Url**.

Go to `Datafold` and create a new SSO integration. Navigate to **Settings** → **Integrations** → **Add new Integration** → **SAML**. Paste the **copied** URL into **Identity Provider Metadata URL**.

Go to `Azure` and edit the **Basic SAML Configuration** in your Enterprise App. Copy from Datafold the read-only field **Service Provider ACS URL** and paste it into **Reply URL**. Copy from Datafold the read-only field **Service Provider Entity ID** and paste it into **Identifier**.

Go to `Datafold` and click **Save** to create the SAML integration.

Next, edit the **Attributes & Claims**. By default, the **Unique User Identifier** is already correctly set to `user.userprincipalname`. If you have multiple domains (e.g., `@datafold.com` and `@datafoldonmicrosoft.com`), please make sure this maps correctly to the email addresses of the users in Datafold. (Optional step) Add two attributes: `first_name` and `last_name`.

Finally, edit the **SAML Certificates**. Set the signing option to **Sign SAML response and assertion**.

After making sure you are added as a user to the Enterprise Application, log out from Datafold. Click on **Test** under **Test single sign-on with DatafoldSSO**.

## Synchronize users with Datafold \[Optional]

This step is essential if you want to ensure that users from your organization are disabled if they are no longer assigned to the configured Microsoft Entra App.

1. Navigate to App registrations → API permissions.
2. Add the following permissions: `Group.Read.All` and `User.ReadBasic.All`.
   2.1 Click `Add a permission`.
   2.2 Select Microsoft Graph.
   2.3 Select application permissions and add the required permissions.
3. Grant admin consent.
4. You should now see a checkmark next to the permissions.
5. Generate a secret so that Datafold can interact with the API.
   5.1 Click `Certificates & secrets`.
   5.2 Click `New client secret`.
   5.3 Type in a description and click `Add`.
6. Go to `Datafold` and navigate to **Settings** → **Integrations** → **SSO** → **Add new Integration** and select the Microsoft Entra ID logo.
7. Paste in the four required fields:
7.1 Tenant ID - [you can find this on the overview page](https://learn.microsoft.com/en-us/entra/fundamentals/how-to-find-tenant)
7.2 Navigate to the application overview
7.3 Copy the Application ID and paste it into Client Id
7.4 Copy the secret we created in the previous steps and paste it into Client Secret
7.5 Navigate to the enterprise application, copy the Object ID, and paste it into Principal Id.
7.6 Click **Save** to create the integration.
If the update is successful, it means that the integration is valid. Users that do not have access to the configured application will be disabled and logged out in at most one hour. --- # Source: https://docs.datafold.com/integrations/notifications/microsoft-teams.md # Microsoft Teams > Receive notifications for monitors in Microsoft Teams. ## Prerequisites * Microsoft Teams admin access or permissions to manage integrations * A Datafold account with admin privileges ## Configure the Integration 1. In Datafold, go to Settings > Integrations > Notifications 2. Click "Add New Integration" 3. Select "Microsoft Teams" 4. You'll be automatically redirected to the Microsoft Office login page 5. Sign in using the Microsoft Office account with admin privileges 6. Click "Accept" to grant Datafold the necessary permissions 7. You'll be redirected back to Datafold 8. Open the Teams app in a separate browser tab 9. Next to the channel where you'd like to receive notifications, click "..." and select "Workflows" 10. Select the template called "Post to a channel when a webhook request is received" 11. Advance through the wizard (the defaults should be fine) 12. At the end of the wizard, copy the webhook URL 13. Return to Datafold and click "Add channel configuration" 14. Select the relevant Team and Channel, then paste the webhook URL 15. Repeat steps 8-14 for as many channels as you'd like 16. Save the integration settings in Datafold You're all set! When you configure a monitor in Datafold, you'll now have the option to send notifications to the Teams channel(s) you configured. ## Monitors as Code Configuration If you're using [monitors as code](/data-monitoring/monitors-as-code), you can configure Teams notifications by adding a `notifications` section to your monitor definition as follows: ```yaml theme={null} monitors: : ... notifications: - type: teams integration: channel: : mentions: - - ... ``` * `` can be found in Datafold -> Settings -> Integrations -> Notifications -> \ #### Full example ```yaml theme={null} monitors: uniqueness_test_example: type: test enabled: true connection_id: 1123 test: type: unique tables: - path: DEV.DATA_DEV.USERS columns: - USERNAME schedule: interval: every: hour notifications: - type: teams integration: 23 channel: Dev Team:Notifications Channel mentions: - NotifyDevCustomTag - Dima Cherenkov ``` ## Need help? If you have any questions about integrating with Microsoft Teams, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com). --- # Source: https://docs.datafold.com/integrations/bi-data-apps/mode.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Mode ## Obtain credentials from Mode **INFO** To complete this integration, your **Mode** account must be a part of a [Mode Business Workspace](https://mode.com/compare-plans) in order to generate an API Token. **INFO** You need to have **Admin** privileges in your Mode Workspace to be able to create an API Token. In **Mode**, navigate to **Workspace Settings** → **Privacy & Security** → **API**. Click the icon, and choose **Create new token**. Take note of: * Token Name, * Token Password, * And the URL of the page that lists the tokens. 
It should look like this: [https://app.mode.com/organizations/\{workspace}/api\_keys](https://app.mode.com/organizations/\{workspace}/api_keys) Take note of `{workspace}` part, we will need it when configuring Datafold. ## Configure Datafold Navigate to **Settings** → **Integrations** → **BI & Data Apps**. Choose **Mode** Integration to add. This will bring up **Mode** integration parameters. Complete the configuration by specifying the following fields: | Field Name | Description | | ---------------- | ----------------------------------------------------------------------- | | Integration name | An identifier used in Datafold to identify this Data App configuration. | | Token | API token, as generated above. | | Password | API token password, as generated above. | | Workspace | Workspace name obtained from your workspace URL. | **INFO** **Workspace Name** field is not marked as required on this screen. That's for backwards compatibility: the legacy type of Mode API token, known as **Personal Token**, does not require that parameter. However, such tokens can no longer be created, so we're no longer providing instructions for them. When completed, click **Save**. Datafold will try to connect to Mode and, if any issues with the connection arise you will be alerted. Datafold will start to sync your reports. It can take some time to fetch all the reports, depending on how many of them there are. **TIP** [Tracking Jobs](/integrations/bi-data-apps/tracking-jobs) explains how to find out when your data app integration is ready. Now that Mode sync has completed — you can browse your Mode reports! --- # Source: https://docs.datafold.com/integrations/databases/mongodb.md # MongoDB > Our MongoDB integration allows you to diff data within MongoDB, or between MongoDB and a relational database (or even a file!). Our MongoDB integration is still in beta. Some features, such as column-level lineage, are not yet supported. Please contact us if you need assistance. **Steps to complete:** 1. [Configure user in MongoDB](#configure-user-in-mongodb) 2. [Configure your data connection in Datafold](#configure-in-datafold) 3. [Diff your data](#diff-your-data) ## Configure user in MongoDB To connect to MongoDB, create a user with read-only access to all databases you plan to diff. ## Configure in Datafold | Field Name | Description | | ----------------------- | ---------------------------------------------------------------- | | Connection Name | The name you'd like to assign to this connection in Datafold | | Host | The hostname for your MongoDB instance | | Port | MongoDB endpoint port (default value is 27017) | | User ID | User ID (e.g. `DATAFOLD`) | | Password | Password for the user provided above | | Database | Database to connect to | | Authentication Database | Database name associated with the user credentials (e.g. `main`) | Click **Create**. Your data connection is now ready! ## Diff your data Write your MongoDB query MongoDB works a bit differently from our other integrations. Under the hood, we flatten your collections into datasets you can query with SQL. Here's how to diff your MongoDB data: 1. Create a new data diff 2. Select your MongoDB data connection 3. Select `Query` diff (`Table` diffs aren't supported at this time) 4. Write a SQL query against the flattened dataset, including a `PRAGMA` statement with the collection name on the first line. 
Here's an example:

```sql theme={null}
PRAGMA mongodb_collections('tracks_v1_1m');
SELECT point_id,
       device_id,
       timestamp,
       location.longitude as longitude,
       location.latitude as latitude
FROM mongo_tracks_v1_1m
WHERE point_id < 100000;
```

5. Configure the rest of your diff and run it!

---

# Source: https://docs.datafold.com/data-monitoring/monitor-types.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.datafold.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Monitor Types

> Monitoring your data for unexpected changes is one of the cornerstones of data observability.

Datafold supports all your monitoring needs through a variety of different monitor types:

1. [**Data Diff**](/data-monitoring/monitors/data-diff-monitors) → Detect differences between any two datasets, within or across databases
2. [**Metric**](/data-monitoring/monitors/metric-monitors) → Identify anomalies in standard metrics like row count, freshness, and cardinality, or in any custom metric
3. [**Data Test**](/data-monitoring/monitors/data-test-monitors) → Validate your data with business rules and see specific records that fail your tests
4. [**Schema Change**](/data-monitoring/monitors/schema-change-monitors) → Receive alerts when a table schema changes

If you need help creating your first few monitors, deciding which type of monitor to use in a particular situation, or developing an overall monitoring strategy, please reach out via email ([support@datafold.com](mailto:support@datafold.com)) and our team of experts will be happy to assist.

---

# Source: https://docs.datafold.com/data-monitoring/monitors-as-code.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.datafold.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Monitors as Code

> Manage Datafold monitors via version-controlled YAML for greater scalability, governance, and flexibility in code-based workflows.

**INFO** Please contact [support@datafold.com](mailto:support@datafold.com) if you'd like to enable this feature for your organization.

Managing monitors as code is particularly useful if any of the following are true:

* You have (or plan to have) 100s or 1000s of monitors
* Your team is accustomed to managing things in code
* Strict governance and change management are important to you

## Getting started

**INFO** This section describes how to get started with GitHub Actions, but the same concepts apply to other hosted version control platforms like GitLab and Bitbucket. Contact us if you need help getting started.

### Set up version control integration

To start using monitors as code, you'll need to decide which repository will contain your YAML configuration. If you've already connected a repository to Datafold, you could use that. Or, follow the instructions [here](/integrations/code-repositories) to connect a new repository.

### Generate a Datafold API key

If you've already got a Datafold API key, use it. Otherwise, you can create a new one in the app by visiting **Settings > Account** and selecting **Create API Key**.

### Create monitors config

In your chosen repository, create a new YAML file where you'll define your monitors config. For this example, we'll name the file `monitors.yaml` and place it in the root directory, but neither of these choices is a hard requirement. Leave the file blank for now—we'll come back to it in a moment.
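Once you've added a monitor to the file (see the examples below), you can also apply the configuration from your own machine using the same `datafold` CLI that the workflow in the next section installs from the `datafold-sdk` package. A minimal local sketch, assuming you have Python available and use the API key created above:

```bash theme={null}
# Install the CLI (it ships with the datafold-sdk Python package)
python -m pip install --upgrade pip
pip install datafold-sdk

# The CLI reads these environment variables; the host differs for dedicated deployments
export DATAFOLD_HOST="https://app.datafold.com"
export DATAFOLD_API_KEY="<your API key>"   # created in Settings > Account

# Apply the monitors defined in monitors.yaml to your Datafold organization
datafold monitors provision monitors.yaml
```

The CI workflow below runs this same command automatically on every push to your default branch.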
### Add CI workflow If you're using GitHub Actions, create a new YAML file under `.github/workflows/` using the following template. Be sure to tailor it to your particular setup: ```yaml theme={null} name: Apply monitors as code config to Datafold on: push: branches: - main # or master jobs: apply: runs-on: ubuntu-latest steps: - uses: actions/checkout@v2 - name: Set up Python uses: actions/setup-python@v2 with: python-version: 3.12 - name: Install dependencies run: | python -m pip install --upgrade pip pip install datafold-sdk - name: Update monitors run: datafold monitors provision monitors.yaml # use the correct file name/path env: DATAFOLD_HOST: https://app.datafold.com # different for dedicated deployments DATAFOLD_API_KEY: ${{ secrets.DATAFOLD_API_KEY }} # remember to add to secrets ``` ### Create a monitor Now return to your YAML configuration file to add your first monitor. Reference the list of examples below and select one that makes sense for your organization. ## Examples **INFO** These examples are intended to serve as inspiration and don't demonstrate every possible configuration. Contact us if you have any questions. ### Data Diff [Data Diff monitors](/data-monitoring/monitors/data-diff-monitors) detect differences between any two datasets, within or across databases. ```yaml theme={null} monitors: replication_test_example: name: 'Example of a custom name' description: 'Example of a custom description' type: diff enabled: true datadiff: diff_type: 'inmem' dataset_a: connection_id: 734 table: db.schema.table time_travel_point: '2020-01-01' materialize: false dataset_b: connection_id: 736 table: db.schema.table1 time_travel_point: '2020-01-01' materialize: true primary_key: - pk_column columns_to_compare: - col1 materialize_results: true materialize_results_to: 734 column_remapping: col1: col2 sampling: tolerance: 0.2 confidence: 0.95 threshold: 5000 ignore_string_case: true schedule: interval: every: hour replication_test_example_with_thresholds: type: diff enabled: true datadiff: diff_type: 'inmem' dataset_a: connection_id: 734 table: db.schema.table dataset_b: connection_id: 736 table: db.schema.table2 session_parameters: k: v primary_key: - pk_column tolerance: float: default: type: absolute value: 50 column_tolerance: A: type: relative value: 20 # % B: type: absolute value: 30.0 schedule: interval: every: hour alert: different_rows_count: 100 different_rows_percent: 10 replication_test_example_with_thresholds_and_notifications: type: diff enabled: true datadiff: diff_type: 'indb' dataset_a: connection_id: 734 table: db.schema.table dataset_b: connection_id: 734 table: db.schema.table3 primary_key: - pk_column schedule: interval: every: hour sampling: rate: 0.1 threshold: 100000 materialize_results: true tolerance: float: default: type: absolute value: 50 column_tolerance: A: type: relative value: 20 # % B: type: absolute value: 30.0 notifications: - type: email recipients: - valentin@datafold.com - type: slack integration: 123 channel: datafold-alerts mentions: - "here" - "channel" features: - attach_csv - notify_first_triggered_only - type: pagerduty integration: 124 - type: webhook integration: 125 alert: different_rows_count: 100 different_rows_percent: 10 ``` ### Metric [Metric monitors](/data-monitoring/monitors/metric-monitors) identify anomalies in standard metrics like row count, freshness, and cardinality, or in any custom metric. 
```yaml theme={null} monitors: table_metric_example: type: metric enabled: true connection_id: 736 metric: type: table table: db.schema.table filter: deleted is false metric: freshness # see full list of options below alert: type: automatic sensitivity: 10 schedule: interval: every: day hour: 8 # 0-23 UTC column_metric_example: type: metric enabled: true connection_id: 736 metric: type: column table: db.schema.table column: some_col filter: deleted is false metric: sum # see full list of options below alert: type: percentage increase: 30 # % decrease: 0 tags: - oncall - action-required schedule: cron: 0 0 * * * # every day at midnight UTC custom_metric_example: name: custom metric example type: metric connection_id: 123 notifications: [] tags: [] enabled: true metric: type: custom query: select * from table alert_on_missing_data: true alert: type: absolute max: 22.0 min: 12.0 schedule: interval: every: day type: daily ``` #### Supported metrics For more details on supported metrics, see the docs for [Metric monitors](/data-monitoring/monitors/metric-monitors#metric-types). **Table metrics:** * Freshness: `freshness` * Row Count: `row_count` **Column metrics:** * Cardinality: `cardinality` * Uniqueness: `uniqueness` * Minimum: `minimum` * Maximum: `maximum` * Average: `average` * Median: `median` * Sum: `sum` * Standard Deviation: `std_dev` * Fill Rate: `fill_rate` ### Data Test [Data Test monitors](/data-monitoring/monitors/data-test-monitors) validate your data with business rules and surface specific records that fail your tests. ```yaml theme={null} monitors: custom_data_test_example: type: test enabled: true connection_id: 736 query: select 1 from db.schema.table schedule: interval: every: hour tags: - team_1 accepted_values_test_example: type: test enabled: true connection_id: 736 test: type: accepted_values tables: - path: db.schema.table columns: - column_name variables: accepted_values: value: - 12 - 15 quote: false schedule: interval: every: hour numeric_range_test_example: type: test enabled: true connection_id: 736 test: type: numeric_range tables: - path: db.schema.table columns: - column_name variables: maximum: value: 15 quote: false schedule: interval: every: hour ``` **Supported variables by Standard Data Test (SDT) type** | SDT Type | Monitor-as-Code Type | Supported Variables | Variable Type | | --------------------- | ----------------------- | ------------------- | ---------------------- | | Unique | `unique` | - | - | | Not Null | `not_null` | - | - | | Accepted Values | `accepted_values` | `accepted_values` | Collection with values | | Referential Integrity | `referential_integrity` | - | - | | Numeric Range | `numeric_range` | `minimum` | Single value | | | | `maximum` | Single value | ### Schema Change [Schema Change monitors](/data-monitoring/monitors/schema-change-monitors) detect when changes occur to a table's schema. 
```yaml theme={null} monitors: schema_change_example: type: schema enabled: true connection_id: 736 table: db.schema.table schedule: interval: every: day hour: 22 # 0-23 UTC tags: - team_2 ``` ## Bulk Manage with Wildcards For certain monitor types—[Freshness](/data-monitoring/monitors/metric-monitors), [Row Count](/data-monitoring/monitors/metric-monitors), and [Schema Change](/data-monitoring/monitors/schema-change-monitors)—it's possible to create and manage many monitors at once using the following wildcard syntax: ```yaml theme={null} row_count_monitors: type: metric connection_id: 123 metric: type: table metric: row_count # include all tables in the WAREHOUSE database include_tables: WAREHOUSE.* # exclude all tables in the INFORMATION_SCHEMA schema exclude_tables: WAREHOUSE.INFORMATION_SCHEMA.* schedule: interval: every: day hour: 10 # 0-23 UTC ``` This is particularly useful if you want to create the same monitor type for many tables in a particular database or schema. Note in the example above that you can specify both `include_tables` and `exclude_tables` to fine-tune your selection. ## FAQ **Can I still create and manage monitors in the app?** Yes, it's not all or nothing. You can still create and manage monitors in the app even if you're defining others in code. **What happens to a monitor if I remove it from my code?** By default, nothing—it remains in the app. However, you can add the `--dangling-monitors-strategy [delete|pause]` flag to your `datafold monitors provision` command to either delete or pause monitors that are removed from your code. For example: ```bash theme={null} datafold monitors provision monitors.yaml --dangling-monitors-strategy delete ``` Note: this only applies to monitors that were created from code, not those created in the UI. **How do I remove all monitors managed as code?** Add the `--dangling-monitors-strategy [delete|pause]` flag to your `datafold monitors provision` command and replace the contents of your YAML file with the following: ```yaml theme={null} monitors: {} ``` Note that providing an empty YAML file will likely produce an error and not have the same effect. **Can monitors created from code be edited in the app?** No, any monitors created from code will be read-only in the app (though they can still be cloned). **Can I export monitors created in the app to manage them as code?** Yes, you can export all monitors from the app to manage them as code. There are two ways to do this: 1. Exporting all monitors: Navigate to the Monitors list page and click the **View as Code** button 2. Exporting a single monitor: Go to the specific monitor and click **Actions** and then select **View as Code** Note that when exporting monitors, pay attention to the `id` field in the YAML. If you want to preserve monitor history, keep the `id` field, as this will update the original monitor to be managed as code. If you don't want to preserve your monitor history, **delete** the `id` field to create a new monitor as code while keeping the original monitor intact. ## Need help? If you have any questions about how to use monitors as code, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com). --- # Source: https://docs.datafold.com/integrations/databases/mysql.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # MySQL **INFO** Please contact [support@datafold.com](mailto:support@datafold.com) if you use a MySQL version \< 8.x. **INFO** Column-level Lineage is not currently supported for MySQL. **Steps to complete:** 1. [Run SQL script for permissions and create schema for Datafold](/integrations/databases/mysql#run-sql-script-and-create-schema-for-datafold) 2.
[Configure your data connection in Datafold](/integrations/databases/mysql#configure-in-datafold) ### Run SQL script and create schema for Datafold To connect to MySQL, create a user with read-only access to all tables you wish to diff. Include read and write access to a Datafold-specific dataset: ```Bash theme={null} -- Create a temporary dataset for Datafold to utilize CREATE DATABASE IF NOT EXISTS datafold_tmp; -- Create a Datafold user CREATE USER 'datafold_user'@'%' IDENTIFIED BY 'SOMESECUREPASSWORD'; -- Grant read access to diff tables in YourSchema GRANT SELECT ON `YourSchema`.* TO 'datafold_user'@'%'; -- Grant access to all tables in the datafold_tmp database GRANT ALL ON `datafold_tmp`.* TO 'datafold_user'@'%'; -- Apply the changes FLUSH PRIVILEGES; ``` Datafold utilizes a temporary dataset, named `datafold_tmp` in the above script, to materialize scratch work and keep data processing in your warehouse. ### Configure in Datafold | Field Name | Description | | ---------------------------- | ------------------------------------------------------------------------------- | | Connection name | A name given to the data connection within Datafold | | Host | The hostname for your MySQL instance | | Port | MySQL connection port; default value is 3306 | | Username | The user created in our SQL script, named datafold\_user | | Password | The password created in our SQL script | | Database | The name of the MySQL database (schema) you want to connect to, e.g. YourSchema | | Dataset for temporary tables | The datafold\_tmp database created in our SQL script | Click **Create**. Your data connection is ready! --- # Source: https://docs.datafold.com/integrations/databases/netezza.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Netezza **INFO** Column-level Lineage is not currently supported for Netezza. **Steps to complete:** 1. [Configure user in Netezza](#configure-user-in-netezza) 2. [Create schema for Datafold](#create-a-temporary-database-for-datafold) 3. [Configure your data connection in Datafold](#configure-in-datafold) ## Configure user in Netezza To connect to Netezza, create a user with read-only access to all databases you may wish to diff. ## Create a temporary database for Datafold Datafold requires a schema with full permissions to store temporary data. ## Configure in Datafold | Field Name | Description | | --------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------- | | Connection Name | A name given to the data connection within Datafold. | | Host | The hostname for your Netezza instance (e.g., nz-85dcf66c-69aa-4ba6-b7cb-827643da5a.us-east-1.data-warehouse.cloud.ibm.com for Netezza SaaS). | | Port | Netezza endpoint port; the default value is 5480. | | Encryption | Whether to use TLS. | | User ID | User ID, e.g., DATAFOLD. | | Password | The password for the user above. | | Default DB | The database to connect to. | | Schema for Temporary Tables | Use DATABASE.SCHEMA format. | Click **Create**. Your data source is now ready! --- # Source: https://docs.datafold.com/deployment-testing/getting-started/universal/no-code.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further.
# No-Code > Set up Datafold's No-Code CI integration to create and manage Data Diffs without writing code. Data Diffs are easy to create and manage in the Datafold app. The No-Code CI integration lets teams (or individual users) attach those diffs to their pull requests and post the results back to the code review—without writing any code. ## Getting Started Get up and running with our No-Code CI integration in just a few steps. ### 1. Create a repository integration Connect your code repository using the appropriate [integration](/integrations/code-repositories). ### 2. Create a No-Code integration From the integrations page, create a new No-Code CI integration. ### 3. Set up the No-Code integration Complete the configuration by specifying the following fields: #### Basic settings | Field Name | Description | | ------------------ | ----------------------------------------------------- | | Configuration name | Choose a name for your Datafold integration. | | Repository | Select the repository you configured in step 1. | | Data Connection | Select the data connection your repository writes to. | #### Advanced settings | Field Name | Description | | ------------------ | ----------------------------------------------------------------------------------------------------------------------------- | | Pull request label | When this is selected, the Datafold CI process will only run when the `datafold` label has been applied to your pull request. | | Custom base branch | If provided, the Datafold CI process will only run on pull requests against the specified base branch. | ### 4. Create a pull request and add diffs Datafold will automatically post a comment on your pull request with a link to generate a CI run that corresponds to the latest set of changes. ### 5. Add diffs to your CI run Once in Datafold, add as many data diffs as you'd like to the CI run. If you need a refresher on how to configure data diffs, check out [our docs](/data-diff/in-database-diffing/creating-a-new-data-diff). ### 6. Add a summary to your pull request Click on **Save and Add Preview to PR** to post a summary to your pull request. ### 7. View the summary in your pull request ## Cloning diffs from the last CI run If you make additional changes to your pull request, clicking the **Add data diff** button generates a new CI run in Datafold. From there, you can: * Create a new Data Diff from scratch * Clone diffs from the last CI run You can also diff downstream tables by clicking on the **Add Data Diff** button in the Downstream Impact table. This creates additional Data Diffs: You can then post another summary to your pull request by clicking **Save and Add Preview to PR**. --- # Source: https://docs.datafold.com/integrations/oauth.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # OAuth Support > Set up OAuth App Connections in your supported data warehouses to securely execute data diffs on behalf of your users. This feature is currently supported for Databricks, Snowflake, Redshift, and BigQuery. OAuth support empowers users to run data diffs based on their individual permissions and roles configured within the data warehouses. This ensures that data access is governed by existing security policies and protocols. ## How it works ### 1. Create a Data Diff When you attempt to run a data diff, you will notice that it won't run without authentication: ### 2.
Authorize the Data Diff Authorize the data diff by clicking the **Authenticate** button. This will redirect you to the data warehouse for authentication: Upon successful authentication, you will be redirected back. ### 3. The Data Diff is now running ### 4. View the Data Diff results The results reflect your permissions within the data warehouse: Note that running the same data diff as a different user renders different results: The masked values represent the data retrieved from the data warehouse. We do not conduct any post-processing: By default, results are only visible to their authors. Users can still clone the data diffs, but the results might be different depending on their data warehouse access levels. For example, as a different user, I won't be able to access the data diff results for Filip's data diff: ### 5. Sharing Data Diffs Data diff sharing is a feature that enables you to share data diffs with other users. This is useful in scenarios such as compliance verification, where auditors can access specific data diffs without first requiring permissions to be set up in the data warehouse. Sharing can be accessed via the **Actions** dropdown on the data diff page: Note that data diff sharing is disabled by default: It can be enabled under **Org Settings** by clicking on **Allow Data Diff sharing**: Once enabled, you can share data diffs with other users: ## Configuring OAuth Navigate to **Settings** and click on your data connection. Then, click on **Advanced settings** and under **OAuth**, set the **Client Id** and **Client Secret** fields: ## Example: Databricks To create a new Databricks app connection: 1. Go to **Settings** and **App connections**. 2. Click **Add connection** in the top right of the screen. 3. Fill in the required fields: Application Name: ``` Datafold OAuth connection ``` Redirect URLs: ``` https://app.datafold.com/api/internal/oauth_dwh/callback ``` **INFO** Datafold caches **access tokens** and uses **refresh tokens** to fetch new valid tokens in order to complete the diffs and reduce the number of times users need to authenticate against the data warehouses. One hour is sufficient for the access token. The refresh token will determine the frequency of user reauthentication, whether it's daily, weekly, or monthly. ### 3. Click **Add** to obtain the **Client ID** and **Client Secret** ### 4. Fill in the **Client ID** and **Client Secret** fields in Datafold's Data Connection advanced settings: ### 5. Click **Test and save OAuth** You will be redirected to Databricks to complete authentication. If you are already authenticated, you will be redirected back. This notification signals a successful OAuth configuration: ### Additional steps for Databricks To ensure that users have correct access rights to temporary tables (stored in **Dataset for temporary tables** provided in the **Basic settings** for the Databricks connection), follow these steps: 1. Update the permissions for the **Dataset for temporary tables** in Databricks. 2. Grant these permissions to Datafold users: **USE SCHEMA** and **CREATE TABLE**. This will ensure that materialization results from data diffs are only readable by their authors. ## Example: Snowflake To create a new Snowflake app connection: 1.
Go to Snowflake and run this SQL: ```Bash theme={null} CREATE SECURITY INTEGRATION DATAFOLD_OAUTH TYPE = OAUTH ENABLED = TRUE OAUTH_CLIENT = CUSTOM OAUTH_CLIENT_TYPE = 'CONFIDENTIAL' OAUTH_REDIRECT_URI = 'https://app.datafold.com/api/internal/oauth_dwh/callback' PRE_AUTHORIZED_ROLES_LIST = ('<ROLE_1>', '<ROLE_2>', ...) OAUTH_ISSUE_REFRESH_TOKENS = TRUE OAUTH_REFRESH_TOKEN_VALIDITY = 604800 OAUTH_ENFORCE_PKCE=TRUE; ``` It should result in this message: **CAUTION** * `PRE_AUTHORIZED_ROLES_LIST` must include all roles allowed to use the current security integration. * By default, `ACCOUNTADMIN`, `SECURITYADMIN`, and `ORGADMIN` are not allowed to be included in `PRE_AUTHORIZED_ROLES_LIST`. **INFO** Datafold caches **access tokens** and uses **refresh tokens** to fetch new valid tokens in order to complete the diffs and reduce the number of times users need to authenticate against the data warehouses. `OAUTH_REFRESH_TOKEN_VALIDITY` can be in the range of 3600 (1 hour) to 7776000 (90 days). 2. To retrieve `OAUTH_CLIENT_ID` and `OAUTH_CLIENT_SECRET`, run the following SQL: ``` select system$show_oauth_client_secrets('DATAFOLD_OAUTH'); ``` ### Example result: 3. Fill in the **Client ID** and **Client Secret** fields in Datafold's Data Connection advanced settings: 4. Click **Test and save OAuth** You will be redirected to Snowflake to complete authentication. **INFO** Your default Snowflake role will be used for the generated **access token**. This notification signals a successful OAuth configuration: ### Additional steps for Snowflake To guarantee correct access rights to temporary tables (stored in **Dataset for temporary tables** provided in the **Basic settings** for the Snowflake connection): * Grant the required privileges on the database and `TEMP` schema. This must be done for all roles that will be using the OAuth flow. ```Bash theme={null} GRANT USAGE ON WAREHOUSE <warehouse> TO ROLE <role>; GRANT USAGE ON DATABASE <database> TO ROLE <role>; GRANT USAGE ON ALL SCHEMAS IN DATABASE <database> TO ROLE <role>; GRANT USAGE ON FUTURE SCHEMAS IN DATABASE <database> TO ROLE <role>; GRANT ALL ON SCHEMA <database>.<temp_schema> TO ROLE <role>; ``` * Revoke `SELECT` privileges for tables in the `TEMP` schema for all roles that will be using the OAuth flow (except for the `DATAFOLDROLE` role), if such privileges were previously granted. This action must be performed for all roles utilizing the OAuth flow. ```Bash theme={null} -- Revoke SELECT privileges for the TEMP SCHEMA revoke SELECT ON ALL TABLES IN SCHEMA <database>.<temp_schema> FROM ROLE <role>; revoke SELECT ON FUTURE TABLES IN SCHEMA <database>.<temp_schema> FROM ROLE <role>; revoke SELECT ON ALL VIEWS IN SCHEMA <database>.<temp_schema> FROM ROLE <role>; revoke SELECT ON FUTURE VIEWS IN SCHEMA <database>.<temp_schema> FROM ROLE <role>; revoke SELECT ON ALL MATERIALIZED VIEWS IN SCHEMA <database>.<temp_schema> FROM ROLE <role>; revoke SELECT ON FUTURE MATERIALIZED VIEWS IN SCHEMA <database>.<temp_schema> FROM ROLE <role>; -- Revoke SELECT privileges for a Database revoke SELECT ON ALL TABLES IN DATABASE <database> FROM ROLE <role>; revoke SELECT ON FUTURE TABLES IN DATABASE <database> FROM ROLE <role>; revoke SELECT ON ALL VIEWS IN DATABASE <database> FROM ROLE <role>; revoke SELECT ON FUTURE VIEWS IN DATABASE <database> FROM ROLE <role>; revoke SELECT ON ALL MATERIALIZED VIEWS IN DATABASE <database> FROM ROLE <role>; revoke SELECT ON FUTURE MATERIALIZED VIEWS IN DATABASE <database> FROM ROLE <role>; ``` **CAUTION** If one of the roles has `FUTURE GRANTS` at the database level, that role will also have `FUTURE GRANTS` on the `TEMP` schema. ## Example: Redshift Redshift does not support OAuth2. To execute data diffs on behalf of a specific user, that user needs to provide their own credentials to Redshift. 1. Configure permissions on the Redshift side.
Grant the necessary access rights to temporary tables (stored in the **Dataset for temporary tables** provided in the **Basic settings** for the Redshift connection): ```Bash theme={null} GRANT USAGE ON SCHEMA <temp_schema> TO <user>; GRANT CREATE ON SCHEMA <temp_schema> TO <user>; ``` 2. As an Administrator, select the **Enabled** toggle in Datafold's Redshift Data Connection **Advanced settings**: Then, click the **Test and Save** button. 3. As a User, add your Redshift credentials into Datafold. Click on your Datafold username to **Edit Profile**: Then, click **Add credentials** and select the required Redshift data connection from the **Data Connections** list: Finally, provide your Redshift username and password, and configure the **Delete on** field (after this date, your credentials will be removed from Datafold): Click **Create credentials**. ## Example: BigQuery 1. To create a new Google Cloud OAuth 2.0 Client ID, go to the Google Cloud console, navigate to **APIs & Services**, then **Credentials**, and click **+ CREATE CREDENTIALS**: Select **OAuth client ID**: From the list of **Application type**, select **Web application**: Provide a name in the **Name** field: In **Authorized redirect URIs**, provide `https://app.datafold.com/api/internal/oauth_dwh/callback`: Click **CREATE**. Then, download the OAuth Client credentials as a JSON file: 2. Activate BigQuery OAuth in Datafold by uploading the JSON OAuth credentials in the **JSON OAuth keys file** section, in Datafold's BigQuery Data Connection **Advanced settings**: Click **Test and Save**. ### Additional steps for BigQuery 1. Create a new temporary schema (dataset) for each OAuth user. Go to the Google Cloud console, navigate to BigQuery, select your project in BigQuery, and click on **Create dataset**: Provide `datafold_tmp_` as the **Dataset ID** and set the same region as configured for other datasets. Click **CREATE DATASET**: 2. Configure permissions for `datafold_tmp_`. Grant read/write/create/delete permissions to the user for their `datafold_tmp_` schema. This can be done by granting roles like **BigQuery Data Editor** or **BigQuery Data Owner** or any custom roles with the required permissions. Go to the Google Cloud console, navigate to BigQuery, select the `datafold_tmp_` dataset, and click **Create dataset** → **Manage Permissions**: Click **+ ADD PRINCIPAL**, specify the user and role, then click **SAVE**: **CAUTION** Ensure that only the specified user (excluding admins) has read/write/create/delete permissions on `datafold_tmp_`. 3. Configure the temporary schema in Datafold. As a user, navigate to `https://app.datafold.com/users/me`. If the user lacks credentials for BigQuery, click on **+ Add credentials**, select the BigQuery datasource from the list, and click **Create credentials**: The user will be redirected to `accounts.google.com` and then returned to the previous page: Select the BigQuery credentials from the list, input the **Temporary Schema** field in the format `.>`, and click **Update**: **INFO** Users can update BigQuery credentials only if they have the correct permissions for ``. --- # Source: https://docs.datafold.com/security/single-sign-on/saml/examples/okta.md # Source: https://docs.datafold.com/security/single-sign-on/okta.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Okta (OIDC) **NOTE** Okta SSO is available for both SaaS and dedicated cloud installations of Datafold.
## Create Okta App Integration[](#create-okta-app-integration "Direct link to Create Okta App Integration") **INFO** Creating an App Integration in Okta may require admin privileges. Start the integration by creating a web app integration in Okta. Next, log in to Okta interface and navigate to **Applications** and click **Create App Integration**. Then, in the configuration form, select **OpenId Connect (OIDC)** and **Web Application** as the Application Type. In the following section, you will set: * **App integration name**: A name to identify the integration. We suggest you use `Datafold`. * **Grant type**: Should be set to `Authorization code` automatically. * **Sign-in redirect URI**: The redirect URL should be `https://app.datafold.com/oauth/okta/client_id`, where `client_id` is the Client ID of the configuration. **CAUTION** You will be given the Client ID after saving the integration and need to come back to update the client ID afterwards. The redirect URL should be `https://your-dns-name/oauth/okta`, replacing `your-dns-name` with the DNS name for your installation. * **Sign-out redirect URIs**: Leave this empty. * **Trusted Origins**: Leave this empty too. * **Assignments**: Select `Skip group assignment for now`. Later you should assign the correct groups and users. * Click "Save" to create the app integration in Okta. Once the save is successful, on the next screen, you'll be presented with Client ID and Client Secret. We need these IDs to update the redirect URLs that Datafold needs. We'll also apply the Client ID and Client Secret in the Datafold integration later. * Edit "General settings" * Scroll down to the **Login** section * Update the **Sign-in redirect URI**. See above for details. * Click "Save" to persist the changes. ## Set Up Okta-initiated login **TIP** Organization admins will always be able to log in with either password or Okta. Non-admin users will be required to log in through Okta once configured. This step is optional and should be done at the discretion of the Okta administrator. Users in your organization can log in to the application directly from the Okta end-user dashboard. To enable this feature, configure the integration as follows: 1. Edit "General settings" 2. Set **Login initiated by** to `Either Okta or App`. 3. Set **Application visibility** to `Display application icon to users`. 4. Set **Login flow** to `Redirect to app to initiate login (OIDC Compliant)`. 5. Set **Initiate login URI**: * `https://app.datafold.com/login/sso/client-id?action=desired_action` * Replace `client-id` with the Client ID of the configuration, and * Replace `desired_action` with `signup` if you enabled users auto-creation, or `login` otherwise. * `https://your-dns-name/login/sso/client-id?action=desired_action` * Replace `client-id` with the Client ID of the configuration, and * Replace `desired_action`with `signup` if you enabled users auto-creation, or `login` otherwise. * Replace `your-dns-name` with the DNS name for your installation. 1. Click "Save" to persist the changes. The Okta configuration is now complete. ## Configure Okta in Datafold To finish the configuration, create an Okta integration in Datafold. To complete the integration in Datafold, create a new integration by navigating to **Settings** → **Integrations** → **SSO** → **Add new integration** → **Okta**. * Paste in your Okta **Client Id** and **Client Secret**. 
* The **Metadata Url** of the Okta OAuth server is `https://okta-server-name/.well-known/openid-configuration`; replace `okta-server-name` with the name of your Okta domain. * If you'd like to auto-create users in Datafold that are authorized in Okta, enable the **Allow Okta to auto-create users in Organization** switch. * Finally, click **Save**. **TIP** Users can either be explicitly invited in Datafold by an admin user, using the same email as used in Okta, or they can be auto-created. When the `signup` action is set in the login URI, Okta users who have been assigned to the Datafold application in Okta will be able to log in. If such a user has not yet been invited, Datafold will automatically create a user for them, since they're already authenticated by the Okta server of your domain. The user will then receive an email to confirm their email address. ## Synchronize state with Datafold \[Optional] This step is essential if you want to ensure that users from your organization are automatically logged out when they are unassigned or deactivated in Okta. 1. Navigate to **Okta Admin panel** → **Workflow** → **Event Hooks** 2. Click **Create Event Hook** 3. Set **Name** to `Datafold` 4. Set **URL** to `https://app.datafold.com/hooks/oauth/okta/` 5. Set **Authentication field** to `secret` 6. Go to Datafold and generate a secret token in **Settings** → **Integrations** → **SSO** → **Okta**. Click the **Generate** button, copy the token using the **Copy** button, and click **Save**. Paste the copied token into the **Authentication secret** field in Okta. **CAUTION** Keep this secret token safe, as you won't be able to see it again after saving your integration. 7. In **Subscribe to events** add events: `User suspended`, `User deactivated`, `Deactivate application`, `User unassigned from app` 8. Click **Save & Continue**. On **Verify Endpoint Ownership** click **Verify** * If the verification is successful, you have completed the setup. ## Testing the Okta integration * Visit [https://app.datafold.com](https://app.datafold.com) * Type in your email and wait up to five seconds. * The Okta button should switch from disabled to enabled. * Click the Okta login button. * The browser should be redirected to your Okta domain, authenticate the user there and be redirected back to the Datafold application. * Visit `https://your-dns-name`, replacing your-dns-name with the domain name of your installation. * Type in your email and wait up to five seconds. * The Okta button should switch from disabled to enabled. * Click the Okta login button. * The browser should be redirected to your Okta domain, authenticate the user there and be redirected back to the Datafold application. If this didn't work, pay close attention to any error messages, or contact `support@datafold.com`. --- # Source: https://docs.datafold.com/deployment-testing/configuration/datafold-ci/on-demand.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Running Data Diff for Specific PRs/MRs > By default, Datafold CI runs on every new pull/merge request and commits to existing ones. To **only** run Datafold CI when the user explicitly requests it, you can set the **Run only when tagged** option in the Datafold app [CI settings](https://app.datafold.com/settings/integrations/ci), which will only allow Datafold CI to run if a `datafold` tag/label is assigned to the pull/merge request.
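If your repository is hosted on GitHub, one way to request a run from the command line is to apply the label with the GitHub CLI. This is a minimal sketch, not part of Datafold's setup: the PR number `123` is hypothetical, and it assumes you have the `gh` CLI installed and authenticated.

```Bash theme={null}
# Hypothetical example: label PR #123 so Datafold CI runs for it
gh pr edit 123 --add-label "datafold"
```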
## Running data diff on specific file changes By default, Datafold CI will run on any file change in the repo. To skip Datafold CI runs for certain modified files (e.g., if the dbt code is placed in the same repo with non-dbt code), you can specify files to ignore. The pattern uses the syntax of .gitignore. Excluded files can be re-included by using the `!` negation prefix. ### Example Let's say the dbt project is a folder in a repo that contains other code (e.g., Airflow). We want to run Datafold CI for changes to dbt models but skip it for other files. For that, we exclude all files in the repo except those in the /dbt folder. We also want to filter out `.md` files in the /dbt folder: ```Bash theme={null}
*
!dbt/*
dbt/*.md
``` **SKIPPING SPECIFIC DBT MODELS** To skip diffing individual dbt models in CI, use the [never\_diff](/deployment-testing/configuration/model-specific-ci/excluding-models) option in the Datafold dbt yaml config. --- # Source: https://docs.datafold.com/integrations/databases/oracle.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Oracle **INFO** Please contact [support@datafold.com](mailto:support@datafold.com) if you use an Oracle version \< 19.x. **INFO** Column-level Lineage is not currently supported for Oracle. **Steps to complete:** 1. [Run SQL script and create schema for Datafold](/integrations/databases/oracle#run-sql-script-and-create-schema-for-datafold) 2. [Configure your data connection in Datafold](/integrations/databases/oracle#configure-in-datafold) ## Run SQL script and create schema for Datafold To connect to Oracle, create a user with read-only access to all tables you wish to diff. Include read and write access to a Datafold-specific temp schema: ```Bash theme={null} -- Switch container context (default is "XEPDB1") ALTER SESSION SET CONTAINER = YOURCONTAINER; -- Create a Datafold user/schema CREATE USER DATAFOLD IDENTIFIED BY somesecurepassword; -- Allow Datafold user to connect GRANT CREATE SESSION TO DATAFOLD; -- Allow user to create tables in DATAFOLD schema GRANT CREATE TABLE TO DATAFOLD; -- Grant read access to diff tables in your schema GRANT SELECT ON "YOURSCHEMA"."YOURTABLE" TO DATAFOLD; -- Grant access to DBMS_CRYPTO utilities (hashing functions, etc.) GRANT EXECUTE ON SYS.DBMS_CRYPTO TO DATAFOLD; -- Allow Datafold users/schemas to use disk space (adjust if needed) GRANT UNLIMITED TABLESPACE TO DATAFOLD; -- Apply the changes COMMIT; ``` Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in your warehouse. ## Configure in Datafold | Field Name | Description | | --------------------------- | ---------------------------------------------------------------------------------------------- | | Name | A name given to the data connection within Datafold | | Host | The hostname address for your database | | Port | Oracle connection port; default value is 1521 | | User | The user role created in our SQL script, named DATAFOLD | | Password | The password created in our SQL script | | Connection type | Choose Service or SID depending on your connection type; default value is Service | | Service (or SID) | The name of the database (Service or SID) you want to connect to, e.g. XEPDB1 or YOURCONTAINER | | Schema for temporary tables | The user/schema created in our SQL script - DATAFOLD | Click **Create**. Your data connection is ready!
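Optionally, you can sanity-check the credentials and service name before (or after) configuring the connection. The sketch below uses SQL*Plus Easy Connect syntax and assumes a hypothetical host `your-oracle-host` plus the `DATAFOLD` user, port, and `XEPDB1` service from the example above; your values will differ.

```Bash theme={null}
# Hypothetical example: verify the DATAFOLD user can reach the service Datafold will use
sqlplus DATAFOLD/somesecurepassword@//your-oracle-host:1521/XEPDB1
```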
--- # Source: https://docs.datafold.com/integrations/orchestrators.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Integrate with Orchestrators > Integrate Datafold with dbt Core, dbt Cloud, Airflow, or custom orchestrators to streamline your data workflows with automated monitoring, testing, and seamless CI integration. **NOTE** To integrate with dbt, first set up a [Data Connection](/integrations/databases) and integrate with [Code Repositories](/integrations/code-repositories). Then navigate to **Settings** → **dbt** and click **Add New Integration**. Set up Datafold with dbt Core to enable automated data diffs and CI/CD integration. Integrate with dbt Cloud to enable automated data diffs and CI/CD integration. Use Datafold's API and SDK to build custom CI integrations tailored to your workflow. --- # Source: https://docs.datafold.com/faq/overview.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Overview > Get answers to the most common questions regarding our product. Have a question that isn’t answered here? Feel free to reach out to us at [support@datafold.com](mailto:support@datafold.com), and we’ll be happy to assist you! --- # Source: https://docs.datafold.com/integrations/notifications/pagerduty.md # PagerDuty > Receive notifications for monitors in PagerDuty. ## Prerequisites * PagerDuty access with permissions to manage `Services` * A Datafold account with admin privileges ## Configure the Integration 1. In Datafold, go to Settings > Integrations > Notifications 2. Click "Add New Integration" 3. Select "PagerDuty" 4. Go to the PagerDuty console and [create a new `Service`](https://support.pagerduty.com/main/docs/services-and-integrations#create-a-service) 5. Select `Events API V2` as a service integration 6. Go to your service's `Integrations` page and copy the `Integration Key` (or [generate a new one](https://support.pagerduty.com/main/docs/services-and-integrations#generate-a-new-integration-key)) 7. Return to Datafold and provide `Service Name` and `Integration Key` 8. Save the integration settings in Datafold You're all set! When you configure a monitor in Datafold, you'll now have the option to send notifications to the PagerDuty integration you configured. ## Need help? If you have any questions about integrating with PagerDuty, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com). --- # Source: https://docs.datafold.com/faq/performance-and-scalability.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Performance and Scalability Datafold is highly scalable, supporting data teams working with billion-row datasets and thousands of data transformation/dbt models. It offers powerful performance optimization features such as [SQL filtering](/deployment-testing/configuration/model-specific-ci/sql-filters), [sampling](/data-diff/cross-database-diffing/best-practices), and [Slim Diff](/deployment-testing/best-practices/slim-diff), which allow you to focus on testing the datasets that are most critical to your business, ensuring efficient and targeted data quality validation. 
Datafold pushes down compute to your database, and the performance of data diffs largely depends on the underlying SQL engine. Here are some in-app strategies to optimize performance: 1. [Enable sampling](/data-diff/cross-database-diffing/best-practices): Sampling reduces the amount of data processed by comparing a randomly chosen subset. This approach balances diff detail with processing time and cost, suitable for most use cases. 2. [Use SQL Filters](/deployment-testing/configuration/model-specific-ci/sql-filters): If you only need to compare a specific subset of data (e.g., for a particular city or a recent time period), adding a SQL filter can streamline the diff process. 3. **Exclude columns/tables**: When certain columns or tables are unnecessary for critical comparisons—such as temporary tables with dynamic values, metadata fields, or timestamp columns that always differ—you can exclude these to increase diff efficiency and speed. You can exclude columns when you create a new Data Diff or when you clone an existing one: To exclude them in your CI/CD pipeline, [follow this guide](/integrations/orchestrators/dbt-core#advanced-settings-configuration) to specify them in the Advanced settings of your CI/CD configuration in Datafold. 4. **Optimize SQL queries**: Refactor your SQL queries to improve the efficiency of database operations, reducing execution time and resource usage. 5. **Leverage database performance features**: Ensure your database is configured to match typical diff workload patterns. Utilize features like query optimization, caching, and parallel processing to boost performance. 6. **Increase data warehouse resources**: If using a platform like Snowflake, consider increasing the size of your warehouse to allocate more resources to Datafold operations. --- # Source: https://docs.datafold.com/integrations/databases/postgresql.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # PostgreSQL **INFO** Column-level Lineage is supported for AWS Aurora and RDS Postgres and *requires* Cloudwatch to be configured. **Steps to complete:** 1. [Run SQL script and create schema for Datafold](/integrations/databases/postgresql#run-sql-script-and-create-schema-for-datafold) 2. [Configure your data connection in Datafold](/integrations/databases/postgresql#configure-in-datafold) ## Run SQL script and create schema for Datafold To connect to Postgres, you need to create a user with read-only access to all tables in all schemas, and write access to a Datafold-specific schema for temporary tables: ```Bash theme={null} /* Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in your warehouse. */ CREATE SCHEMA datafold_tmp; /* Create a datafold user */ CREATE ROLE datafold WITH LOGIN ENCRYPTED PASSWORD 'SOMESECUREPASSWORD'; /* Give the datafold role write access to the temporary schema */ GRANT ALL ON SCHEMA datafold_tmp TO datafold; /* Make sure that the datafold user has read permissions on the tables */ GRANT USAGE ON SCHEMA <your_schema> TO datafold; GRANT SELECT ON ALL TABLES IN SCHEMA <your_schema> TO datafold; ``` Datafold utilizes a temporary schema, named `datafold_tmp` in the above script, to materialize scratch work and keep data processing in your warehouse.
## Configure in Datafold | Field Name | Description | | --------------------------- | --------------------------------------------------------------- | | Name | A name given to the data connection within Datafold | | Host | The hostname address for your database; default value 127.0.0.1 | | Port | Postgres connection port; default value is 5432 | | User | The user role created in our SQL script, named datafold | | Password | The password created in our SQL script | | Database Name | The name of the Postgres database you want to connect to | | Schema for temporary tables | The schema (datafold\_tmp) created in our SQL script | Click **Create**. Your data connection is ready! *** ## Column-level Lineage with Aurora & RDS This will guide you through setting up Column-level Lineage with AWS Aurora & RDS using CloudWatch. **Steps to complete:** 1. [Setup Postgres with Permissions](#run-sql-script) 2. [Increase the logging verbosity of Postgres](#increase-logging-verbosity) so Datafold can parse lineage 3. [Set up an account for fetching the logs from CloudWatch.](#connect-datafold-to-cloudwatch) 4. [Configure your data connection in Datafold](#configure-in-datafold) ### Run SQL Script To connect to Postgres, you need to create a user with read-only access to all tables in all schemas, write access to Datafold-specific schema for temporary tables: ```Bash theme={null} /* Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in the your warehouse. */ CREATE SCHEMA datafold_tmp; /* Create a datafold user */ CREATE ROLE datafold WITH LOGIN ENCRYPTED PASSWORD 'SOMESECUREPASSWORD'; /* Give the datafole role write access to the temporary schema */ GRANT ALL ON SCHEMA datafold_tmp TO datafold; /* Make sure that the postgres user has read permissions on the tables */ GRANT USAGE ON SCHEMA TO datafold; GRANT SELECT ON ALL TABLES IN SCHEMA TO datafold; ``` ### Increase logging verbosity Then, create a new `Parameter Group`. Database instances run with default parameters that do not include logging verbosity. To turn on the logging verbosity, you'll need to create a new Parameter Group. Hit **Parameter Groups** on the menu and create a new Parameter Group. Next, select the `aurora-postgresql10` parameter group family. This depends on the cluster that you're running. For Aurora serverless, this is the appropriate family. Finally, set the `log_statement` enum field to `mod` - meaning that it will log all the DDL statements, plus data-modifying statements. Note: This field isn't set by default. After saving the parameter group, go back to your database, and select the database cluster parameter group. ### Connect Datafold to CloudWatch Start by creating a new user to isolate the permissions as much as possible. Go to IAM and create a new user. Next, create a new group named `CloudWatchLogsReadOnly` and attach the `CloudWatchLogsReadOnlyAccess` policy to it. Next, select the group. When reviewing the user, it should have the freshly created group attached to it. After confirming the new user you should be given the `Access Key` and `Secret Key`. Save these two codes securely to finish configurations on Datafold. The last piece of information Datafold needs is the CloudWatch Log Group. You will find this in CloudWatch under the Log Group section in the sidebar. It will be formatted as `/aws/rds/cluster//postgresql`. 
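If you prefer to script the IAM setup rather than click through the console, the same steps can be expressed with the AWS CLI. This is a sketch under assumptions: the user name `datafold-cloudwatch` is hypothetical, the group and policy names mirror the console steps above, and you need IAM permissions to run these commands.

```Bash theme={null}
# Create the IAM user Datafold will use to read CloudWatch logs
aws iam create-user --user-name datafold-cloudwatch

# Create the read-only group and attach the CloudWatchLogsReadOnlyAccess managed policy
aws iam create-group --group-name CloudWatchLogsReadOnly
aws iam attach-group-policy \
  --group-name CloudWatchLogsReadOnly \
  --policy-arn arn:aws:iam::aws:policy/CloudWatchLogsReadOnlyAccess

# Add the user to the group
aws iam add-user-to-group --user-name datafold-cloudwatch --group-name CloudWatchLogsReadOnly

# Generate the Access Key / Secret Key pair to paste into Datafold
aws iam create-access-key --user-name datafold-cloudwatch
```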
### Configure in Datafold | Field Name | Description | | ----------------------------- | --------------------------------------------------------------------------------------------------------------------------------------- | | Name | A name given to the data connection within Datafold | | Host | The hostname address for your database; default value 127.0.0.1 | | Port | Postgres connection port; default value is 5432 | | User | The user role created in the SQL script; datafold | | Password | The password created in the SQL permissions script | | Database Name | The name of the Postgres database you want to connect to | | AWS Access Key | The Access Key provided in the [Connect Datafold to CloudWatch](/integrations/databases/postgresql#connect-datafold-to-cloudwatch) step | | AWS Secret | The Secret Key provided in the [Connect Datafold to CloudWatch](/integrations/databases/postgresql#connect-datafold-to-cloudwatch) step | | Cloudwatch Postgres Log Group | The path of the Log Group; formatted as /aws/rds/cluster/\/postgresql | | Schema for temporary tables | The schema created in the SQL setup script; datafold\_tmp | Click **Create**. Your data connection is ready! --- # Source: https://docs.datafold.com/integrations/bi-data-apps/power-bi.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Power BI > Include Power BI entities in Data Explorer and column-level lineage. ## Overview Our Power BI integration can help you visualize column-level lineage dependencies between warehouse tables and Power BI entities using [Data Explorer](/data-explorer/how-it-works). Datafold supports the following Power BI entity types: * Tables (with Columns) * Reports (with Fields) * Dashboards ## Choose your authentication method Datafold supports two authentication methods for Power BI. Choose the one that best fits your organization's needs. Key difference: * Delegated auth uses your user's identity, is tied to your account and permissions, also requiring you to be a Power Platform Administrator; * Service Principal is an independent application identity that doesn't depend on any user, but can be a bit more complicated to setup. ### Set up the integration Navigate to [**Microsoft 365 admin center** -> **Active users**](https://admin.microsoft.com/#/users) and choose the user that Datafold will authenticate under. As highlighted in the screenshot above, this user should have the **Power Platform Administrator** role assigned to it. Click **Manage roles**, enable the **Power Platform Administrator** role, and save changes. Navigate to [Power BI Admin Portal](https://app.powerbi.com/admin-portal/tenantSettings?experience=power-bi) and enable the following two settings: * Enhance admin APIs responses with detailed metadata * Enhance admin APIs responses with DAX and mashup expressions In the Datafold app, navigate to **Settings** -> **BI & Data Apps**, and click **+ Add new integration**. Choose **Power BI** from the list. ...and then **Save**. On clicking **Save**, the system will redirect you to Power BI. ...if not already signed in. Allow the Datafold integration to use Power BI. Depending on the roles configured for your user in the Admin center, you may require a confirmation from a **Global Administrator**. Follow the steps in the wizard. You will be redirected back to Datafold and see a message that Power BI is successfully connected. ### Set up the integration 1. 
Go to [Microsoft Entra admin center - New Registration](https://entra.microsoft.com/?l=en.en-us#view/Microsoft_AAD_RegisteredApps/CreateApplicationBlade/quickStartType~/null/isMSAApp~/false) 2. Configure the application: * **Name**: `Datafold Power BI Integration` (or similar) * **Supported account types**: "Accounts in this organizational directory only" * **Redirect URI**: Leave blank (not needed for Service Principal) 3. Click **Register** 4. Note the **Application (client) ID** and **Directory (tenant) ID** from the Overview page 1. In the App Registration, go to **Certificates & secrets** 2. Click **New client secret** 3. Add a description (e.g., "Datafold integration") and choose an expiration period 4. Click **Add** 5. **Important**: Copy the secret **Value** immediately—it won't be shown again 1. Go to [Microsoft Entra admin center - Groups](https://entra.microsoft.com/?l=en.en-us#view/Microsoft_AAD_IAM/AddGroupBlade) 2. Click **New group** 3. Configure: * **Group type**: Security * **Group name**: `Power BI Service Principals` (or similar) * **Group description**: "Service principals allowed to access Power BI APIs" * **Membership type**: Assigned 4. In the **Members** section, click **Add members** 5. Search for and add your App Registration (by name or Client ID) 6. Click **Create** 1. Go to [Power BI Admin Portal](https://app.powerbi.com/admin-portal/tenantSettings) 2. Navigate to **Tenant settings** 3. Enable these settings and apply them to your security group (or to the whole organization, as you see fit): * **Allow service principals to use Power BI APIs** * **Allow service principals to use read-only admin APIs** * **Enhance admin APIs responses with detailed metadata** * **Enhance admin APIs responses with DAX and mashup expressions** You must explicitly grant access to each workspace you want Datafold to sync: 1. Open the Power BI workspace you want to sync 2. Click **Access** (or the gear icon -> Manage access) 3. Add your App Registration as an **Admin** or **Member** 4. Repeat for each workspace you want Datafold to access 1. Go to Datafold -> **Settings** -> **BI & Data Apps** 2. Click **+ Add new integration** -> **Power BI** 3. Select **Service Principal** as the authentication type 4. Enter the credentials: * **Client ID**: The Application (client) ID from Step 1 * **Client Secret**: The secret value from Step 2 * **Tenant ID**: The Directory (tenant) ID from Step 1 5. Click **Save** ## Verify the integration You can check out **Jobs** -> **BI & Data Apps** for the status of the sync job. See [Tracking Jobs](/integrations/bi-data-apps/tracking-jobs) for more details. When the sync is complete, you will see Power BI entities in **Data Explorer**. ## Need help? If you have any questions about our Power BI integration, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com). --- # Source: https://docs.datafold.com/deployment-testing/configuration/primary-key.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Primary Key Inference > Datafold requires a primary key to perform data diffs. Using dbt metadata, Datafold identifies the column to use as the primary key for accurate data diffs. Datafold supports composite primary keys, meaning that you can assign multiple columns that make up the primary key together. 
## Metadata The first option is setting the `primary-key` key in the dbt metadata. There are [several ways to configure this](https://docs.getdbt.com/reference/resource-configs/meta) in your dbt project using either the `meta` key in a yaml file or a model-specific config block. ```Bash theme={null} models: - name: users columns: - name: user_id meta: primary-key: true ## for compound primary keys, set all parts of the key as a primary-key ## # - name: company_id # meta: # primary-key: true ``` ## Tags If the primary key is not found in the metadata, it will go through the [tags](https://docs.getdbt.com/reference/resource-properties/tags). ```Bash theme={null} models: - name: users columns: - name: user_id tags: - primary-key ## for compound primary keys, tag all parts of the key ## # - name: company_id # tags: # - primary-key ``` ## Inferred If the primary key isn't provided explicitly, Datafold will try to infer a primary key from dbt's uniqueness tests. If you have a single column uniqueness test defined, it will use this column as the PK. ```Bash theme={null} models: - name: users columns: - name: user_id tests: - unique ``` Also, model-level uniqueness tests can be used for inferring the PK. ```Bash theme={null} models: - name: sales columns: - name: col1 - name: col2 ... tests: - unique: column_name: "col1 || col2" # or column_name: "CONCAT(col1, col2)" # we also support dbt_utils unique_combination_of_columns test - dbt_utils.unique_combination_of_columns: combination_of_columns: - order_no - order_line ``` Keep in mind that this is a failover mechanism. If you change the uniqueness test, this will also impact the way Datafold performs the diff. --- # Source: https://docs.datafold.com/data-explorer/profile.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Profile > View a data profile that summarizes key table and column-level statistics, and any upstream dependencies. --- # Source: https://docs.datafold.com/integrations/databases/redshift.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Redshift **Steps to complete:** 1. [Run SQL script and create schema for Datafold](/integrations/databases/redshift#run-sql-script-and-create-schema-for-datafold) 2. [Configure your data connection in Datafold](/integrations/databases/redshift#configure-in-datafold) ## Run SQL script and create schema for Datafold To connect to Amazon Redshift, you must create a user with the following permissions: * **Read-only access** to all tables in all schemas * **Write access** to a dedicated temporary schema for Datafold * **Access to SQL logs** for lineage construction Datafold uses a temporary dataset to materialize scratch work and keep data processing in the your warehouse. Create the schema with: ``` CREATE SCHEMA datafold_tmp; ``` Next, create the Datafold user. To grant read access to all schemas, the user must have superuser-level privileges in Redshift: ``` CREATE USER datafold CREATEUSER PASSWORD 'SOMESECUREPASSWORD'; ``` Grant unrestricted access to system logs so Datafold can build column-level lineage: ``` ALTER USER datafold WITH SYSLOG ACCESS UNRESTRICTED; ``` Datafold utilizes a temporary schema, named `datafold_tmp` in the above script, to materialize scratch work and keep data processing in your warehouse. 
## Configure in Datafold | Field Name | Description | | --------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------- | | Name | A name given to the data connection within Datafold | | Host | The hostname of your cluster. (Go to Redshift in your AWS console, select your cluster, the hostname is the endpoint listed at the top of the page) | | Port | Redshift connection port; default value is 5439 | | User | The user created in our SQL script, named `datafold` | | Password | The password created in our SQL script | | Database Name | The name of the Redshift database you want to connect to | | Schema for temporary tables | The schema (`datafold_tmp`) created in our SQL script | Click **Create**. Your data connection is ready! --- # Source: https://docs.datafold.com/api-reference/bi/remove-an-integration.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Remove an integration ## OpenAPI ````yaml delete /api/v1/lineage/bi/{bi_datasource_id}/ openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/lineage/bi/{bi_datasource_id}/: delete: tags: - BI - bi_deleted summary: Remove an integration operationId: remove_integration_api_v1_lineage_bi__bi_datasource_id___delete parameters: - in: path name: bi_datasource_id required: true schema: title: BI integration id type: integer responses: '200': content: application/json: schema: {} description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/bi/rename-a-power-bi-integration.md # Rename a Power BI integration > It can only update the name. Returns the integration with changed fields. 
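For reference, a request to this endpoint might look like the `curl` sketch below. The integration ID `123` and the new name are hypothetical, and `DATAFOLD_API_KEY` is assumed to hold a valid API key, sent in the `Authorization: Key ...` header described in the spec.

```Bash theme={null}
# Hypothetical example: rename Power BI integration 123
curl -X PUT "https://app.datafold.com/api/v1/lineage/bi/powerbi/123/" \
  -H "Authorization: Key $DATAFOLD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"name": "Power BI (production)"}'
```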
## OpenAPI ````yaml openapi-public.json put /api/v1/lineage/bi/powerbi/{bi_datasource_id}/ paths: path: /api/v1/lineage/bi/powerbi/{bi_datasource_id}/ method: put servers: - url: https://app.datafold.com description: Default server request: security: - title: ApiKeyAuth parameters: query: {} header: Authorization: type: apiKey description: Use the 'Authorization' header with the format 'Key ' cookie: {} parameters: path: bi_datasource_id: schema: - type: integer required: true title: Power BI integration id query: {} header: {} cookie: {} body: application/json: schemaArray: - type: object properties: indexing_cron: allOf: - anyOf: - type: string - type: 'null' title: Indexing Cron name: allOf: - anyOf: - type: string - type: 'null' title: Name required: true title: PowerBIDataSourceConfig description: Power BI data source parameters. refIdentifier: '#/components/schemas/PowerBIDataSourceConfig' examples: example: value: indexing_cron: name: response: '200': application/json: schemaArray: - type: any examples: example: value: description: Successful Response '422': application/json: schemaArray: - type: object properties: detail: allOf: - items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError refIdentifier: '#/components/schemas/HTTPValidationError' examples: example: value: detail: - loc: - msg: type: description: Validation Error deprecated: false type: path components: schemas: ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object ```` --- # Source: https://docs.datafold.com/faq/resource-management.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Resource Management Recognizing the importance of efficient data reconciliation, we provide a number of strategies to make the diffing process as efficient as possible: **Efficient Algorithm** Datafold connects to any SQL source and target databases, similar to how BI tools do. Datasets from both data connections are co-located in a centralized database to execute comparisons and identify specific rows, columns, and values with differences. To perform diffs at massive scale and increased speed, users can apply sampling, filtering, and column selection. **Flexible Controls** Users can easily control the volume of data used in diffing by using: * [Filters](/deployment-testing/configuration/model-specific-ci/sql-filters): Focus on the most relevant part of the dataset * [Sampling](/data-diff/cross-database-diffing/best-practices): Set sampling as a percentage of rows or desired confidence level * [Slim Diff](/deployment-testing/best-practices/slim-diff): Selectively diff only the models that have dbt code changes in your pull request. **Workload Management** Users can apply controls to enforce low diffing footprint: * On the Datafold side: Set desired concurrency * On the database side: Most databases support workload management settings to ensure that Datafold does not consume more than X% CPU or Y% RAM Also, consider that using a data quality tool like Datafold to catch issues before production will reduce cost over time as it lowers the need for expensive reprocessing and troubleshooting. 
Datafold's features like filtering, sampling, and Slim Diff ensure that only relevant datasets are tested, minimizing the computational load on your data warehouse. This targeted approach can lead to more efficient resource usage and potentially lower data warehouse operation costs. --- # Source: https://docs.datafold.com/data-diff/in-database-diffing/results.md # Source: https://docs.datafold.com/data-diff/cross-database-diffing/results.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Results > Once your data diff is complete, Datafold provides a concise, high-level summary of the detected changes in the Overview tab. ## Overview The top-level menu displays the diff status, job ID, creation and completed times, runtime, and data connection. ## Columns The Columns tab displays a table with detailed column and type mappings from the two datasets being diffed, with status indicators for each column comparison (e.g., identical, percentage of values different). This provides a quick way to identify data inconsistencies and prioritize updates. ## Primary keys This tab highlights rows that are unique to the Target dataset in a data diff ("Rows exclusive to Target"). As this identifies rows that exist only in the Target dataset and not in the Source dataset based on the primary key, it flags potential data discrepancies. The Clone **diffs and materialize results** button allows you to rerun existing data diffs with results materialized in the warehouse, as well as any other desired modifications. ## Values This tab displays rows where at least one column value differs between the datasets being compared. It is useful for quickly assessing the extent of discrepancies between the two datasets. The **Show filters** button enables the following features: * Highlight characters: highlight value differences between tables * % of difference: filters and displays columns based on the specified percentage range of value differences --- # Source: https://docs.datafold.com/api-reference/lineagev2/run-cypher.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Run Cypher > Execute arbitrary Cypher query and return results. Args: request: CypherRequest with query string Returns: CypherResponse containing: - columns: List of column names returned by the query - results: List of result rows as dictionaries (tabular view) - nodes: All graph nodes returned by the query - edges: All graph edges/relationships returned by the query Example queries: - Find all tables: "MATCH (t:Dataset) RETURN t.name LIMIT 10" - Find circular dependencies: "MATCH (t:Dataset)-[:DEPENDS_ON*]->(t) RETURN t" - Count by type: "MATCH (d:Dataset) RETURN d.asset_type, count(*) as count" - Complex lineage: "MATCH path=(c1:Column)-[:DERIVED_FROM*1..3]->(c2:Column) RETURN path" WARNING: This endpoint executes arbitrary Cypher queries. It is intended for internal debugging and power users only. All queries are logged for audit purposes. Note: Results include both tabular data (for displaying in tables) and graph data (nodes/edges for graph visualization). 
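As a minimal sketch, the first example query above could be submitted with `curl`, assuming `DATAFOLD_API_KEY` holds a valid API key passed in the `Authorization: Key ...` header (see the OpenAPI spec below for the full request and response schemas):

```bash theme={null}
# Hypothetical example: list ten dataset names from the lineage graph.
curl -X POST "https://app.datafold.com/api/internal/lineagev2/cypher" \
  -H "Authorization: Key $DATAFOLD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"query": "MATCH (t:Dataset) RETURN t.name LIMIT 10"}'
```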
## OpenAPI ````yaml openapi-public.json post /api/internal/lineagev2/cypher openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/internal/lineagev2/cypher: post: tags: - lineagev2 summary: Run Cypher description: >- Execute arbitrary Cypher query and return results. Args: request: CypherRequest with query string Returns: CypherResponse containing: - columns: List of column names returned by the query - results: List of result rows as dictionaries (tabular view) - nodes: All graph nodes returned by the query - edges: All graph edges/relationships returned by the query Example queries: - Find all tables: "MATCH (t:Dataset) RETURN t.name LIMIT 10" - Find circular dependencies: "MATCH (t:Dataset)-[:DEPENDS_ON*]->(t) RETURN t" - Count by type: "MATCH (d:Dataset) RETURN d.asset_type, count(*) as count" - Complex lineage: "MATCH path=(c1:Column)-[:DERIVED_FROM*1..3]->(c2:Column) RETURN path" WARNING: This endpoint executes arbitrary Cypher queries. It is intended for internal debugging and power users only. All queries are logged for audit purposes. Note: Results include both tabular data (for displaying in tables) and graph data (nodes/edges for graph visualization). 
operationId: run_cypher_api_internal_lineagev2_cypher_post requestBody: content: application/json: schema: $ref: '#/components/schemas/CypherRequest' required: true responses: '200': content: application/json: schema: $ref: '#/components/schemas/CypherResponse' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: CypherRequest: properties: query: title: Query type: string required: - query title: CypherRequest type: object CypherResponse: properties: columns: items: type: string title: Columns type: array edges: items: $ref: '#/components/schemas/CypherEdge' title: Edges type: array nodes: items: $ref: '#/components/schemas/CypherNode' title: Nodes type: array results: items: additionalProperties: true type: object title: Results type: array required: - columns - results - nodes - edges title: CypherResponse type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object CypherEdge: properties: id: title: Id type: string properties: additionalProperties: true title: Properties type: object source: title: Source type: string target: title: Target type: string type: title: Type type: string required: - id - source - target - type - properties title: CypherEdge type: object CypherNode: properties: id: title: Id type: string labels: items: type: string title: Labels type: array properties: additionalProperties: true title: Properties type: object required: - id - labels - properties title: CypherNode type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/security/single-sign-on/saml.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # SAML > SAML (Security Assertion Markup Language) is a protocol that enables secure user authentication by integrating Identity Providers (IdPs) with Service Providers (SPs). **NOTE** SAML SSO is available for both SaaS and VPC installations of Datafold. In this case, Datafold is the service provider. The Identity Providers can be anything used by the organization (e.g., Google, Okta, Duo). We also support SAML SSO [group provisioning](/security/single-sign-on/saml/group-provisioning). ## Generic SAML Identity Providers **TIP** We also provide SAML identity providers configurations for ([Okta](/security/single-sign-on/saml/examples/okta), [Microsoft Entra ID](/security/single-sign-on/saml/examples/microsoft-entra-id-configuration), and [Google](/security/single-sign-on/saml/examples/google)) To configure a SAML provider: 1. Go to `Datafold`. Create a new integration by navigating to **Settings** → **Integrations** → **SSO** → **Add new integration** → **SAML**. 1. Go to the organization's `Identity Provider`, create a **SAML application** (sometimes called a **single sign-on** or **SSO** method). If you have the option, enable the SAML Response signature and set it to **whole-response signing**. 1. 
Copy and paste the Service Provider URLs from the `Datafold` SAML Integration into the `Identity Provider`'s application setup. The only two mandatory fields are **Service Provider Entity ID** and the **Service Provider ACS URL**. After creation, the `Identity Provider` will show you the metadata XML. It may be presented as raw XML, a URL to the XML, or an XML file to download. **INFO** Identity Providers sometimes provide additional parameters, such as SSO URLs, ACS URLs, SLO URLs, etc. We gather this information from the XML directly, so these can be safely ignored. 1. Paste either the **metadata XML** *or* the **metadata URL** from your `Identity Provider` into the respective `Datafold` SAML integration fields. 2. Finally, click the **Save** button to create the integration. After creation, the SAML login button will be available for Datafold users in your organization. 1. In your `Identity Provider`, activate the SAML application for all users or for select groups. **CAUTION** Only configured users in your identity provider will be able to log in to Datafold *using* SAML SSO. ### Auto-create users in Datafold Go to `Datafold` and navigate to **Settings** → **Integrations** → **SSO** → **SAML**. Enable the **Allow SAML to auto-create users in Organization** switch and save the integration. If the **Allow SAML to auto-create users in Organization** switch is enabled, the SAML login button will always be enabled, and authenticated users (including identity provider-initiated logins) will be automatically created in Datafold. --- # Source: https://docs.datafold.com/integrations/databases/sap-hana.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # SAP HANA **INFO** Column-level Lineage is not currently supported for SAP HANA. **Steps to complete:** 1. [Create and authorize a user](#create-and-authorize-a-user) 2. [Create schema for Datafold](#create-schema-for-datafold) 3. [Configure in Datafold](#configure-in-datafold) ## Create and authorize a user Create a new user `DATAFOLD` using the SAP HANA Administration console (Systems-Security-Users). Specify password authentication, and set "Force password change on next logon" to "No". Grant MONITORING privileges for the databases to be diffed. ## Create schema for Datafold Datafold utilizes a temporary schema to materialize scratch work and keep data processing in your warehouse. ``` CREATE SCHEMA datafold_tmp OWNED BY DATAFOLD; ``` ## Configure in Datafold | Field Name | Description | | --------------------------- | ---------------------------------------------------- | | Name | A name given to the data connection within Datafold. | | Host | The hostname of your database. | | Port | SAP HANA connection port; default value is 443. | | User | The user created above, named DATAFOLD. | | Password | The password for user DATAFOLD. | | Schema for temporary tables | The schema created above, named datafold\_tmp | Click **Create**. Your data connection is ready!
--- # Source: https://docs.datafold.com/data-monitoring/monitors/schema-change-monitors.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Schema Change Monitors > Schema Change monitors notify you when a table’s schema changes, such as when columns are added, removed, or data types are modified. **INFO** Please contact [support@datafold.com](mailto:support@datafold.com) if you'd like to enable this feature for your organization. Schema change monitors alert you when a table’s schema changes in any of the following ways: * Column added * Column removed * Data type changed ## Create a Schema Change monitor There are two ways to create a Schema Change monitor: 1. Open the **Monitors** page, select **Create new monitor**, and then choose **Schema Change**. 2. Clone an existing Schema Change monitor by clicking **Actions** and then **Clone**. This will pre-fill the form with the existing monitor configuration. ## Set up your monitor To set up a Schema Change monitor, simply select your data connection and the table you wish to monitor for changes. ## Add a schedule You can choose to run your monitor daily, hourly, or even input a cron expression for more complex scheduling: ## Add notifications Receive notifications via Slack or email when at least one record fails your test: ## FAQ Yes, but in a different context. While data diffs report on schema differences *between two tables at the same time* (unless you’re using the time travel feature), data diff monitors alert you to schema changes for the *same table over time*. ## Need help? If you have any questions about how to use Schema Change monitors, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com). --- # Source: https://docs.datafold.com/api-reference/lineagev2/search-entities.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Search Entities > Search for datasets and columns by name. Args: q: Search query string (minimum 2 characters). Searches in dataset/column names and IDs. limit: Maximum number of results to return per type (default: 50) Returns: SearchResponse containing: - datasets: List of matching tables/views with metadata (asset type, column count, row count, popularity) - columns: List of matching columns with table context and popularity Example: - Search for tables: q="customer" returns all datasets with "customer" in the name - Search for columns: q="email" returns all columns with "email" in the name ## OpenAPI ````yaml openapi-public.json get /api/internal/lineagev2/search openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. 
::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/internal/lineagev2/search: get: tags: - lineagev2 summary: Search Entities description: |- Search for datasets and columns by name. Args: q: Search query string (minimum 2 characters). Searches in dataset/column names and IDs. limit: Maximum number of results to return per type (default: 50) Returns: SearchResponse containing: - datasets: List of matching tables/views with metadata (asset type, column count, row count, popularity) - columns: List of matching columns with table context and popularity Example: - Search for tables: q="customer" returns all datasets with "customer" in the name - Search for columns: q="email" returns all columns with "email" in the name operationId: search_entities_api_internal_lineagev2_search_get parameters: - in: query name: q required: true schema: title: Q type: string - in: query name: limit required: false schema: default: 50 title: Limit type: integer responses: '200': content: application/json: schema: $ref: '#/components/schemas/SearchResponse' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: SearchResponse: properties: columns: items: $ref: '#/components/schemas/ColumnSearchResult' title: Columns type: array datasets: items: $ref: '#/components/schemas/DatasetSearchResult' title: Datasets type: array required: - datasets - columns title: SearchResponse type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ColumnSearchResult: properties: id: title: Id type: string isSource: anyOf: - type: boolean - type: 'null' title: Issource name: title: Name type: string popularity: default: 0 title: Popularity type: number tableId: title: Tableid type: string tableName: title: Tablename type: string totalQueries30d: anyOf: - type: integer - type: 'null' title: Totalqueries30D required: - id - name - tableId - tableName title: ColumnSearchResult type: object DatasetSearchResult: properties: assetType: title: Assettype type: string columnCount: anyOf: - type: integer - type: 'null' title: Columncount definitionSql: anyOf: - type: string - type: 'null' title: Definitionsql id: title: Id type: string isSource: anyOf: - type: boolean - type: 'null' title: Issource name: title: Name type: string popularity: default: 0 title: Popularity type: number rowCount: anyOf: - type: integer - type: 'null' title: Rowcount statementType: anyOf: - type: string - type: 'null' title: Statementtype totalQueries30d: anyOf: - type: integer - type: 'null' title: Totalqueries30D required: - id - name - assetType title: DatasetSearchResult type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/lineagev2/search-for-datasets-and-columns-in-the-lineage-graph.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before 
exploring further. # Search for datasets and columns in the lineage graph > Search for datasets (tables, views) and columns by name in the lineage graph. Returns matching datasets and columns with metadata including popularity scores, query counts, and structural information. Results are ranked by name match. Use this to discover data assets before exploring their lineage relationships. ## OpenAPI ````yaml openapi-public.json get /api/v1/lineagev2/search openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/lineagev2/search: get: tags: - lineagev2 summary: Search for datasets and columns in the lineage graph description: >- Search for datasets (tables, views) and columns by name in the lineage graph. Returns matching datasets and columns with metadata including popularity scores, query counts, and structural information. Results are ranked by name match. Use this to discover data assets before exploring their lineage relationships. operationId: lineagev2_search parameters: - in: query name: q required: true schema: title: Q type: string - in: query name: limit required: false schema: default: 50 title: Limit type: integer responses: '200': content: application/json: schema: $ref: '#/components/schemas/SearchResponse' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: SearchResponse: properties: columns: items: $ref: '#/components/schemas/ColumnSearchResult' title: Columns type: array datasets: items: $ref: '#/components/schemas/DatasetSearchResult' title: Datasets type: array required: - datasets - columns title: SearchResponse type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ColumnSearchResult: properties: id: title: Id type: string isSource: anyOf: - type: boolean - type: 'null' title: Issource name: title: Name type: string popularity: default: 0 title: Popularity type: number tableId: title: Tableid type: string tableName: title: Tablename type: string totalQueries30d: anyOf: - type: integer - type: 'null' title: Totalqueries30D required: - id - name - tableId - tableName title: ColumnSearchResult type: object DatasetSearchResult: properties: assetType: title: Assettype type: string columnCount: anyOf: - type: integer - type: 'null' title: Columncount definitionSql: anyOf: - type: string - type: 'null' title: Definitionsql id: title: Id type: string isSource: anyOf: - type: boolean - type: 'null' title: Issource name: title: Name type: string popularity: default: 0 title: Popularity type: number rowCount: anyOf: - type: integer - type: 'null' title: Rowcount statementType: anyOf: - type: string - type: 'null' title: Statementtype totalQueries30d: anyOf: - type: integer - type: 'null' title: Totalqueries30D required: - 
id - name - assetType title: DatasetSearchResult type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/security/securing-connections.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Securing Connections > Datafold supports multiple options to secure connections between your resources (e.g., databases and BI tools) and Datafold. ## Encryption When you connect to Datafold to query your data in a database (e.g., BigQuery), communications are secured using HTTPS encryption. ## IP whitelisting If access to your data connection is restricted to IP addresses on an allowlist, you will need to manually add Datafold's addresses in order to use our product. Otherwise, you will receive a connection error when setting up your data connection. For SaaS (app.datafold.com) deployments, whitelist the following IP addresses: * `23.23.71.47` * `35.166.223.86` * `52.11.132.23` * `54.71.177.163` * `54.185.25.103` * `54.210.34.216` Note that at any given time, you will only see one of these addresses in use. However, the active IP address can change, so you should add them all to your IP whitelist to ensure no interruptions in service. ## Private Link ### AWS PrivateLink AWS PrivateLink allows you to connect Datafold to your databases without exposing data to the internet. This option is available for both Datafold SaaS Cloud and all Datafold Dedicated Cloud options. The following diagram shows the architecture for a customer with a High Availability RDS setup: ### Setup Supported databases The following setup assumes you have an RDS/Aurora database you want to connect to. Datafold also supports PrivateLink connections to other databases such as Snowflake, which should only be accessed from your VPC. Please contact [support@datafold.com](mailto:support@datafold.com) to get assistance with connecting to your specific database. Our support team will send you the following: * The role ARN to establish the PrivateLink connection. * Datafold SaaS Cloud VPC CIDR range. You need to do the following steps: 1. Send us the region(s) where your database(s) are located. 2. Create a VPC Endpoint Service and NLB. * The core concepts of this setup are described in this AWS blog: [Access Amazon RDS across VPCs using AWS PrivateLink and Network Load Balancer](https://aws.amazon.com/blogs/database/access-amazon-rds-across-vpcs-using-aws-privatelink-and-network-load-balancer/). * If your databases are HA, please implement the failover mechanics described in the blog. * A CloudFormation template for inspiration can be found [here](https://github.com/aws-samples/amazon-rds-crossaccount-access/blob/main/CrossAccountRDSAccess.yml). * You'll need to create a Network Load Balancer that points to your database and a VPC Endpoint Service that exposes the NLB. * Configure security groups to allow traffic from Datafold's VPC to your database. * If your databases are HA (High Availability), implement automatic failover mechanics to ensure the NLB routes to the active database instance. 
* For detailed step-by-step instructions, see our [**AWS PrivateLink Setup Guide**](/security/aws_privatelink_setup). 3. Add the provided role ARN as 'Allowed Principal' on the VPC Endpoint Service. 4. Allow ingress from the Datafold SaaS Cloud VPC. 5. Send us the: * Service name(s), e.g. `com.amazonaws.vpce.us-west-2.vpce-svc-0cfd2f258c4395ad6`. * Availability Zone ID(s) used in the VPCE Service(s), e.g. `use1-az6` or `usw2-az3`. * RDS/Aurora hostname(s), e.g. `datafold.c2zezoge6btk.us-west-2.rds.amazonaws.com`. At the end, the database hostname used to configure the data source will be the original RDS/Aurora hostname. But with private DNS resolution, we will resolve the hostname to the VPC Endpoint. Our support team will let you know when everything is set up and you can accept the PrivateLink connection and start configuring the data source. **Detailed Instructions** For comprehensive step-by-step instructions including security group configuration, target group setup, Lambda-based automatic failover for HA setups, and troubleshooting, see our [**AWS PrivateLink Setup Guide**](/security/aws_privatelink_setup). ### Cross-Region PrivateLink Datafold SaaS Cloud supports cross-region PrivateLink for all North American regions. Datafold SaaS Cloud is located in `us-west-2`. Datafold manages the cross-region networking, allowing you to connect to a VPC Endpoint in the same region as your VPC Endpoint Service. For Datafold Dedicated Cloud customers, deployment occurs in your chosen region. If you need to connect to databases in multiple regions, Datafold also supports this through cross-region PrivateLink. The setup will be similar to the regular PrivateLink setup. ### Private Service Connect Google Cloud's Private Service Connect is only available if both parties are in the same cloud region. This option is only available for Datafold Dedicated Cloud customers. The diagram below illustrates how the solution works: The basics of Private Service Connect are available [here](https://cloud.google.com/vpc/docs/private-service-connect). ### Azure Private Link Azure Private Link is only available if both parties are in the same cloud region. This option is only available for Datafold Dedicated Cloud customers. The diagram below illustrates how the solution works: The basics of Private Link are available [here](https://learn.microsoft.com/en-us/azure/private-link/private-link-overview). For Customer-Hosted Dedicated Cloud, achieving cross-tenant access requires using Private Link. The documentation can be accessed [here](https://learn.microsoft.com/en-us/azure/architecture/guide/networking/cross-tenant-secure-access-private-endpoints). ## VPC Peering (SaaS) VPC Peering is easier to set up than Private Link, but a drawback is that both networks are joined and the IP ranges must not overlap. For Datafold SaaS Cloud, this setup is an AWS-only option. The basics of VPC peering are covered [here](https://docs.aws.amazon.com/vpc/latest/peering/vpc-peering-basics.html). To set up VPC peering, please contact [support@datafold.com](mailto:support@datafold.com) and provide us with the following information: * AWS region where your database is hosted. * ID of the VPC that you would like to connect. * CIDR of the VPC. If there are no address collisions, we'll send you a peering request and CIDR that we use on our end, and whitelist the CIDR range for your organization. You'll need to set up routing to this CIDR through the peering connection. 
If you activate DNS on your side of the peering connection, you can use the private DNS hostname to connect. Otherwise, you need to use the IP. ## VPC Peering (Dedicated Cloud) VPC Peering is a supported option for all cloud providers, both for Datafold-hosted and customer-hosted deployments. Basic information for each cloud provider can be found here: * [AWS](https://docs.aws.amazon.com/vpc/latest/peering/vpc-peering-basics.html) * [GCP](https://cloud.google.com/vpc/docs/vpc-peering) * [Azure](https://learn.microsoft.com/en-us/azure/virtual-network/create-peering-different-subscriptions?tabs=create-peering-portal) **VPC vs VNet** We use the term VPC across all major cloud providers. However, Azure calls this concept a Virtual Network (VNet). ## SSH Tunnel To set up a tunnel, please contact our team at [support@datafold.com](mailto:support@datafold.com) and provide the following information: * Hostname of your bastion host and the port number used for the SSH service. * Hostname and port number of your database. * SSH fingerprint of the bastion host (optional). We'll get back to you with: * SSH public key that you need to add to `~/.ssh/authorized_keys`. * IP address and port to use for data connection configuration in the Datafold application. ## IPSec tunnel Please contact our team at [support@datafold.com](mailto:support@datafold.com) for more information. --- # Source: https://docs.datafold.com/security/single-sign-on.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Single Sign-On > Set up Single Sign-On with one of the following options. **Tip** You can force all users to use the configured SSO provider by unchecking the *Allow non-admin users to login with email and password* checkbox under the organization settings. Admin users will still be able to log in using email and password. **Caution** Ensure that only authorized users keep using Datafold by setting up Okta webhooks, or by setting up credentials for the Microsoft Entra app if you're using Microsoft Entra ID (formerly known as Azure Active Directory). This will disable non-admin users who don't have access to the configured SSO app. [Configure this for Okta](/security/single-sign-on/okta#synchronize-state-with-datafold-optional) [Configure this for Microsoft Entra ID](/security/single-sign-on/saml/examples/microsoft-entra-id-configuration#synchronize-user-with-datafold-optional) --- # Source: https://docs.datafold.com/integrations/notifications/slack.md # Slack > Receive notifications for monitors in Slack. ## Prerequisites * Slack admin access or permissions to manage integrations * A Datafold account with admin privileges ## Configure the Integration 1. In Datafold, go to Settings > Integrations > Notifications 2. Click "Add New Integration" 3. Select "Slack" 4. You'll be automatically redirected to Slack 5. If you're not already signed in, sign in to your Slack account 6. Click "Allow" to grant Datafold the necessary permissions 7. You'll be redirected back to Datafold You're all set! When you configure a monitor in Datafold, you'll now have the option to send notifications to Slack. ## Monitors as Code Configuration If you're using [monitors as code](/data-monitoring/monitors-as-code), you can configure Slack notifications by adding a `notifications` section to your monitor definition as follows: ```yaml theme={null}
monitors:
  <monitor name>:
    ...
    notifications:
      - type: slack
        integration: <integration id>
        channel: <channel name>
        mentions:
          - <user or group name>
          - here
          - channel
    ...
```
* `<integration id>` can be found in Datafold -> Settings -> Integrations -> Notifications -> your Slack integration #### Full example ```yaml theme={null}
monitors:
  uniqueness_test_example:
    type: test
    enabled: true
    connection_id: 1123
    test:
      type: unique
      tables:
        - path: DEV.DATA_DEV.USERS
          columns:
            - USERNAME
    schedule:
      interval:
        every: hour
    notifications:
      - type: slack
        integration: 13
        channel: dev-notifications
        mentions:
          - John Doe
          - channel
```
## Need help? If you have any questions about integrating with Slack, please reach out to our team via Slack, in-app chat, or email us at [support@datafold.com](mailto:support@datafold.com). --- # Source: https://docs.datafold.com/deployment-testing/best-practices/slim-diff.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Slim Diff > Choose which downstream tables to diff to optimize time, cost, and performance. By default, Datafold diffs all modified models and downstream models. However, it won't make sense for all organizations to diff every downstream table every time you make a code update. Tradeoffs of time, cost, and risk must be considered. That's why we created Slim Diff. With Slim Diff enabled, Datafold will only diff models with dbt code changes in your Pull Request (PR). ## Setting up Slim Diff In Datafold, Slim Diff can be enabled by adjusting your diff settings: navigate to Settings → Integrations → CI → Select your CI tool → Advanced Settings and check the Slim Diff box: ## Diffing only modified models With this setting turned on, only the modified models will be diffed by default. ## Diff individual downstream models Once Datafold has diffed only the modified models, you still have the option of diffing individual downstream models right within your PR. ## Diff all downstream models You can also add the `datafold:diff-all-downstream` label within your PR, which will automatically diff *all* downstream models. ## Explicitly define which models to always diff Finally, with Slim Diff turned on, there might be certain models or subdirectories that you want to *always* diff when downstream. You can think of this as an exclusion to the Slim Diff behavior. Apply the `slim_diff: diff_when_downstream` meta tag to individual models or entire folders in your `dbt_project.yml` file: ```Bash theme={null}
models:
  <project name>:
    <model name>:
      +materialized: view
    <model name>:
      +meta:
        datafold:
          datadiff:
            slim_diff: diff_when_downstream
    <folder name>:
      +meta:
        datafold:
          datadiff:
            slim_diff: diff_when_downstream
```
These meta tags can also be added in individual YAML files or in config blocks. More details about using meta tags are available in [the dbt docs](https://docs.getdbt.com/reference/resource-configs/meta). With this configuration in place, Slim Diff will prevent downstream models from being run *unless* they have been designated as exceptions with the `slim_diff: diff_when_downstream` dbt meta tag. As usual, once the PR has been opened, you'll still have the option of diffing individual downstream models that weren't diffed, or diffing all downstream models using the `datafold:diff-all-downstream` label. --- # Source: https://docs.datafold.com/integrations/databases/snowflake.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further.
# Snowflake **NOTE**: Datafold needs permissions in your Snowflake dataset to read your table data. You will need to be a Snowflake *Admin* in order to grant the required permissions. **Steps to complete:** * [Create a user and role for Datafold](/integrations/databases/snowflake#create-a-user-and-role-for-datafold) * [Setup password-based](/integrations/databases/snowflake#set-up-password-based-authentication) or [Use key-pair authentication](/integrations/databases/snowflake#use-key-pair-authentication) * [Create a temporary schema](/integrations/databases/snowflake#create-schema-for-datafold) * [Give the Datafold role access to your warehouse](/integrations/databases/snowflake#give-the-datafold-role-access) * [Configure your data connection in Datafold](/integrations/databases/snowflake#configure-in-datafold) ## Create a user and role for Datafold > A [full script](/integrations/databases/snowflake#full-script) can be found at the bottom of this page. It is best practice to create a separate role for the Datafold integration (e.g., `DATAFOLDROLE`): ``` CREATE ROLE DATAFOLDROLE; CREATE USER DATAFOLD DEFAULT_ROLE = "DATAFOLDROLE" MUST_CHANGE_PASSWORD = FALSE; GRANT ROLE DATAFOLDROLE TO USER DATAFOLD; ``` To provide column-level lineage, Datafold needs to read & parse all SQL statements executed in your Snowflake account: ``` GRANT MONITOR EXECUTION ON ACCOUNT TO ROLE DATAFOLDROLE; GRANT IMPORTED PRIVILEGES ON DATABASE SNOWFLAKE TO ROLE DATAFOLDROLE; ``` ## Set up password-based authentication Datafold supports username/password authentication, but also key-pair authentication. ``` ALTER USER DATAFOLD SET PASSWORD = 'SomethingSecret'; ``` You can set the username/password in the Datafold web UI. ### Use key-pair authentication If you would like to use key-pair authentication, go to **Settings** -> **Data Connections** -> **Your Snowflake Connection**, and change Authentication method from **Password** to **Key Pair**. Generate and Download the Key Pair file, and use the value within the file when running the following command in Snowflake to set the key for this Snowflake role: ``` ALTER USER DATAFOLD SET rsa_public_key='...' ``` ## Create schema for Datafold Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in the your warehouse. ``` CREATE SCHEMA .DATAFOLD_TMP; GRANT ALL ON SCHEMA .DATAFOLD_TMP TO DATAFOLDROLE; ``` ## Give the Datafold role access Datafold will only scan the tables that it has access to. The snippet below will give Datafold read access to a database. If you have more than one database that you want to use in Datafold, rerun the script below for each one. ```Bash theme={null} /* Repeat for every DATABASE to be usable in Datafold. 
This allows Datafold to correctly discover, profile & diff each table */ GRANT USAGE ON WAREHOUSE TO ROLE DATAFOLDROLE; GRANT USAGE ON DATABASE TO ROLE DATAFOLDROLE; GRANT USAGE ON ALL SCHEMAS IN DATABASE TO ROLE DATAFOLDROLE; GRANT USAGE ON FUTURE SCHEMAS IN DATABASE TO ROLE DATAFOLDROLE; GRANT SELECT ON ALL TABLES IN DATABASE TO ROLE DATAFOLDROLE; GRANT SELECT ON FUTURE TABLES IN DATABASE TO ROLE DATAFOLDROLE; GRANT SELECT ON ALL VIEWS IN DATABASE TO ROLE DATAFOLDROLE; GRANT SELECT ON FUTURE VIEWS IN DATABASE TO ROLE DATAFOLDROLE; GRANT SELECT ON ALL MATERIALIZED VIEWS IN DATABASE TO ROLE DATAFOLDROLE; GRANT SELECT ON FUTURE MATERIALIZED VIEWS IN DATABASE TO ROLE DATAFOLDROLE; GRANT ALL PRIVILEGES ON ALL DYNAMIC TABLES IN DATABASE TO ROLE DATAFOLDROLE; GRANT SELECT ON FUTURE DYNAMIC TABLES IN DATABASE TO ROLE DATAFOLDROLE; ``` ## Full Script ```Bash theme={null} --Step 1: Create a user and role for Datafold CREATE ROLE DATAFOLDROLE; CREATE USER DATAFOLD DEFAULT_ROLE = "DATAFOLDROLE" MUST_CHANGE_PASSWORD = FALSE; GRANT ROLE DATAFOLDROLE TO USER DATAFOLD; GRANT MONITOR EXECUTION ON ACCOUNT TO ROLE DATAFOLDROLE; GRANT IMPORTED PRIVILEGES ON DATABASE SNOWFLAKE TO ROLE DATAFOLDROLE; --Step 2a: Use password-based authentication ALTER USER DATAFOLD SET PASSWORD = 'SomethingSecret'; --OR --Step 2b: Use key-pair authentication --ALTER USER DATAFOLD SET rsa_public_key='abc..' --Step 3: Create schema for Datafold CREATE SCHEMA .DATAFOLD_TMP; GRANT ALL ON SCHEMA .DATAFOLD_TMP TO DATAFOLDROLE; --Step 4: Give the Datafold role access to your data connection /* Repeat for every DATABASE to be usable in Datafold. This allows Datafold to correctly discover, profile & diff each table */ GRANT USAGE ON WAREHOUSE TO ROLE DATAFOLDROLE; GRANT USAGE ON DATABASE TO ROLE DATAFOLDROLE; GRANT USAGE ON ALL SCHEMAS IN DATABASE TO ROLE DATAFOLDROLE; GRANT USAGE ON FUTURE SCHEMAS IN DATABASE TO ROLE DATAFOLDROLE; GRANT SELECT ON ALL TABLES IN DATABASE TO ROLE DATAFOLDROLE; GRANT SELECT ON FUTURE TABLES IN DATABASE TO ROLE DATAFOLDROLE; GRANT SELECT ON ALL VIEWS IN DATABASE TO ROLE DATAFOLDROLE; GRANT SELECT ON FUTURE VIEWS IN DATABASE TO ROLE DATAFOLDROLE; GRANT SELECT ON ALL MATERIALIZED VIEWS IN DATABASE TO ROLE DATAFOLDROLE; GRANT SELECT ON FUTURE MATERIALIZED VIEWS IN DATABASE TO ROLE DATAFOLDROLE; ``` ## Validate Snowflake Grants for Datafold Run these queries to validate that the grants have been set up correctly: > Note: More results may be returned than shown in the screenshots below if you have granted access to multiple roles/users Example Placeholders: * `` = `DEV` * `` = `DEMO` ``` -- Validate database usage for the DATAFOLDROLE SHOW GRANTS ON DATABASE ; ``` ``` -- Validate warehouse usage for the DATAFOLDROLE SHOW GRANTS ON WAREHOUSE ; ``` ``` -- Validate schema permissions for the DATAFOLDROLE SHOW GRANTS ON SCHEMA .DATAFOLD_TMP; ``` ## A note on future grants The above database grants will be insufficient if any future grants have been defined at the schema level, because [schema-level grants will override database-level grants](https://docs.snowflake.com/en/sql-reference/sql/grant-privilege#considerations). In that case, you will need to execute future grants for every existing *schema* that Datafold will operate on. ```Bash theme={null} GRANT SELECT ON FUTURE TABLES IN SCHEMA . TO ROLE DATAFOLDROLE; GRANT SELECT ON FUTURE VIEWS IN SCHEMA . TO ROLE DATAFOLDROLE; GRANT SELECT ON FUTURE MATERIALIZED VIEWS IN SCHEMA . TO ROLE DATAFOLDROLE; GRANT SELECT ON ALL TABLES IN SCHEMA . 
TO ROLE DATAFOLDROLE; GRANT SELECT ON ALL VIEWS IN SCHEMA . TO ROLE DATAFOLDROLE; GRANT SELECT ON ALL MATERIALIZED VIEWS IN SCHEMA . TO ROLE DATAFOLDROLE; ``` ## Configure in Datafold | Field Name | Description | | --------------------------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | Name | A name given to the data connection within Datafold | | Account identifier | The Org name-Account name pair for your Snowflake account. This can be found in the browser address string. It may look like [https://orgname-accountname.snowflakecomputing.com](https://orgname-accountname.snowflakecomputing.com) or [https://app.snowflake.com/orgname/accountname](https://app.snowflake.com/orgname/accountname). In the setup form, enter \-\. | | User | The username set in the [Setup password-based](/integrations/databases/snowflake#set-up-password-based-authentication) authentication section | | Password | The password set in the [Setup password-based](/integrations/databases/snowflake#set-up-password-based-authentication) authentication section | | Key Pair file | The key file generated in the [Use key-pair authentication](/integrations/databases/snowflake#use-key-pair-authentication) section | | Warehouse | The Snowflake warehouse name | | Schema for temporary tables | The schema name you created with our script (\.DATAFOLD\_TMP) | | Role | The role you created for Datafold (Typically DATAFOLDROLE) | | Default DB | A database the role above can access. If more than one database was added, whichever you prefer to be the default | > Note: Please review the documentation for the account name. Datafold uses Format 1 (Preferred): [https://docs.snowflake.com/en/user-guide/admin-account-identifier#using-an-account-locator-as-an-identifier](https://docs.snowflake.com/en/user-guide/admin-account-identifier#using-an-account-locator-as-an-identifier) Click **Create**. Your data connection is ready! --- # Source: https://docs.datafold.com/deployment-testing/configuration/datafold-ci/specifc.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Running Data Diff on Specific Branches > By default, Datafold CI runs on every new pull/merge request and commits to existing ones. You can set **Custom base branch** option in the Datafold app [CI settings](https://app.datafold.com/settings/integrations/ci), to only run Datafold CI on pull requests that have a specific base branch. This might be useful if you have multiple environments built from different branches. For example, `staging` and `production` environments built from `staging` and `main` branches respectively. Using the option, you can have 2 different CI configurations in Datafold, one for each environment, and only run the CI for the corresponding branch. --- # Source: https://docs.datafold.com/deployment-testing/configuration/model-specific-ci/sql-filters.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. 
# SQL Filters > Use dbt YAML configuration to set model-specific filters for Datafold CI. SQL filters can be helpful in two scenarios: 1. When **Production** and **Staging** environments are not built using the same data. For example, if **Staging** is built using a subset of production data, filters can be applied to ensure that both environments are on par and can be diffed. 2. To improve Datafold CI performance by reducing the volume of data compared, e.g., only comparing the last 3 months of data. SQL filters are an effective technique to speed up diffs by narrowing the data diffed. A SQL filter adds a `WHERE` clause to allow you to filter data on both sides using standard SQL filter expressions. They can be added to dbt YAML under the `meta.datafold.datadiff.filter` tag: ``` models: - name: users meta: datafold: datadiff: filter: "user_id > 2350 AND source_timestamp >= current_date() - 7" ``` --- # Source: https://docs.datafold.com/integrations/databases/sql-server.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Microsoft SQL Server **INFO** Column-level Lineage is not currently supported for Microsoft SQL Server. **Steps to complete:** 1. [Run SQL script and create schema for Datafold](/integrations/databases/sql-server#run-sql-script-and-create-schema-for-datafold) 2. [Configure your data connection in Datafold](/integrations/databases/sql-server#configure-in-datafold) ## Run SQL script and create schema for Datafold To connect to Microsoft SQL Server, create a user with read-only access to all tables you wish to diff. Include read and write access to a Datafold-specific temp schema: ```Bash theme={null} /* Select the database that will contain the temp schema */ USE DatabaseName; /* Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in the your warehouse. */ CREATE SCHEMA datafold_tmp; /* Create the Datafold user */ CREATE LOGIN DatafoldUser WITH PASSWORD = 'SOMESECUREPASSWORD'; CREATE USER DatafoldUser FOR LOGIN DatafoldUser; /* Allow the user to create views */ GRANT CREATE VIEW TO DatafoldUser; /* Grant read access to diff tables */ GRANT SELECT ON SCHEMA::YourSchema TO DatafoldUser; /* Grant read + write access to datafold_tmp schema */ GRANT CONTROL ON SCHEMA::datafold_tmp TO DatafoldUser; ``` ## Configure in Datafold | Field Name | Description | | ---------------------------- | ---------------------------------------------------------------------------------------------------------------- | | Connection name | A name given to the data connection within Datafold | | Host | The hostname for your SQL Server instance | | Port | SQL Server connection port; default value is 1433 | | Username | The user created in our SQL script, named DatafoldUser | | Password | The password created in our SQL script | | Database | The name of the SQL Server database you want to connect to | | Dataset for temporary tables | The schema created in our SQL script, in database.schema format: DatabaseName.datafold\_tmp in our script above. | Click **Create**. Your data connection is ready! --- # Source: https://docs.datafold.com/integrations/databases/starburst.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Starburst **INFO** Column-level Lineage is not currently supported for Starburst. 
**Steps to complete:** 1. [Configure user in Starburst](#configure-user-in-starburst) 2. [Create schema for Datafold](#create-schema-for-datafold) 3. [Configure your data connection in Datafold](#configure-in-datafold) ## Configure user in Starburst To connect to Starburst, create a user with read-only access to all data sources you wish to diff and, optionally, generate an access token. Datafold requires a schema to be set up within one of the catalogs, typically hosted on platforms like Amazon S3 or similar services. ## Create schema for Datafold Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in your warehouse. ## Configure in Datafold | Field Name | Description | | --------------------------- | -------------------------------------------------------------------------------------------------------------------- | | Connection name | A name given to the data connection within Datafold. | | Host | The hostname for your Starburst instance (e.g., `sample-free-cluster.trino.galaxy.starburst.io` for Starburst SaaS). | | Port | Starburst endpoint port; default value is 443. | | Encryption | Should be checked for Starburst Galaxy, possibly unchecked for local deployments. | | User ID | User ID as created in Starburst, typically an email address. | | Token | Access token generated in Starburst. | | Password | Alternatively, provide a password. | | Schema for temporary tables | Use `<catalog>.<schema>` format. | Click **Create**. Your data source is now ready! --- # Source: https://docs.datafold.com/api-reference/dma_v2/start-a-dma-translation-job.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Start a DMA translation job > Start a translation job for a DMA project. Executes the DMA translation pipeline to convert source SQL code to target dialect. The pipeline processes code through multiple stages (file operations, reference extraction, template creation, SQL translation, validation, and bundling). This endpoint launches a long-running background workflow and returns immediately with a job_id.
Use the get_translation_status endpoint to poll for progress and results. operationId: start_translation parameters: - in: path name: project_id required: true schema: title: Project Id type: integer requestBody: content: application/json: schema: $ref: '#/components/schemas/ApiTranslateRequest' default: concurrency: 12 fail_fast: false preserve_dbt_temp_dirs: false recreate_all: false responses: '200': content: application/json: schema: $ref: '#/components/schemas/ApiTranslateTask' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: ApiTranslateRequest: description: Request to run translation pipeline. properties: asset_paths: anyOf: - items: type: string type: array - type: 'null' title: Asset Paths concurrency: default: 12 title: Concurrency type: integer drop_unresolved: anyOf: - items: type: string type: array - type: 'null' title: Drop Unresolved fail_fast: default: false title: Fail Fast type: boolean identity: anyOf: - additionalProperties: type: string type: object - type: 'null' title: Identity include_unverified: anyOf: - items: type: string type: array - type: 'null' title: Include Unverified max_iterations: anyOf: - type: integer - type: 'null' title: Max Iterations preserve_dbt_temp_dirs: default: false title: Preserve Dbt Temp Dirs type: boolean recreate_all: default: false title: Recreate All type: boolean stages: anyOf: - items: type: string type: array - type: 'null' title: Stages transform_group_ids: anyOf: - items: type: string type: array - type: 'null' title: Transform Group Ids title: ApiTranslateRequest type: object ApiTranslateTask: description: Response for translation task. properties: status: $ref: '#/components/schemas/JobStatus' task_id: title: Task Id type: string translated_models: anyOf: - items: $ref: '#/components/schemas/ApiTranslatedModel' type: array - type: 'null' title: Translated Models required: - task_id - status title: ApiTranslateTask type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object JobStatus: enum: - needs_confirmation - needs_authentication - waiting - processing - done - failed - cancelled title: JobStatus type: string ApiTranslatedModel: description: Information about a translated model. properties: asset_id: title: Asset Id type: string asset_name: title: Asset Name type: string datadiff_id: anyOf: - type: integer - type: 'null' title: Datadiff Id failure_summary: anyOf: - $ref: '#/components/schemas/ApiFailureSummary' - type: 'null' source_filename: anyOf: - type: string - type: 'null' title: Source Filename source_sql: anyOf: - type: string - type: 'null' title: Source Sql target_sql: anyOf: - type: string - type: 'null' title: Target Sql translation_status: $ref: '#/components/schemas/ApiTranslationStatus' required: - asset_name - asset_id - translation_status title: ApiTranslatedModel type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object ApiFailureSummary: description: Structured failure summary with problem, error, and solution sections. 
properties: error_message: title: Error Message type: string location: anyOf: - type: string - type: 'null' title: Location problem: title: Problem type: string reason: $ref: '#/components/schemas/ApiFailureReason' solution: title: Solution type: string required: - problem - error_message - solution - reason title: ApiFailureSummary type: object ApiTranslationStatus: enum: - no_translation_attempts - validation_pending - invalid_translation - valid_translation title: ApiTranslationStatus type: string ApiFailureReason: description: Reasons why a translation agent failed to complete its task. enum: - max_iterations - tool_error - resignation title: ApiFailureReason type: string securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/support/support.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Support > Datafold offers multiple support channels to assist users with troubleshooting and inquiries. ## Datafold Support * **Email**: Contact support at [support@datafold.com](mailto:support@datafold.com) for any assistance. * **In-app Chat**: Reach out directly from the Datafold app via live chat for quick help. * **Shared Slack Channel**: Collaborate with the Datafold team through a dedicated Slack channel (please inquire with your account executive to set up). * **FAQ**: Explore our [Frequently Asked Questions](/faq/overview) for detailed answers to common queries and troubleshooting tips. ### Grant access to Datafold's team for troubleshooting For faster resolution of support issues, you can temporarily grant Datafold Support access to your account. This enables a Datafold team member to view the same in-app context as you, minimizing back-and-forth communication. To grant access: 1. Navigate to **Settings** → **Org Settings**. 2. Check the box next to *"Allow Datafold access to your account for troubleshooting purposes."* To revoke access, simply uncheck the box at any time. **Note:** Admin privileges are required to modify this setting in Org Settings. --- # Source: https://docs.datafold.com/api-reference/bi/sync-a-bi-integration.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Sync a BI integration > Start an unscheduled synchronization of the integration. ## OpenAPI ````yaml get /api/v1/lineage/bi/{bi_datasource_id}/sync/ openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/lineage/bi/{bi_datasource_id}/sync/: get: tags: - BI summary: Sync a BI integration description: Start an unscheduled synchronization of the integration. 
operationId: resync_integration_api_v1_lineage_bi__bi_datasource_id__sync__get parameters: - in: path name: bi_datasource_id required: true schema: title: BI integration id type: integer responses: '200': content: application/json: schema: {} description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ````

---

# Source: https://docs.datafold.com/integrations/bi-data-apps/tableau.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.datafold.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Tableau

> Visualize downstream Tableau dependencies and understand how warehouse changes impact your BI layer.

## Overview

Our Tableau integration helps you visualize column-level lineage dependencies between warehouse tables and Tableau entities using [Data Explorer](/data-explorer/how-it-works).

**Note:** Lineage is only supported for Tableau assets in **Live** mode. Assets in **Extract** mode will not appear in Datafold lineage or dependency views.

Lineage from upstream data warehouses into Tableau is supported for the following data warehouse types:

* Snowflake
* Redshift
* Databricks
* BigQuery

Potentially impacted Tableau entity names are also automatically identified in the Datafold CI printout.

The following Tableau entity types will appear in Data Explorer, data diff results, and the Datafold CI printout:

* Tableau **Data Connections** and related fields;
* **Workbooks** and related fields;
* **Dashboards**.

To declutter the lineage view, Datafold filters out Tableau Data Connections and Data Connection fields that have no downstream dependencies.

If you're interested in learning more about the Datafold integration, [please reach out to our team](https://www.datafold.com/booktime).

## Set up your Tableau instance

To connect Datafold to Tableau, you will need the following credentials from your Tableau site:

* Server URL,
* Site Name,
* Token Name,
* Token Value.

## If you are using Tableau Server

**Tableau Server** is an installation of Tableau that you manage on your company's own infrastructure and domain. This is an alternative to using a Tableau Cloud subscription.

* Make sure that the [metadata-services](https://help.tableau.com/current/server/en-us/cli%5Fmaintenance%5Ftsm.htm#cat%5Fenable) are enabled by running the following command:

```
tsm maintenance metadata-services enable
```

* Ensure that your Tableau Server instance is accessible to Datafold. Please get in touch with our team to set this up.

## Obtaining server URL & Site Name

These can be found from the URL of your Tableau home page.
For instance, if your home page is:

```
https://eu-west-1a.online.tableau.com/#/site/mysupersite/home
```

Then:

* **Server URL** is `https://eu-west-1a.online.tableau.com` (the hostname with `https` in front)
* **Site Name** is `mysupersite` (the part directly after `#/site/` and until the next `/`)

## Obtaining Token Name & Token Value

Ensure that **Personal Access Tokens** are enabled on your Tableau site. For that, navigate to **Settings** and, on the **General** tab, search for `Personal Access Tokens`. That feature needs to be enabled, not necessarily for everyone, but at least for the user whose token Datafold will use.

Enable Personal Access Tokens

Now that Personal Access Tokens are enabled, click on your user's avatar in the top right, choose **My Account Settings** in the pop-up menu, and then search for **Personal Access Tokens** on your settings page.

Personal Access Token

Input a desired name, say `datafold`, into the **Token Name** field, and click **Create Token**. This will open a popup window. Click **Copy Secret** and save the copied value somewhere; you will use it when setting up Datafold.

You can read more about personal access tokens in the official Tableau documentation [here](https://help.tableau.com/current/server/en-us/security%5Fpersonal%5Faccess%5Ftokens.htm).

## Create a Tableau Integration

Navigate to **Settings** → **Integrations** → **Data Apps**. Click **Add new integration**.

Add New Integration

Clicking **Tableau** will lead you to the integration creation screen. Fill in the fields with the data we obtained earlier. See the screenshot for hints.

Tableau Integration Settings

…and click **Save**.

## What's next?

The initial sync might take some time; it depends on the number of objects on your Tableau site. Eventually, Tableau entities (**Data Connections**, **Workbooks**, and **Dashboards**) should appear in your **Lineage** tab.

**TIP**
[Tracking Jobs](/integrations/bi-data-apps/tracking-jobs) explains how to find out when your data app integration is ready.

Search Tableau Entities

Clicking on a Tableau entity will lead you to the Lineage screen:

Tableau Lineage Screen

**TIP**
As you might have noticed in the screenshots above, Datafold does not display Tableau **Sheets**. Instead, we group and deduplicate all **Fields** of all **Sheets** within a **Workbook** and display them as **Fields** of the **Workbook**. In the screenshot directly above, `Demo Workbook` might include one **Sheet** with a `Created At` field and another with a `Sub Plan` field, but for our purposes we unite all of those fields beneath the **Workbook**, which makes the Lineage graph much less cluttered and much easier to browse.

## FAQ

Lineage is only supported for Tableau assets in Live mode. Assets in Extract mode will not appear in Datafold lineage or dependency views.

Datafold retrieves Tableau metadata using the Tableau API, which may not immediately reflect recent changes due to internal caching. If your updates aren't showing up in Datafold, give it a few hours; they should appear once Tableau refreshes its metadata.

---

# Source: https://docs.datafold.com/integrations/databases/teradata.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.datafold.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Teradata

**INFO**
Column-level Lineage is not currently supported for Teradata.
**Steps to complete:**

1. [Configure user in Teradata](#configure-user-in-teradata)
2. [Create a temporary database for Datafold](#create-a-temporary-database-for-datafold)
3. [Configure data connection in Datafold](#configure-in-datafold)

## Configure user in Teradata

To connect to Teradata, create a user with read-only access to all databases you may wish to diff, including the login database:

```
CREATE USER DATAFOLD AS
  PERMANENT=1000000000 BYTES
  PASSWORD=<password>
  COLLATION = ASCII
  TIME ZONE = 'GMT';

GRANT EXECUTE FUNCTION ON DB1 TO DATAFOLD;
GRANT SELECT ON DB1 TO DATAFOLD;
...
GRANT SELECT ON DB9 TO DATAFOLD;
```

## Create a temporary database for Datafold

Datafold requires a database with full permissions in which to store temporary data:

```
CREATE DATABASE DATAFOLD_TMP AS PERMANENT=10000000000 BYTES;
GRANT ALL ON DATAFOLD_TMP TO DATAFOLD;
```

## Configure data connection in Datafold

| Field Name                    | Description                                                                                                                 |
| ----------------------------- | --------------------------------------------------------------------------------------------------------------------------- |
| Connection Name               | A name given to the data connection within Datafold.                                                                         |
| Host                          | The hostname for your Teradata instance (e.g., account-name-2e3ba8b32qac9d.env.clearscape.teradata.com for Teradata SaaS).   |
| Port                          | Teradata endpoint port; the default value is 1025.                                                                           |
| User ID                       | User ID, e.g., DATAFOLD.                                                                                                      |
| Password                      | Password from above.                                                                                                          |
| Database                      | The connection database, e.g., DB1 from above.                                                                                |
| Database for Temporary Tables | The temporary database, e.g., DATAFOLD\_TMP from above.                                                                       |

Click **Create**. Your data connection is now ready!

---

# Source: https://docs.datafold.com/api-reference/data-sources/test-a-data-source-connection.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.datafold.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Test a data source connection

## OpenAPI

````yaml post /api/v1/data_sources/{data_source_id}/test openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page.
::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/data_sources/{data_source_id}/test: post: tags: - Data sources summary: Test a data source connection operationId: test_data_source_api_v1_data_sources__data_source_id__test_post parameters: - in: path name: data_source_id required: true schema: title: Data source id type: integer responses: '200': content: application/json: schema: $ref: '#/components/schemas/TestDataSourceAsyncResponse' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: TestDataSourceAsyncResponse: properties: job_id: title: Job Id type: integer required: - job_id title: TestDataSourceAsyncResponse type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ````

---

# Source: https://docs.datafold.com/deployment-testing/configuration/model-specific-ci/time-travel.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.datafold.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Time Travel

> Use `prod_time_travel` and `pr_time_travel` to diff tables from specific points in time.

If your database supports time travel, you can diff tables from a particular point in time by specifying `prod_time_travel` for a production model and `pr_time_travel` for a PR model.

```yaml theme={null}
models:
  - name: users
    meta:
      datafold:
        datadiff:
          prod_time_travel:
            - 2022-02-07T00:00:00
          pr_time_travel:
            - 2022-02-07T00:00:00
```

---

# Source: https://docs.datafold.com/api-reference/monitors/toggle-a-monitor.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.datafold.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Toggle a Monitor

## OpenAPI

````yaml openapi-public.json put /api/v1/monitors/{id}/toggle openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/monitors/{id}/toggle: put: tags: - Monitors summary: Toggle a Monitor operationId: toggle_monitor_on_off_api_v1_monitors__id__toggle_put parameters: - description: The unique identifier of the monitor. in: path name: id required: true schema: description: The unique identifier of the monitor.
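# NOTE: an illustrative call (not part of the spec); monitor id 42 is a placeholder and
# DATAFOLD_API_KEY is assumed to be set. The request body only needs the "enabled" flag:
#   curl -X PUT "https://app.datafold.com/api/v1/monitors/42/toggle" \
#     -H "Authorization: Key $DATAFOLD_API_KEY" \
#     -H "Content-Type: application/json" \
#     -d '{"enabled": false}'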
title: Id type: integer requestBody: content: application/json: schema: $ref: >- #/components/schemas/Body_toggle_monitor_on_off_api_v1_monitors__id__toggle_put required: true responses: '200': content: application/json: schema: {} description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: Body_toggle_monitor_on_off_api_v1_monitors__id__toggle_put: properties: enabled: description: Indicate whether to enable or disable the monitor. title: Enabled type: boolean required: - enabled title: Body_toggle_monitor_on_off_api_v1_monitors__id__toggle_put type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/integrations/bi-data-apps/tracking-jobs.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Tracking Jobs > Track the completion and success of your data app integration syncs. To track the progress of your data app integration, go to the ** Jobs** tab in the left sidebar. Your **Search** and **Lineage** features will be available once you see a job marked as `Done` for your integration on this screen. **INFO** After the initial sync, Datafold will automatically re-sync every hour to keep your Data App assets up to date. --- # Source: https://docs.datafold.com/api-reference/ci/trigger-a-prmr-run.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Trigger a PR/MR run ## OpenAPI ````yaml post /api/v1/ci/{ci_config_id}/trigger openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. 
::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/ci/{ci_config_id}/trigger: post: tags: - CI summary: Trigger a PR/MR run operationId: trigger_ci_api_v1_ci__ci_config_id__trigger_post parameters: - in: path name: ci_config_id required: true schema: title: CI config id type: integer requestBody: content: application/json: schema: $ref: '#/components/schemas/APIPrDetails' required: true responses: '200': content: application/json: schema: $ref: '#/components/schemas/SubmitCiJob' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: APIPrDetails: properties: base_branch: title: Base Branch type: string base_sha: title: Base Sha type: string pr_branch: title: Pr Branch type: string pr_number: title: Pr Number type: integer pr_sha: title: Pr Sha type: string required: - pr_number - pr_branch - base_branch - pr_sha - base_sha title: APIPrDetails type: object SubmitCiJob: properties: ci_run_id: anyOf: - type: integer - type: 'null' title: Ci Run Id run_id: anyOf: - type: integer - type: 'null' title: Run Id title: SubmitCiJob type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/monitors/trigger-a-run.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Trigger a run ## OpenAPI ````yaml openapi-public.json post /api/v1/monitors/{id}/run openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/monitors/{id}/run: post: tags: - Monitors summary: Trigger a run operationId: start_monitor_run_api_v1_monitors__id__run_post parameters: - description: The unique identifier of the monitor for which to start the run. in: path name: id required: true schema: description: The unique identifier of the monitor for which to start the run. 
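# NOTE: an illustrative call (not part of the spec); monitor id 42 is a placeholder and
# DATAFOLD_API_KEY is assumed to be set:
#   curl -X POST "https://app.datafold.com/api/v1/monitors/42/run" \
#     -H "Authorization: Key $DATAFOLD_API_KEY"
# The response body contains run_id, the identifier of the newly started monitor run.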
title: Id type: integer responses: '200': content: application/json: schema: $ref: '#/components/schemas/ApiPublicMonitorTriggerRunResultOut' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: ApiPublicMonitorTriggerRunResultOut: properties: run_id: description: Unique identifier for the monitor run result. title: Run Id type: integer required: - run_id title: ApiPublicMonitorTriggerRunResultOut type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ````

---

# Source: https://docs.datafold.com/integrations/databases/trino.md

# Trino

**INFO**
Lineage is not currently supported for Trino.

**Steps to complete:**

1. [Configure user in Trino](#configure-user-in-trino)
2. [Create schema for Datafold](#create-schema-for-datafold)
3. [Configure your data connection in Datafold](#configure-in-datafold)

## Configure user in Trino

To connect to Trino, create a user with read-only access to all data sources you wish to diff. Datafold also requires a schema set up with read/write permissions within one of the catalogs.

## Create schema for Datafold

Datafold uses a temporary schema to materialize scratch work and keep data processing in your warehouse.

## Configure in Datafold

| Field Name                  | Description                                                   |
| --------------------------- | ------------------------------------------------------------- |
| Connection name             | A name given to the data connection within Datafold.          |
| Host                        | The hostname for your Trino instance.                         |
| Port                        | Trino endpoint port; the default value is 443.                |
| Encryption                  | Should be checked; it may be unchecked for local deployments. |
| User ID                     | User ID as created in Trino.                                  |
| Password                    | Password, as created in Trino.                                 |
| Schema for temporary tables | Use the `<catalog>.<schema>` format.                           |

Click **Create**. Your data connection is now ready!

---

# Source: https://docs.datafold.com/api-reference/data-diffs/update-a-data-diff.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.datafold.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Update a data diff

## OpenAPI

````yaml patch /api/v1/datadiffs/{datadiff_id} openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page.
::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/datadiffs/{datadiff_id}: patch: tags: - Data diffs summary: Update a data diff operationId: update_datadiff_api_v1_datadiffs__datadiff_id__patch parameters: - in: path name: datadiff_id required: true schema: title: Data diff id type: integer requestBody: content: application/json: schema: $ref: >- #/components/schemas/Body_update_datadiff_api_v1_datadiffs__datadiff_id__patch responses: '200': content: application/json: schema: {} description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: Body_update_datadiff_api_v1_datadiffs__datadiff_id__patch: properties: archived: anyOf: - type: boolean - type: 'null' title: Archived purged: anyOf: - type: boolean - type: 'null' title: Purged title: Body_update_datadiff_api_v1_datadiffs__datadiff_id__patch type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/bi/update-a-dbt-bi-integration.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Update a DBT BI integration > Returns the integration with changed fields. ## OpenAPI ````yaml put /api/v1/lineage/bi/dbt/{bi_datasource_id}/ openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/lineage/bi/dbt/{bi_datasource_id}/: put: tags: - BI - bi_modified summary: Update a DBT BI integration description: Returns the integration with changed fields. 
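# NOTE: an illustrative call (not part of the spec); the integration id 42, ci_config_id 7,
# and the name are placeholders, and DATAFOLD_API_KEY is assumed to be set:
#   curl -X PUT "https://app.datafold.com/api/v1/lineage/bi/dbt/42/" \
#     -H "Authorization: Key $DATAFOLD_API_KEY" \
#     -H "Content-Type: application/json" \
#     -d '{"ci_config_id": 7, "name": "dbt lineage"}'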
operationId: update_dbt_integration_api_v1_lineage_bi_dbt__bi_datasource_id___put parameters: - in: path name: bi_datasource_id required: true schema: title: Tableau integration id type: integer requestBody: content: application/json: schema: $ref: '#/components/schemas/DbtDataSourceConfig' required: true responses: '200': content: application/json: schema: {} description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: DbtDataSourceConfig: properties: ci_config_id: title: Ci Config Id type: integer indexing_cron: anyOf: - type: string - type: 'null' title: Indexing Cron name: anyOf: - type: string - type: 'null' title: Name required: - ci_config_id title: DbtDataSourceConfig type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/bi/update-a-hightouch-integration.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Update a Hightouch integration > It can only update the schedule. Returns the integration with changed fields. ## OpenAPI ````yaml put /api/v1/lineage/bi/hightouch/{bi_datasource_id}/ openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/lineage/bi/hightouch/{bi_datasource_id}/: put: tags: - BI - bi_modified summary: Update a Hightouch integration description: >- It can only update the schedule. Returns the integration with changed fields. 
operationId: >- change_hightouch_integration_api_v1_lineage_bi_hightouch__bi_datasource_id___put parameters: - in: path name: bi_datasource_id required: true schema: title: Hightouch integration id type: integer requestBody: content: application/json: schema: $ref: '#/components/schemas/HighTouchDataSourceConfig' required: true responses: '200': content: application/json: schema: {} description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: HighTouchDataSourceConfig: properties: bindings: items: $ref: '#/components/schemas/DataSourceBinding' title: Bindings type: array indexing_cron: anyOf: - type: string - type: 'null' title: Indexing Cron name: anyOf: - type: string - type: 'null' title: Name token: format: password title: Token type: string writeOnly: true workspace: anyOf: - type: string - type: 'null' title: Workspace required: - token - bindings title: HighTouchDataSourceConfig type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object DataSourceBinding: properties: boundIds: items: type: integer title: Boundids type: array remoteId: title: Remoteid type: string required: - remoteId - boundIds title: DataSourceBinding type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/bi/update-a-looker-integration.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Update a Looker integration > It can only update the schedule. Returns the integration with changed fields. ## OpenAPI ````yaml put /api/v1/lineage/bi/looker/{bi_datasource_id}/ openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/lineage/bi/looker/{bi_datasource_id}/: put: tags: - BI - bi_modified summary: Update a Looker integration description: >- It can only update the schedule. Returns the integration with changed fields. 
operationId: >- update_looker_integration_api_v1_lineage_bi_looker__bi_datasource_id___put parameters: - in: path name: bi_datasource_id required: true schema: title: Looker integration id type: integer requestBody: content: application/json: schema: $ref: '#/components/schemas/LookerDataSourceConfig' required: true responses: '200': content: application/json: schema: {} description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: LookerDataSourceConfig: properties: base_url: title: Base Url type: string bindings: default: [] items: $ref: '#/components/schemas/DataSourceBinding' title: Bindings type: array client_id: title: Client Id type: string client_secret: format: password title: Client Secret type: string writeOnly: true indexing_cron: anyOf: - type: string - type: 'null' title: Indexing Cron name: anyOf: - type: string - type: 'null' title: Name project_ids: default: [] items: type: string title: Project Ids type: array repo_id: title: Repo Id type: integer required: - base_url - client_id - repo_id - client_secret title: LookerDataSourceConfig type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object DataSourceBinding: properties: boundIds: items: type: integer title: Boundids type: array remoteId: title: Remoteid type: string required: - remoteId - boundIds title: DataSourceBinding type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/bi/update-a-mode-analytics-integration.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Update a Mode Analytics integration > It can only update the schedule. Returns the integration with changed fields. ## OpenAPI ````yaml put /api/v1/lineage/bi/mode/{bi_datasource_id}/ openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/lineage/bi/mode/{bi_datasource_id}/: put: tags: - BI - bi_modified summary: Update a Mode Analytics integration description: >- It can only update the schedule. Returns the integration with changed fields. 
operationId: change_mode_integration_api_v1_lineage_bi_mode__bi_datasource_id___put parameters: - in: path name: bi_datasource_id required: true schema: title: Mode integration id type: integer requestBody: content: application/json: schema: $ref: '#/components/schemas/ModeDataSourceConfig' required: true responses: '200': content: application/json: schema: {} description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: ModeDataSourceConfig: properties: indexing_cron: anyOf: - type: string - type: 'null' title: Indexing Cron name: anyOf: - type: string - type: 'null' title: Name password: format: password title: Password type: string writeOnly: true token: format: password title: Token type: string writeOnly: true workspace: default: '' title: Workspace type: string required: - token - password title: ModeDataSourceConfig type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/monitors/update-a-monitor.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Update a Monitor ## OpenAPI ````yaml openapi-public.json patch /api/v1/monitors/{id}/update openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/monitors/{id}/update: patch: tags: - Monitors summary: Update a Monitor operationId: update_monitor_api_v1_monitors__id__update_patch parameters: - description: The unique identifier of the monitor. in: path name: id required: true schema: description: The unique identifier of the monitor. 
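# NOTE: an illustrative call (not part of the spec); monitor id 42 and the new name are
# placeholders, and DATAFOLD_API_KEY is assumed to be set. The request body is discriminated
# by "type" (diff, metric, schema, or test); this example renames a diff monitor:
#   curl -X PATCH "https://app.datafold.com/api/v1/monitors/42/update" \
#     -H "Authorization: Key $DATAFOLD_API_KEY" \
#     -H "Content-Type: application/json" \
#     -d '{"type": "diff", "name": "orders diff monitor"}'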
title: Id type: integer requestBody: content: application/json: schema: discriminator: mapping: diff: '#/components/schemas/DataDiffUpdateMonitorSpecPublic' metric: '#/components/schemas/MetricUpdateMonitorSpecPublic' schema: '#/components/schemas/SchemaChangeUpdateMonitorSpecPublic' test: '#/components/schemas/DataTestUpdateMonitorSpecPublic' propertyName: type oneOf: - $ref: '#/components/schemas/DataDiffUpdateMonitorSpecPublic' - $ref: '#/components/schemas/MetricUpdateMonitorSpecPublic' - $ref: '#/components/schemas/SchemaChangeUpdateMonitorSpecPublic' - $ref: '#/components/schemas/DataTestUpdateMonitorSpecPublic' required: true responses: '200': content: application/json: schema: {} description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: DataDiffUpdateMonitorSpecPublic: properties: alert: anyOf: - $ref: >- #/components/schemas/datafold__monitors__schemas__DiffAlertCondition - type: 'null' description: Condition for triggering alerts based on the data diff. datadiff: anyOf: - discriminator: mapping: indb: '#/components/schemas/InDbUpdateDataDiffConfig' inmem: '#/components/schemas/InMemUpdateDataDiffConfig' propertyName: diff_type oneOf: - $ref: '#/components/schemas/InDbUpdateDataDiffConfig' - $ref: '#/components/schemas/InMemUpdateDataDiffConfig' - type: 'null' description: Configuration for the data diff. description: anyOf: - type: string - type: 'null' description: The description of the monitor. title: Description name: anyOf: - type: string - type: 'null' description: The name of the monitor. title: Name notifications: description: Notification configuration for the monitor. items: discriminator: mapping: email: '#/components/schemas/EmailNotification' pagerduty: '#/components/schemas/PagerDutyNotification' slack: '#/components/schemas/SlackNotification' teams: '#/components/schemas/TeamsNotification' webhook: '#/components/schemas/WebhookNotification' propertyName: type oneOf: - $ref: '#/components/schemas/EmailNotification' - $ref: '#/components/schemas/PagerDutyNotification' - $ref: '#/components/schemas/WebhookNotification' - $ref: '#/components/schemas/SlackNotification' - $ref: '#/components/schemas/TeamsNotification' title: Notifications type: array schedule: anyOf: - $ref: '#/components/schemas/IntervalSchedule' - $ref: '#/components/schemas/CronSchedule' - $ref: '#/components/schemas/NoneSchedule' - type: 'null' description: The schedule at which the monitor runs. tags: description: Tags associated with the monitor. items: type: string title: Tags type: array type: const: diff description: The type of monitor. title: Type type: string required: - type title: Diff type: object MetricUpdateMonitorSpecPublic: properties: alert: anyOf: - discriminator: mapping: absolute: '#/components/schemas/AbsoluteThreshold' automatic: '#/components/schemas/AnomalyDetectionThreshold' percentage: '#/components/schemas/PercentageThreshold' propertyName: type oneOf: - $ref: '#/components/schemas/AnomalyDetectionThreshold' - $ref: '#/components/schemas/AbsoluteThreshold' - $ref: '#/components/schemas/PercentageThreshold' - type: 'null' description: Condition for triggering alerts. description: anyOf: - type: string - type: 'null' description: The description of the monitor. title: Description name: anyOf: - type: string - type: 'null' description: The name of the monitor. title: Name notifications: description: Notification configuration for the monitor. 
items: discriminator: mapping: email: '#/components/schemas/EmailNotification' pagerduty: '#/components/schemas/PagerDutyNotification' slack: '#/components/schemas/SlackNotification' teams: '#/components/schemas/TeamsNotification' webhook: '#/components/schemas/WebhookNotification' propertyName: type oneOf: - $ref: '#/components/schemas/EmailNotification' - $ref: '#/components/schemas/PagerDutyNotification' - $ref: '#/components/schemas/WebhookNotification' - $ref: '#/components/schemas/SlackNotification' - $ref: '#/components/schemas/TeamsNotification' title: Notifications type: array schedule: anyOf: - $ref: '#/components/schemas/IntervalSchedule' - $ref: '#/components/schemas/CronSchedule' - $ref: '#/components/schemas/NoneSchedule' - type: 'null' description: The schedule at which the monitor runs. tags: description: Tags associated with the monitor. items: type: string title: Tags type: array type: const: metric description: The type of monitor. title: Type type: string required: - type title: Metric type: object SchemaChangeUpdateMonitorSpecPublic: properties: description: anyOf: - type: string - type: 'null' description: The description of the monitor. title: Description name: anyOf: - type: string - type: 'null' description: The name of the monitor. title: Name notifications: description: Notification configuration for the monitor. items: discriminator: mapping: email: '#/components/schemas/EmailNotification' pagerduty: '#/components/schemas/PagerDutyNotification' slack: '#/components/schemas/SlackNotification' teams: '#/components/schemas/TeamsNotification' webhook: '#/components/schemas/WebhookNotification' propertyName: type oneOf: - $ref: '#/components/schemas/EmailNotification' - $ref: '#/components/schemas/PagerDutyNotification' - $ref: '#/components/schemas/WebhookNotification' - $ref: '#/components/schemas/SlackNotification' - $ref: '#/components/schemas/TeamsNotification' title: Notifications type: array schedule: anyOf: - $ref: '#/components/schemas/IntervalSchedule' - $ref: '#/components/schemas/CronSchedule' - $ref: '#/components/schemas/NoneSchedule' - type: 'null' description: The schedule at which the monitor runs. tags: description: Tags associated with the monitor. items: type: string title: Tags type: array type: const: schema description: The type of monitor. title: Type type: string required: - type title: Schema type: object DataTestUpdateMonitorSpecPublic: properties: description: anyOf: - type: string - type: 'null' description: The description of the monitor. title: Description name: anyOf: - type: string - type: 'null' description: The name of the monitor. title: Name notifications: description: Notification configuration for the monitor. items: discriminator: mapping: email: '#/components/schemas/EmailNotification' pagerduty: '#/components/schemas/PagerDutyNotification' slack: '#/components/schemas/SlackNotification' teams: '#/components/schemas/TeamsNotification' webhook: '#/components/schemas/WebhookNotification' propertyName: type oneOf: - $ref: '#/components/schemas/EmailNotification' - $ref: '#/components/schemas/PagerDutyNotification' - $ref: '#/components/schemas/WebhookNotification' - $ref: '#/components/schemas/SlackNotification' - $ref: '#/components/schemas/TeamsNotification' title: Notifications type: array query: anyOf: - type: string - type: 'null' description: The SQL query to be evaluated. 
title: Query schedule: anyOf: - $ref: '#/components/schemas/IntervalSchedule' - $ref: '#/components/schemas/CronSchedule' - $ref: '#/components/schemas/NoneSchedule' - type: 'null' description: The schedule at which the monitor runs. tags: description: Tags associated with the monitor. items: type: string title: Tags type: array test: anyOf: - $ref: '#/components/schemas/StandardDataTestMonitorSpec' - type: 'null' type: const: test description: The type of monitor. title: Type type: string required: - type title: Data type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object datafold__monitors__schemas__DiffAlertCondition: properties: different_rows_count: anyOf: - type: integer - type: 'null' description: >- Threshold for the number of different rows allowed between the datasets. title: Different Rows Count different_rows_percent: anyOf: - type: integer - type: 'null' description: >- Threshold for the percentage of different rows allowed between the datasets. title: Different Rows Percent title: Diff Conditions type: object InDbUpdateDataDiffConfig: properties: column_remapping: anyOf: - additionalProperties: type: string type: object - type: 'null' description: Mapping of columns from one dataset to another for comparison. title: Column Remapping columns_to_compare: anyOf: - items: type: string type: array - type: 'null' description: Optional list of columns to compare between the datasets. title: Columns To Compare ignore_string_case: anyOf: - type: boolean - type: 'null' description: Indicates whether to ignore case differences in string comparisons. title: Ignore String Case materialize_results: anyOf: - type: boolean - type: 'null' description: Indicates whether to materialize the results of the comparison. title: Materialize Results primary_key: description: List of columns that make up the primary key for the datasets. items: type: string title: Primary Key type: array sampling: anyOf: - $ref: '#/components/schemas/ToleranceBasedSampling' - $ref: '#/components/schemas/PercentageSampling' - $ref: '#/components/schemas/MaxRowsSampling' - type: 'null' description: Sampling configuration for the data comparison. timeseries_dimension_column: anyOf: - type: string - type: 'null' description: Column used for time series dimensioning in the comparison. title: Timeseries Dimension Column tolerance: anyOf: - $ref: '#/components/schemas/DataDiffToleranceConfig' - type: 'null' description: Configuration for tolerance applied to FLOAT columns. title: In-Database type: object InMemUpdateDataDiffConfig: properties: column_remapping: anyOf: - additionalProperties: type: string type: object - type: 'null' description: Mapping of columns from one dataset to another for comparison. title: Column Remapping columns_to_compare: anyOf: - items: type: string type: array - type: 'null' description: Optional list of columns to compare between the datasets. title: Columns To Compare ignore_string_case: default: false description: Indicates whether to ignore case differences in string comparisons. title: Ignore String Case type: boolean materialize_results: anyOf: - type: boolean - type: 'null' description: Indicates whether to materialize the results of the comparison. title: Materialize Results materialize_results_to: anyOf: - type: integer - type: 'null' description: Identifier for the destination where results should be materialized. 
title: Materialize Results To primary_key: description: List of columns that make up the primary key for the datasets. items: type: string title: Primary Key type: array sampling: anyOf: - $ref: '#/components/schemas/ToleranceBasedSampling' - $ref: '#/components/schemas/PercentageSampling' - $ref: '#/components/schemas/MaxRowsSampling' - type: 'null' description: Sampling configuration for the data comparison. title: Sampling tolerance: anyOf: - $ref: '#/components/schemas/DataDiffToleranceConfig' - type: 'null' description: Configuration for tolerance. title: In-Memory type: object EmailNotification: properties: features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features recipients: description: A list of email addresses to receive the notification. items: type: string title: Recipients type: array type: const: email default: email title: Type type: string required: - recipients title: Email type: object PagerDutyNotification: properties: features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features integration: description: The identifier for the integration. title: Integration type: integer type: const: pagerduty default: pagerduty title: Type type: string required: - integration title: PagerDuty type: object WebhookNotification: properties: features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features integration: description: The identifier for the integration. title: Integration type: integer type: const: webhook default: webhook title: Type type: string required: - integration title: Webhook type: object SlackNotification: properties: channel: description: The channel through which the notification will be sent. title: Channel type: string features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features integration: description: The identifier for the integration. title: Integration type: integer mentions: description: A list of mentions to include in the notification. items: type: string title: Mentions type: array type: const: slack default: slack title: Type type: string required: - integration - channel title: Slack type: object TeamsNotification: properties: channel: description: The channel through which the notification will be sent. title: Channel type: string features: anyOf: - items: $ref: '#/components/schemas/DestinationFeatures' type: array - type: 'null' description: A list of features to enable for this notification. title: Features integration: description: The identifier for the integration. title: Integration type: integer mentions: description: A list of mentions names to include in the notification. items: type: string title: Mentions type: array type: const: teams default: teams title: Type type: string required: - integration - channel title: Teams type: object IntervalSchedule: properties: interval: anyOf: - $ref: '#/components/schemas/HourIntervalSchedule' - $ref: '#/components/schemas/DayIntervalSchedule' description: Specifies the scheduling interval. 
required: - interval title: Interval type: object CronSchedule: properties: cron: description: The cron expression that defines the schedule. title: Cron type: string type: const: crontab default: crontab title: Type type: string required: - cron title: Cron type: object NoneSchedule: properties: type: const: none default: none title: Type type: string title: None type: object AnomalyDetectionThreshold: properties: sensitivity: description: Sensitivity level for anomaly detection, ranging from 0 to 100. maximum: 100 minimum: 0 title: Sensitivity type: integer type: const: automatic title: Type type: string required: - type - sensitivity title: Anomaly Detection type: object AbsoluteThreshold: properties: max: anyOf: - type: number - type: 'null' description: Maximum value for the absolute threshold. title: Max min: anyOf: - type: number - type: 'null' description: Minimum value for the absolute threshold. title: Min type: const: absolute title: Type type: string required: - type title: Absolute type: object PercentageThreshold: properties: decrease: anyOf: - type: number - type: integer - type: 'null' description: Threshold for allowable percentage decrease. title: Decrease increase: anyOf: - type: number - type: integer - type: 'null' description: Threshold for allowable percentage increase. title: Increase type: const: percentage title: Type type: string required: - type title: Percentage type: object StandardDataTestMonitorSpec: properties: tables: anyOf: - items: $ref: '#/components/schemas/SDTTable' type: array - type: 'null' title: Tables type: $ref: '#/components/schemas/StandardDataTestTypes' variables: anyOf: - additionalProperties: $ref: '#/components/schemas/SDTVariable' type: object - type: 'null' title: Variables required: - type title: Standard DT type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object ToleranceBasedSampling: properties: confidence: description: The confidence level for the sampling results. title: Confidence type: number threshold: anyOf: - type: integer - type: 'null' description: Threshold for triggering actions based on sampling. title: Threshold tolerance: description: The allowable margin of error for sampling. title: Tolerance type: number required: - tolerance - confidence title: Tolerance type: object PercentageSampling: properties: rate: description: The sampling rate as a percentage. title: Rate type: number threshold: anyOf: - type: integer - type: 'null' description: Threshold for triggering actions based on sampling. title: Threshold required: - rate title: Percentage type: object MaxRowsSampling: properties: max_rows: description: The maximum number of rows to sample. title: Max Rows type: integer threshold: anyOf: - type: integer - type: 'null' description: Threshold for triggering actions based on sampling. title: Threshold required: - max_rows title: MaxRows type: object DataDiffToleranceConfig: properties: float: anyOf: - $ref: '#/components/schemas/ColumnToleranceConfig' - type: 'null' description: Configuration for float columns tolerance. 
title: DataDiffToleranceConfig type: object DestinationFeatures: enum: - attach_csv - notify_first_triggered_only - disable_recovery_notifications - notify_every_run title: DestinationFeatures type: string HourIntervalSchedule: properties: every: const: hour title: Every type: string type: const: hourly default: hourly title: Type type: string required: - every title: Hour type: object DayIntervalSchedule: properties: every: const: day title: Every type: string hour: anyOf: - type: integer - type: 'null' description: The hour at which the monitor should trigger. (0 - 23) title: Hour type: const: daily default: daily title: Type type: string utc_at: anyOf: - format: time type: string - type: 'null' description: The UTC time at which the monitor should trigger. title: Utc At required: - every title: Day type: object SDTTable: properties: columns: items: type: string title: Columns type: array path: title: Path type: string required: - path - columns title: SDTTable type: object StandardDataTestTypes: enum: - unique - not_null - accepted_values - referential_integrity - numeric_range - custom_template title: StandardDataTestTypes type: string SDTVariable: properties: quote: default: true title: Quote type: boolean value: anyOf: - type: string - type: integer - type: number - items: type: string type: array - items: type: integer type: array - items: type: number type: array - items: anyOf: - type: string - type: integer - type: number type: array title: Value required: - value title: SDTVariable type: object ColumnToleranceConfig: properties: column_tolerance: anyOf: - additionalProperties: discriminator: mapping: absolute: '#/components/schemas/AbsoluteColumnTolerance' relative: '#/components/schemas/RelativeColumnTolerance' propertyName: type oneOf: - $ref: '#/components/schemas/RelativeColumnTolerance' - $ref: '#/components/schemas/AbsoluteColumnTolerance' type: object - type: 'null' description: Specific tolerance per column. title: Column Tolerance default: anyOf: - discriminator: mapping: absolute: '#/components/schemas/AbsoluteColumnTolerance' relative: '#/components/schemas/RelativeColumnTolerance' propertyName: type oneOf: - $ref: '#/components/schemas/RelativeColumnTolerance' - $ref: '#/components/schemas/AbsoluteColumnTolerance' - type: 'null' description: Default tolerance applied to all columns. title: Default title: ColumnToleranceConfig type: object RelativeColumnTolerance: properties: type: const: relative default: relative description: The type of Column Tolerance. title: Type type: string value: anyOf: - type: number - type: integer description: Value of Column Tolerance. title: Value required: - value title: Relative type: object AbsoluteColumnTolerance: properties: type: const: absolute default: absolute description: The type of Column Tolerance. title: Type type: string value: description: Value of Column Tolerance. title: Value type: number required: - value title: Absolute type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/bi/update-a-power-bi-integration.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Update a Power BI integration > Updates the integration configuration. Returns the integration with changed fields. 
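A minimal sketch of an update call, assuming the integration uses a service principal; the integration id `42`, the GUIDs, and the secret below are placeholders, and `DATAFOLD_API_KEY` is assumed to be set:

```bash theme={null}
# Update the Power BI integration's credentials (illustrative values only)
curl -X PUT "https://app.datafold.com/api/v1/lineage/bi/powerbi/42/" \
  -H "Authorization: Key $DATAFOLD_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "auth_type": "service_principal",
        "client_id": "00000000-0000-0000-0000-000000000000",
        "client_secret": "<client-secret>",
        "tenant_id": "00000000-0000-0000-0000-000000000000"
      }'
```

All fields in the schema are optional, so a request may also update just a subset, for example only `name` or `indexing_cron`.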
## OpenAPI ````yaml openapi-public.json put /api/v1/lineage/bi/powerbi/{bi_datasource_id}/ openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/lineage/bi/powerbi/{bi_datasource_id}/: put: tags: - BI - BI - bi_modified summary: Update a Power BI integration description: >- Updates the integration configuration. Returns the integration with changed fields. operationId: >- update_powerbi_integration_api_v1_lineage_bi_powerbi__bi_datasource_id___put parameters: - in: path name: bi_datasource_id required: true schema: title: Power BI integration id type: integer requestBody: content: application/json: schema: $ref: '#/components/schemas/PowerBIDataSourceConfig' required: true responses: '200': content: application/json: schema: {} description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: PowerBIDataSourceConfig: description: Power BI data source parameters. properties: auth_type: anyOf: - $ref: '#/components/schemas/PowerBIAuthType' - type: 'null' client_id: anyOf: - type: string - type: 'null' title: Client Id client_secret: anyOf: - format: password type: string writeOnly: true - type: 'null' title: Client Secret indexing_cron: anyOf: - type: string - type: 'null' title: Indexing Cron name: anyOf: - type: string - type: 'null' title: Name tenant_id: anyOf: - type: string - type: 'null' title: Tenant Id title: PowerBIDataSourceConfig type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object PowerBIAuthType: enum: - delegated - service_principal title: PowerBIAuthType type: string ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/bi/update-a-tableau-integration.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Update a Tableau integration > It can only update the schedule. Returns the integration with changed fields. ## OpenAPI ````yaml put /api/v1/lineage/bi/tableau/{bi_datasource_id}/ openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). 
:::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. ::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/lineage/bi/tableau/{bi_datasource_id}/: put: tags: - BI - bi_modified summary: Update a Tableau integration description: >- It can only update the schedule. Returns the integration with changed fields. operationId: >- update_tableau_integration_api_v1_lineage_bi_tableau__bi_datasource_id___put parameters: - in: path name: bi_datasource_id required: true schema: title: Tableau integration id type: integer requestBody: content: application/json: schema: $ref: '#/components/schemas/TableauDataSourceConfig' required: true responses: '200': content: application/json: schema: {} description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: TableauDataSourceConfig: properties: indexing_cron: anyOf: - type: string - type: 'null' title: Indexing Cron name: anyOf: - type: string - type: 'null' title: Name server_url: title: Server Url type: string site_id: title: Site Id type: string token_name: title: Token Name type: string token_value: format: password title: Token Value type: string writeOnly: true required: - token_name - token_value - site_id - server_url title: TableauDataSourceConfig type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/api-reference/ci/upload-prmr-changes.md > ## Documentation Index > Fetch the complete documentation index at: https://docs.datafold.com/llms.txt > Use this file to discover all available pages before exploring further. # Upload PR/MR changes ## OpenAPI ````yaml post /api/v1/ci/{ci_config_id}/{pr_num} openapi: 3.1.0 info: contact: email: support@datafold.com name: API Support description: >- The Datafold API reference is a guide to our available endpoints and authentication methods. If you're just getting started with Datafold, we recommend first checking out our [documentation](https://docs.datafold.com). :::info To use the Datafold API, you should first create a Datafold API Key, which should be stored as a local environment variable named DATAFOLD_API_KEY. This can be set in your Datafold Cloud's Settings under the Account page. 
::: title: Datafold API version: latest servers: - description: Default server url: https://app.datafold.com security: - ApiKeyAuth: [] paths: /api/v1/ci/{ci_config_id}/{pr_num}: post: tags: - CI summary: Upload PR/MR changes operationId: upload_changes_api_v1_ci__ci_config_id___pr_num__post parameters: - in: path name: ci_config_id required: true schema: title: CI config id type: integer - in: path name: pr_num required: true schema: title: Pull request/Merge request number type: integer requestBody: content: application/json: schema: items: $ref: '#/components/schemas/CiDiff' title: Diffs type: array required: true responses: '200': content: application/json: schema: $ref: '#/components/schemas/SubmitCiJob' description: Successful Response '422': content: application/json: schema: $ref: '#/components/schemas/HTTPValidationError' description: Validation Error components: schemas: CiDiff: properties: exclude_columns: default: [] items: type: string title: Exclude Columns type: array include_columns: default: [] items: type: string title: Include Columns type: array pk: anyOf: - items: type: string type: array uniqueItems: true - type: 'null' title: Pk pr: title: Pr type: string prod: title: Prod type: string required: - prod - pr title: CiDiff type: object SubmitCiJob: properties: ci_run_id: anyOf: - type: integer - type: 'null' title: Ci Run Id run_id: anyOf: - type: integer - type: 'null' title: Run Id title: SubmitCiJob type: object HTTPValidationError: properties: detail: items: $ref: '#/components/schemas/ValidationError' title: Detail type: array title: HTTPValidationError type: object ValidationError: properties: loc: items: anyOf: - type: string - type: integer title: Location type: array msg: title: Message type: string type: title: Error Type type: string required: - loc - msg - type title: ValidationError type: object securitySchemes: ApiKeyAuth: description: Use the 'Authorization' header with the format 'Key ' in: header name: Authorization type: apiKey ```` --- # Source: https://docs.datafold.com/security/user-roles-and-permissions.md # User Roles and Permissions > Datafold uses role-based access control to manage user permissions and actions. Datafold has three default roles: | Role | Description | Permissions | | -------- | -------------- | ---------------------------------------------------------------------------------------------------- | | default | Full user role | Create and modify monitors, create diffs, explore data and lineage | | admin | Administrator | Default permissions plus the ability to manage users and configurations such as database connections | | viewonly | View-only role | View diffs and monitors without the ability to create or modify them | --- # Source: https://docs.datafold.com/integrations/databases/vertica.md # OpenText Analytics Database (Vertica) **INFO** Column-level Lineage is not supported for Vertica. **Steps to complete:** 1. [Run SQL script and create schema for Datafold](/integrations/databases/vertica#run-sql-script-and-create-schema-for-datafold) 2. [Configure your data connection in Datafold](/integrations/databases/vertica#configure-in-datafold) ## Run SQL script and create schema for Datafold To connect to Vertica, you need to create a user with read-only access to all tables in all schemas, write access to Datafold-specific schema for temporary tables: ```Bash theme={null} /* Datafold utilizes a temporary dataset to materialize scratch work and keep data processing in your warehouse. 
*/

CREATE SCHEMA datafold_tmp;

/* Create a Datafold user */
CREATE USER datafold IDENTIFIED BY 'SOMESECUREPASSWORD';

/* Give the datafold user write access to the temporary schema */
GRANT ALL ON SCHEMA datafold_tmp TO datafold;

/* Make sure that the user has read permissions on the tables.
   Replace your_schema with each schema Datafold should be able to read. */
GRANT USAGE ON SCHEMA your_schema TO datafold;
GRANT SELECT ON ALL TABLES IN SCHEMA your_schema TO datafold;
```

Datafold utilizes a temporary schema, named `datafold_tmp` in the above script, to materialize scratch work and keep data processing in your warehouse.

### Configure in Datafold

| Field Name    | Description                                                                |
| ------------- | -------------------------------------------------------------------------- |
| Name          | A name given to the data connection within Datafold                        |
| Host          | The hostname address for your database; default value is 127.0.0.1         |
| Port          | The Vertica connection port; default value is 5433                         |
| User          | The user created in the SQL script; `datafold`                             |
| Password      | The password created in the SQL permissions script                         |
| Database Name | The name of the Vertica database you want to connect to; default is VMart  |

Click **Create**. Your data connection is ready!

---

# Source: https://docs.datafold.com/welcome.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.datafold.com/llms.txt
> Use this file to discover all available pages before exploring further.

# Welcome

> Datafold is the unified platform for proactive data quality that combines automated data testing, data reconciliation, and observability to help data teams prevent data quality issues and accelerate their development velocity.

## Why Datafold?

Datafold automates the most error-prone and time-consuming aspects of the data engineering workflow by **preventing and detecting data quality issues**. In addition to standard observability features like monitoring, profiling, and lineage, we integrate deeply into the development cycle with automated CI/CD testing. This enables data teams to prevent bad code deployments and detect issues upstream of the data warehouse.

Whether it's for [CI/CD testing](deployment-testing/how-it-works) or [data migration automation](data-migration-automation), Datafold ensures data quality at every stage of the data pipeline.

## Key features

Data quality is a complex and multifaceted problem. Datafold’s unified platform helps embed proactive data quality testing in your workflows:

* Use value-level data diffs to isolate and identify changes in your data. Catch unintended modifications before they disrupt production or downstream data usage.
* Create monitors for data diffs, data quality metrics, SQL metrics, SQL rules, and schema changes to send alerts when inconsistencies are detected.
* Discover how DMA provides full-cycle migration automation with SQL code translation and cross-database validation.
* Learn how your data assets move and change across systems with column-level lineage, metadata, and profiles, to track the impacts of changes made upstream.

## Use cases

* Catch data quality issues early with automated testing during development and deployment.
* Speed up migrations with our full-cycle migration automation solution for data teams.
* Shift monitoring upstream to proactively prevent disruptions and ensure data quality.
## Getting started

There are a few ways to get started with your first data diff:

* Once you’ve integrated a [data connection](/integrations) and [code repository](/integrations/code-repositories), you can run a new [in-database](/data-diff/in-database-diffing/creating-a-new-data-diff) or [cross-database](/data-diff/cross-database-diffing/creating-a-new-data-diff) data diff or explore your [data lineage](data-explorer/lineage).
* Create [monitors](data-monitoring/monitor-types) to send alerts when data diffs fall outside predefined ranges.
* Get started with deployment testing through our universal ([No-Code](deployment-testing/getting-started/universal/no-code), [API](deployment-testing/getting-started/universal/api)) or [dbt](integrations/orchestrators/dbt-core) integrations.

## Learn more

Curious to learn more about why and how data quality matters? We wrote a whole guide (with illustrations of medieval castles, moats, and knights) called the [Data Quality Guide](https://www.datafold.com/data-quality-guide) which covers:

* A practical roadmap towards creating a robust data quality system
* Data quality metrics to keep, and metrics to ignore
* Nurturing a strong data quality culture within and beyond data teams

---

# Source: https://docs.datafold.com/data-diff/what-is-data-diff.md

> ## Documentation Index
> Fetch the complete documentation index at: https://docs.datafold.com/llms.txt
> Use this file to discover all available pages before exploring further.

# What's a Data Diff?

> A data diff is the value-level comparison between two tables, used to identify critical changes to your data and guarantee data quality.

When you **git diff** your code, you’re comparing two versions of your code files to see what has changed, such as lines added, removed, or modified. Similarly, a **data diff** compares two versions of a dataset or two databases to identify differences in individual cells in the data.

## Why do I need to diff data?

Just as diffing code and text is fundamental to software engineering and working with text documents, diffing data is essential to the data engineering workflow. Why? In data engineering, both data and the code that processes it are constantly evolving. Without the ability to easily diff data, understanding and tracking data changes becomes challenging. This slows down the development process and makes it harder to ensure data quality.

There is a lot you can do with data diffs:

* Test SQL code by comparing development or staging environment data to production
* Compare tables in source and target systems to identify discrepancies when migrating data between databases
* Detect value-level outliers, or unexpected changes, in data flowing through your ETL/ELT pipelines
* Verify that reports generated for regulatory compliance accurately reflect the underlying data by comparing report outputs with source data

## Why Datafold?

Data diffing is a fundamental capability in data engineering that every engineer should have access to. Datafold's [Data Diff](https://www.datafold.com/data-diff) compares datasets fast, within or across databases, and offers an enterprise-ready solution for doing so at scale: comprehensive, optimized, and automated diffing, API access, and secure deployment options.
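To make cell-level comparison concrete, here is a toy illustration of what a data diff computes: rows added, rows removed, and changed values, keyed on a primary key. This is plain Python over in-memory rows, not how Datafold is implemented; Datafold performs this kind of comparison at warehouse scale, within or across databases.

```python
# Two versions of the same table, keyed by primary key.
prod = {
    1: {"email": "a@example.com", "status": "active"},
    2: {"email": "b@example.com", "status": "active"},
}
dev = {
    1: {"email": "a@example.com", "status": "inactive"},  # value changed
    3: {"email": "c@example.com", "status": "active"},    # row added
}

# Rows present in only one version.
print("removed:", sorted(prod.keys() - dev.keys()))
print("added:  ", sorted(dev.keys() - prod.keys()))

# Cell-level differences for rows present in both versions.
for pk in sorted(prod.keys() & dev.keys()):
    for column, before in prod[pk].items():
        after = dev[pk].get(column)
        if before != after:
            print(f"pk={pk} column={column}: {before!r} -> {after!r}")
```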
Datafold provides end-to-end solutions for automating testing, including column-level lineage, ML-based anomaly detection, and enterprise-scale infrastructure support. It caters to complex and production-ready scenarios, including:

* Automated and collaborative diffing and testing for data transformations in CI
* Data diffing informed by column-level lineage, and validation of code changes with visibility into BI applications
* Validating large data migrations or continuous replications with automated cross-database diffing capabilities

Here's a high-level overview of what Datafold offers:

| Feature Category | Datafold |
| :--- | :--- |
| **Database Support** <br /> *Databases that are supported for source-destination diff* | Any SQL database, inquire about specific support |
| **Scale** <br /> *Size of datasets supported for diffing* | Unlimited with advanced performance optimization |
| **Primary Key Data Type Support** <br /> *Data types of primary keys that are supported for diffing* | Numerical, string, datetime, boolean, composite |
| **Data Types Diffing Support** <br /> *Data types that are supported for per-column diffing* | All data types |
| **Export Diff Results to Database** <br /> *Materialize diffing results in your database of choice* | Supported |
| **Value-level diffs** <br /> *Investigate row-by-row column value differences between source and destination databases* | Supported (JSON & GUI) |
| **Diff UI** <br /> *Explore diffs visually and easily share them with your team and stakeholders* | Supported |
| **API Access** <br /> *Automatically create diffs and receive results at scale using the Datafold REST API* | Supported |
| **Persisting Diff History** <br /> *Persist the result history of diffs to know how your data and diffs have changed over time* | Supported |
| **Scheduled Checks** <br /> *Run scheduled diffs for a defined list of tables* | Supported |
| **Alerting** <br /> *Receive automatic alerts about detected discrepancies between tables within or across databases* | Supported |
| **Security and Compliance** <br /> *Run diffs in secure and compliant environments* | HIPAA, SOC2 Type II, GDPR compliant |
| **Deployment Options** <br /> *Deploy your diffs in secure environments that meet your security standards* | Multi-tenant SaaS or Single-tenant in VPC |
| **Support** <br /> *Choose which channels offer the greatest support to your use cases and users* | Enterprise support from Datafold team members |
| **SLA** <br /> *The types of SLAs that exist to guarantee your team can diff and interact with diffs as expected* | Coming soon |

## Three ways to learn more

If you're new to Datafold or data diffing, here are three easy ways to get started:

1. **Explore our CI integration guides**: See how Datafold fits into your continuous integration (CI) pipeline by checking out our guides for [No-Code](../deployment-testing/getting-started/universal/no-code), [API](../deployment-testing/getting-started/universal/api), or [dbt](../integrations/orchestrators) integrations.
2. **Try it yourself**: Use your own data with our [14-day free trial](https://app.datafold.com/) and experience Datafold in action.
3. **Book a demo**: Get a deeper technical understanding of how Datafold integrates with your company’s data infrastructure by [booking a demo](https://www.datafold.com/booktime) with our team.
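If you go the API route, CI testing boils down to pairing the production and PR versions of each table and submitting them for diffing. As a minimal sketch of that flow, here is how you might call the Upload PR/MR changes endpoint documented earlier in this reference; the CI config ID, PR number, and table names below are placeholders, and the request body follows the `CiDiff` schema from that endpoint's OpenAPI definition.

```python
import os

import requests

# Placeholders: your CI config ID (from the Datafold app) and the PR/MR number.
ci_config_id = 42
pr_num = 101

# Each entry pairs a production table with its PR counterpart.
diffs = [
    {
        "prod": "analytics.prod.dim_orgs",
        "pr": "analytics.pr_101.dim_orgs",
        "pk": ["org_id"],
        "exclude_columns": ["updated_at"],
    }
]

response = requests.post(
    f"https://app.datafold.com/api/v1/ci/{ci_config_id}/{pr_num}",
    headers={"Authorization": f"Key {os.environ['DATAFOLD_API_KEY']}"},
    json=diffs,
    timeout=30,
)
response.raise_for_status()
print(response.json())  # e.g. {"ci_run_id": ..., "run_id": ...}
```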