A common type of integration is to let users view & interact with data from external systems. Sometimes this requires that you sync many records from the external API to your app.

Context

When syncing data in Nango, there are actually two syncing processes happening:
  • Nango syncing with the external system, managed by integration Functions
  • Your app syncing with Nango, managed by your app’s code using the Nango API
Your app syncs with Nango in the same way regardless of the dataset's size (see the implement a sync guide). Nango maintains a cache of the synced data, and the Nango API lets you fetch only the incremental changes. Below, we explore approaches to syncing data from the external system to Nango, even for large datasets.
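As a sketch of that app-side pattern, fetching incremental changes typically means looping with a cursor until no more pages remain, then persisting the cursor for the next poll. The `fetchPage` callback and the `nextCursor` field below are illustrative assumptions, not the exact Nango SDK API:

```typescript
// Sketch of the app-side pattern for pulling incremental changes from Nango.
// `fetchPage` stands in for a call to the Nango records API; its signature
// and the `nextCursor` field are assumptions for illustration.
type Page<T> = { records: T[]; nextCursor: string | null };

async function fetchAllChanges<T>(
  fetchPage: (cursor: string | null) => Promise<Page<T>>,
  startCursor: string | null = null
): Promise<{ records: T[]; cursor: string | null }> {
  const records: T[] = [];
  let cursor = startCursor;
  while (true) {
    const page = await fetchPage(cursor);
    records.push(...page.records);
    if (page.nextCursor === null) break; // no more changes for now
    cursor = page.nextCursor; // keep this cursor for the next poll
  }
  return { records, cursor };
}
```

Your app would store the returned cursor (e.g., in its database) and pass it back in on the next webhook-triggered fetch, so each poll only transfers the records that changed since the last one.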

Syncing without checkpoints (small datasets only)

For small datasets (e.g., a list of Slack users for organizations with fewer than 100 employees), you can instruct Nango to periodically poll for the entire dataset. Once the sync run finishes, Nango computes which records changed during that run and sends a webhook to your backend, so your app only has to fetch the changes from Nango. As datasets grow, syncs that fetch all data stop scaling: they take longer to run, trigger rate limiting from external APIs, and consume more compute and memory. Here is an example of a sync script that refreshes the entire list of Slack users on each execution:
export default createSync({
  exec: async (nango) => {
    // ...
    // Paginate API requests.
    while (nextPage) {
        const res = await getUsersFromSlackAPI(nextPage);
        // ...
        // Save Slack users.
        await nango.batchSave(mapUsers(res.data.members), 'User');
    }
  },
});

Syncing with checkpoints

The preferred method for syncing larger datasets is to use checkpoints to fetch only the changes from the external API since the last sync. Checkpoints let you save progress (e.g., a timestamp or cursor) and resume from there on the next run, so you only receive and persist the modified records in the Nango cache. For example, if you sync tens of thousands of contacts from a Salesforce account on an hourly basis, only a small portion of the contacts are updated or created in any given hour. Without checkpoints, you would fetch the entire contact list every hour, which is inefficient; with checkpoints, you fetch only the contacts modified in the past hour. Not all APIs support this, and even on APIs that do, not every endpoint lets you filter or sort by last-modified date; if the endpoint you need does not, you will have to sync without checkpoints. Here is an example of a sync script that updates the list of Salesforce contacts using checkpoints:
export default createSync({
  checkpoint: z.object({
    lastModifiedISO: z.string(),
  }),
  exec: async (nango) => {
    const checkpoint = await nango.getCheckpoint();
    // ...
    // Paginate API requests.
    while (nextPage) {
        // Pass in a timestamp to Salesforce to fetch only the recently modified data.
        const res = await getContactsFromSalesforceAPI(checkpoint?.lastModifiedISO, nextPage);
        // ...
        const contacts = mapContacts(res.data.records);
        // Save Salesforce contacts.
        await nango.batchSave(contacts, 'Contact');
        // Save the checkpoint with the last record's LastModifiedDate.
        const lastContact = contacts[contacts.length - 1];
        if (lastContact) {
            await nango.saveCheckpoint({ lastModifiedISO: lastContact.LastModifiedDate });
        }
    }
  },
});

Initial sync execution

Even with checkpoints, the very first sync execution has to fetch all data since there is no previous checkpoint. This initial sync fetches all historical data and is more resource-intensive than subsequent executions. One strategy to manage this is to limit the period you are backfilling. For example, if you are syncing a Notion workspace, you can inform users that you will only sync Notion pages modified in the last three months, assuming these are most relevant.
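One way to implement that cutoff is to fall back to a fixed lookback window when no checkpoint exists yet. The three-month window and the helper name below are illustrative choices, not a Nango convention:

```typescript
// Timestamp to sync from: the saved checkpoint if one exists, otherwise a
// fixed lookback window for the initial sync (~3 months here, an assumed policy).
const BACKFILL_MS = 90 * 24 * 60 * 60 * 1000;

function sinceTimestamp(
  checkpoint: { lastModifiedISO: string } | null,
  now: Date = new Date()
): string {
  if (checkpoint) {
    return checkpoint.lastModifiedISO;
  }
  return new Date(now.getTime() - BACKFILL_MS).toISOString();
}
```

Inside `exec`, you would pass `sinceTimestamp(await nango.getCheckpoint())` to the external API instead of the raw checkpoint, so the initial sync only backfills the window you promised your users.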

Avoiding memory overuse

Nango integration functions, which manage data syncing between external systems and Nango, run on customer-specific VMs with fixed resources. As a result, a sync can fail (e.g., the VM crashes) when it uses too much memory. The most common cause of excessive memory use is fetching a large number of records before saving them to the Nango cache, as shown below:
export default createSync({
  checkpoint: z.object({ lastModifiedISO: z.string() }),
  exec: async (nango) => {
    const checkpoint = await nango.getCheckpoint();
    // ...
    const responses: any[] = [];
    // Paginate API requests.
    while (nextPage) {
        const res = await getContactsFromSalesforceAPI(checkpoint?.lastModifiedISO, nextPage);
        // ...
        // Accumulate the entire dataset in memory (memory intensive).
        responses.push(...res.data.records);
    }
    // Save all Salesforce contacts at once.
    await nango.batchSave(mapContacts(responses), 'Contact');
  },
});
A simple fix is to store records as you fetch them, allowing them to be released from memory:
export default createSync({
  checkpoint: z.object({ lastModifiedISO: z.string() }),
  exec: async (nango) => {
    const checkpoint = await nango.getCheckpoint();
    // ...
    // Paginate API requests.
    while (nextPage) {
        // Overwriting `res` lets the previous page's results be garbage-collected.
        const res = await getContactsFromSalesforceAPI(checkpoint?.lastModifiedISO, nextPage);
        // ...
        const contacts = mapContacts(res.data.records);
        // Save Salesforce contacts after each page.
        await nango.batchSave(contacts, 'Contact');
        // Save the checkpoint after each page.
        const lastContact = contacts[contacts.length - 1];
        if (lastContact) {
            await nango.saveCheckpoint({ lastModifiedISO: lastContact.LastModifiedDate });
        }
    }
  },
});
Syncing a large dataset while holding all of it in memory will likely cause periodic VM crashes. You can monitor the memory consumption of your function executions in the Logs tab of the Nango UI.

Avoiding syncing unnecessary data

Another strategy for handling large datasets is to filter down to the data you need as early as possible, either using filters offered by the external API or by discarding data in the function, i.e., not saving it to the Nango cache. This approach uses the external system as the source of truth: you can sync additional data in the future by editing your Nango function and triggering a sync with reset: true to backfill any missing historical data. Because integration functions are flexible, you can perform transformations early in the sync process, optimizing resource use and speeding up syncing. You can also use customer-specific config to implement per-customer filters in your sync functions.
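A minimal sketch of in-function filtering: keep only the records your app needs before saving them. The contact shape and the "has an email" predicate below are hypothetical example policies, not anything Nango or Salesforce requires:

```typescript
// Hypothetical Salesforce contact shape; only `Email` matters for the filter.
type SalesforceContact = {
  Id: string;
  Email: string | null;
  LastModifiedDate: string;
};

// Discard records early so they are never saved to the Nango cache.
// The "has an email" predicate is an example policy, not a Nango requirement.
function keepRelevantContacts(records: SalesforceContact[]): SalesforceContact[] {
  return records.filter((record) => record.Email !== null);
}
```

Inside `exec`, you would call `nango.batchSave(keepRelevantContacts(contacts), 'Contact')` so that discarded records never consume cache storage or sync bandwidth.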
Questions, problems, feedback? Please reach out in the Slack community.