Guide

How To Build A Lightweight ETL Pipeline in Under An Hour

Zachary Kirby
Co-Founder
Published On
August 28, 2023

Extracting value from your Salesforce CRM data is imperative.

Salesforce does have good reporting capabilities, but by definition your data is siloed. You can look at your sales data, but not how it relates to your marketing analytics or product data. By extracting Salesforce data you can break down these data silos, enabling a unified view of your customer interactions.

That’s what we’re going to do today. Below is a step-by-step guide to setting up an ETL (Extract-Transform-Load) pipeline to extract data from Salesforce, transform it into business metrics, such as sales rep close value and sales cycle length, and load both the raw data and the transformed data into a database.

You can do this in under an hour. Probably a lot under. And by the end you’ll have immediate access to new versions of your Salesforce data for analysis.

What is an ELT/ETL Pipeline?

ETL stands for Extract, Transform, Load: you extract data from the systems where it lives (here, Salesforce), transform it into cleaned-up, aggregated business metrics, and load both the raw and transformed data into a central store for analysis. ELT is the same idea with the last two steps swapped, loading the raw data first and transforming it inside the warehouse.

Let’s go through the process then. Here, we are going to set up a simple ETL pipeline that you can get going in literally minutes. All you need is a Salesforce account with some CRM data in it, a Vessel API token with an authorized Salesforce connection, a Node.js environment to run the extraction and transformation code, and a Postgres database (or another store) to load the results into.
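As a rough setup sketch, the dependencies used in the rest of this guide can be installed in one go (Express is an assumption here; any HTTP framework that exposes the endpoint below will do):

npm install @vesselapi/client-sdk express pg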

The E in ETL: Getting your data out of Salesforce

This is obviously the first step. We want our data out of Salesforce so that we can then a) perform any transformations on it, and b) store it elsewhere in an easier-to-query format.

For this, we have two options: call the Salesforce APIs directly, or use a unified API like Vessel that handles authentication and normalizes the data for us.

We’re going to use Vessel. There’s another benefit to using Vessel here: we can also extract data from other services without having to integrate each service’s API separately. This is a critical feature of ETL pipelines. The idea isn’t to just extract data from one place. It is to extract data from your entire suite of tools, transform it all together, and save that combined data for better insight across your organization.

Check out our docs for how to authorize your Salesforce session. We won’t cover that here and instead dive straight into calling the unified API to grab your Salesforce data. Here, we’ll use the Vessel SDK to list all the deals modified on a given day:

import Vessel from "@vesselapi/client-sdk";

[...]

app.post('/list-deals', async (req, res) => {
  // API_KEY is your Vessel API token; accessToken is the token for the
  // Salesforce connection you authorized (see the docs linked above).
  const vessel = Vessel({
    apiToken: API_KEY,
    accessToken,
  });
  // List every deal modified on the given date via the unified CRM API.
  const response = await vessel.unifications.crm.deals.list({
    filters: {
      modifiedTime: {
        equals: '08/24/23',
      },
    },
  });
  const { result, error } = response.body;
  // Return an error response rather than throwing inside the async handler.
  if (error) return res.status(500).send({ error: error.message });
  res.send({ deals: result });
});

You can call this API endpoint every day to update your list of deals. That’s all you need to extract your data straight out of Salesforce.
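To sanity-check the endpoint, you can call it from any HTTP client or script. Here’s a minimal sketch using Node’s built-in fetch, assuming the app above is running locally on port 3000 (swap in your real host and port):

// Hit the /list-deals endpoint defined above and log how many deals came back.
// Assumes the app is listening on http://localhost:3000.
const listDeals = async () => {
  const response = await fetch('http://localhost:3000/list-deals', { method: 'POST' });
  const { deals } = await response.json();
  console.log(`Fetched ${deals.length} deals`);
  return deals;
};

listDeals();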

The T in ETL: Transforming your data for better insight

Once you have extracted your data, you can start to manipulate it for insights into your business. This is called transformation, but really this part of the workflow can be broken down even further: cleaning the raw records, aggregating them, and deriving the metrics you actually care about.

dbt is becoming the standard tool for transforming data. But it has a steep learning curve, so if you are looking to build an ETL pipeline quickly, it’s not the best option. It is the best option once you have that pipeline in place and need to solidify your transformations.

But you can do quick transformations on your data as it comes from Salesforce using custom code. Here, we’ll use built-in JavaScript methods to show how you can go from extracting the data from Salesforce in JS, to transforming it in JS, to loading it in JS. That said, Python has a richer ecosystem for data transformation and analytics.

Let’s say our Salesforce data has value and rep information for each deal, as well as the start date and the closed date for the deal. We can transform the raw data into aggregated metrics to help understand our sales cycle.

First, let’s calculate the total close value for each sales rep:

// 'data' is returned from the list-deals endpoint above

const totalCloseValueByRep = data.reduce((acc, record) => {
  if (record.close_value) {
    acc[record.sales_rep] = (acc[record.sales_rep] || 0) + record.close_value;
  }
  return acc;
}, {});

console.log(totalCloseValueByRep);

Which could output something like the following (the reps and values here are purely illustrative):
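{ Anna: 1153420, Marcus: 876315, Priya: 641280 }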

We can also do the same thing by product:

const totalCloseValueByProduct = data.reduce((acc, record) => {
  if (record.close_value) {
    acc[record.product] = (acc[record.product] || 0) + record.close_value;
  }
  return acc;
}, {});

console.log(totalCloseValueByProduct);

This outputs a similar object, keyed by product instead of sales rep.

Finally, we can work out the sales cycle for each deal and then calculate the average length of a closed won deal:

// first, work out each sales cycle

data.forEach(record => {
  if (record.created_on && record.close_date) {
    const createdDate = new Date(record.created_on);
    const closeDate = new Date(record.close_date);
    const durationDays = (closeDate - createdDate) / (1000 * 60 * 60 * 24);
    record.duration = durationDays;
  } else {
    record.duration = null;
  }
});

console.log(data);

// then get the average close duration across won deals

const wonOpportunities = data.filter(record => record.deal_stage === 'Won' && record.duration !== null);
const totalDuration = wonOpportunities.reduce((sum, record) => sum + record.duration, 0);
const averageCloseDuration = totalDuration / wonOpportunities.length;

console.log(`Average Close Duration for Won Opportunities: ${averageCloseDuration.toFixed(2)} days`);

Average Close Duration for Won Opportunities: 46.94 days

From the data, it's evident that Anna is closing the most value, and "GTXPro" has the highest total close value, indicating it is the most lucrative product. Additionally, on average, it takes about 47 days to close a successful deal. This information can help sales teams set expectations and prioritize which products to focus on.

We want to keep this data, so the next step is to load it into a database.

The L in ETL: Getting your data into storage

The final step is to keep all of your data in a single place. This can be a data warehouse like Snowflake, but it can also be a regular database like PostgreSQL or even S3 blob storage. When you are only pulling in data from a single source like Salesforce, this might seem like overkill. But when you are unifying data from sales, marketing, product, and engineering, having a single repository for all your data and all your transformations makes it easier to get a holistic view of your business.

Let’s get the data above into a Postgres database.

First, you'll need to install the Node Postgres package, pg:

npm install pg

Then, you can use the following code to save your data:

const { Client } = require('pg');

// Define your database connection settings
const client = new Client({
    user: 'your_username',
    host: 'localhost',
    database: 'your_database_name',
    password: 'your_password',
    port: 5432,
});

// Connect to the database
client.connect();

// Assuming data is your original dataset and totalCloseValueByRep contains the transformed values
const saveData = async () => {
    try {
        // Save the original data (use for...of so each insert is actually awaited)
        for (const record of data) {
            await client.query(
                'INSERT INTO original_data(account, opportunity_id, sales_agent, deal_stage, product, close_date, close_value, created_on, week) '
                + 'VALUES($1, $2, $3, $4, $5, $6, $7, $8, $9)'
                , [record.account, record.opportunity_id, record.sales_agent, record.deal_stage, record.product, record.close_date, record.close_value, record.created_on, record.week]);
        }

        // Save the transformed data (e.g., total close value by rep)
        for (const agent in totalCloseValueByRep) {
            await client.query(
                'INSERT INTO transformed_data_sales_rep(sales_rep, total_close_value) '
                + 'VALUES($1, $2)'
            , [agent, totalCloseValueByRep[agent]]);
        }

        // ... Similarly, you can save other transformations like totalCloseValueByProduct, averageCloseDuration, etc.

    } catch (err) {
        console.error('Error saving data:', err);
    } finally {
        // Close the database connection
        await client.end();
    }
};

// Invoke the function
saveData();
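One assumption in the code above is that the two tables already exist. If they don’t, a rough sketch of the schemas could look like this, run once before saveData (the column types here are guesses based on the fields being inserted, so adjust them to your data):

// Assumed schemas for the tables used above; adjust column types to your data.
const createTables = async () => {
    await client.query(`
        CREATE TABLE IF NOT EXISTS original_data (
            account TEXT,
            opportunity_id TEXT,
            sales_agent TEXT,
            deal_stage TEXT,
            product TEXT,
            close_date DATE,
            close_value NUMERIC,
            created_on DATE,
            week TEXT
        )`);
    await client.query(`
        CREATE TABLE IF NOT EXISTS transformed_data_sales_rep (
            sales_rep TEXT,
            total_close_value NUMERIC
        )`);
};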

Finally, you’ll want to schedule this to run each day (make the modifiedTime a variable that updates each day!) so you can constantly update your transformations and data in your database.
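As a sketch of that, you could compute the date at the top of each run instead of hard-coding '08/24/23' (the MM/DD/YY string here just mirrors the format used in the earlier example, so double-check which formats the modifiedTime filter actually accepts):

// Build today's date in the same MM/DD/YY format used in the filter earlier.
const today = new Date();
const mm = String(today.getMonth() + 1).padStart(2, '0');
const dd = String(today.getDate()).padStart(2, '0');
const yy = String(today.getFullYear()).slice(-2);
const modifiedTime = `${mm}/${dd}/${yy}`;

// Inside the /list-deals handler, use the variable in the filter:
const response = await vessel.unifications.crm.deals.list({
  filters: {
    modifiedTime: { equals: modifiedTime },
  },
});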

To do this, you need to schedule a cron job, which is simply a command that runs on a schedule you define. You can use cloud cron job providers such as Cronhub, AWS Lambda with Scheduled Events, or Google Cloud Scheduler.

But you can also just run this on your local machine. On a Unix-like operating system, open the terminal and type crontab -e to edit the cron jobs for the current user.

Add a new line to schedule your script. For instance, to run a file containing the fetchDeals code above every day at 3 am, you'd add:

0 3 * * * /usr/bin/node /path/to/your/script/fetchDeals.js >> /path/to/your/logfile.log 2>&1

And you have a scheduled ETL pipeline.

Building ETL into your pipeline

With just a little bit of JavaScript, you can get an entire ETL pipeline for your Salesforce data up and running in minutes. If you are using the Vessel unified API, then it will be trivial to add other data from your GTM stack and your product to this pipeline for more and more insights.

Insights are what this pipeline is about. It takes your siloed data and brings it together so you get a holistic view of your operations. By integrating diverse data sources, you unlock a deeper understanding of customer journeys, optimize processes, and drive strategic decision-making.