SQL Basics: Solving Data Differentials With LEFT JOINs OnlysteemCreated with Sketch.

in Popular STEM3 years ago

One challenge we face with data involves new and existing data from a source data set to a destination data set. The source and destination may be managed by us or one of them could be managed by us. When we have these data sets, we will often get "updates" to these data - in some cases, these updates will involve new data that don't already exist or may involve existing data that need to be updated. In rarer cases (but still frequent), we may need to remove data that no longer exist in an update. This latter case is rarer because most "differentials" will involve adding new data and updating existing data. For the benefit of these "delete" scenario, I also cover it in the video, Using LEFT JOINs To Solve New and Existing Data Differentials, along with updating and adding new data. This means that this video shows how we can use LEFT JOINs to solve all three of these scenarios. These scenarios come up frequently in business so it's useful to know how we can use a join type to solve them. Keep in mind, this is a tutorial video to show you how this can be done, not a recommendation of whether you should use this or not.

When solving data differentials with LEFT JOINs, the operations matter because they first affect the order of our operations (lower cost to performance). For an example, if we don't need to delete any data from a source to a destination data set because new data don't have records, then we can proceed to order our insert or our update. If we must first delete records in the destination that no longer exist in the source, we'll want to do that first. The big reason is performance - we don't want to perform further operations on records that should no longer exist. As a note, deleting tends to be the operations that occurs less frequently with data differentials - though we sometimes will need to use it. An example of using a delete is if we managed data at a merchant conglomerate and a merchant no longer did business with us; we wouldn't want to keep the merchant data.

If we only need to add new data and update existing data (let's assume that data that no longer exist simply remain without being removed), we'll start with updating data. The reason for this also involves performance. If we have updated data from a source to a destination, we can update matches with a LEFT JOIN where certain columns have changed. If we inserted new data first, we'd have more matches. We always want to return the fewest rows possible, thus updating first is better for performance. Once we update the existing data that have changed, then we'll add the new data.

Can we only solve data differentials with LEFT JOINs? No. We can use merge operations, match data on the application layer, use except functionality supported in some SQL languages, etc. In this tutorial, we see the capability of LEFT JOINs, but they are not the only tool we can use to solve this problem. When we consider solving this problem, we want to consider two key points in development: performance and maintainability. Not every developer may be familiar with the tools we use to solve this problem - we'll want to consider how our solution is maintained. Likewise, we want to consider performance. Relative to our architecture, using LEFT JOINs may not be the most optimal solution. Unfortunately, there is no hard rule about how to solve this problem. But since we see how we can use LEFT JOINs to solve this problem, we have another application of using LEFT JOINs and where they may be appropriate. For more SQL lessons, you can continue to watch the SQL Basics series.