How To: Delta Live Tables with Ask Databricks
A Q&A on Delta Live Tables
In a recent episode of Ask Databricks hosted by Advancing Analytics, special guest Michael Armbrust joined for a Q&A on Delta Live Tables (view on YouTube).
Here are the key takeaways. Enjoy!
🚀 Can you use percent run (%run) in a DLT pipeline? How do you extend your code base for DLT pipelines?
The magic command percent run (%run) isn’t supported in DLT, but you can achieve similar functionality because a pipeline is itself a collection of notebooks and files. For shared code, use standard Python packages or attach Python wheels to your DLT pipeline. DLT gives you options!
🔧 Is Change Data Feed separate from DLT, or does it integrate into a DLT pipeline?
Change Data Capture (CDC) captures changes between systems, and Delta tables have change feeds, which streamlines this process. DLT simplifies schema evolution and ordering with APPLY CHANGES INTO, a game-changer that handles more than just SCD type 1.
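To make the SCD terminology concrete, here is a minimal plain-Python sketch of what "ordering" and "more than just SCD type 1" mean. It is not how DLT implements APPLY CHANGES INTO; the `Change` class, the `seq` ordering column and the `start`/`end` field names are assumptions made up for this illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass
class Change:
    key: str    # business key of the row (hypothetical)
    seq: int    # ordering column, e.g. an event timestamp (hypothetical)
    value: Any  # new payload for that key


def scd_type1(changes: List[Change]) -> Dict[str, Any]:
    """SCD type 1: keep only the latest value per key, judged by seq."""
    latest: Dict[str, Change] = {}
    for change in changes:
        current = latest.get(change.key)
        # Late-arriving changes with a lower seq never overwrite newer ones.
        if current is None or change.seq > current.seq:
            latest[change.key] = change
    return {key: change.value for key, change in latest.items()}


def scd_type2(changes: List[Change]) -> List[Dict[str, Any]]:
    """SCD type 2: keep every version with a validity interval [start, end)."""
    by_key: Dict[str, List[Change]] = {}
    for change in sorted(changes, key=lambda c: (c.key, c.seq)):
        by_key.setdefault(change.key, []).append(change)

    history: List[Dict[str, Any]] = []
    for key, versions in by_key.items():
        # Pair each version with its successor; the last one stays open-ended.
        for cur, nxt in zip(versions, versions[1:] + [None]):
            history.append({
                "key": key,
                "value": cur.value,
                "start": cur.seq,
                "end": nxt.seq if nxt is not None else None,
            })
    return history


# Changes arrive out of order: the "old" value for key "a" shows up last.
events = [Change("a", 2, "new"), Change("b", 1, "x"), Change("a", 1, "old")]
print(scd_type1(events))
print(scd_type2(events))
```

Type 1 ends up with one row per key, while type 2 keeps both versions of key "a" and closes the older one when the newer one starts.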
🤔 How do you handle skeptical non-technical managers when convincing them of using DLT?
“DLT boosts productivity and slashes operational costs.” Explaining how DLT streamlines pipeline management and enhances parallelism can ease concerns, delivering value from data faster.
⚙️ Is there an upper limit to the number of tables in a DLT? When do you start breaking things up into smaller pipelines?
Recommendation: tens to hundreds of tables per DLT pipeline, though this depends heavily on the tables. The Spark driver will be the first to cause problems; fix this by increasing the driver memory. Beyond that, consider splitting the work into separate pipelines for efficiency and scalability.
🛠️ What are the best practices for Medallion Architecture and DLT?
Understanding the nuances of streaming tables and materialized views is key. Streaming tables prioritize performance, while materialized views ensure correctness for transformations. Choose the right tool for the job!
🛡️ Development best practices? How to test a DLT pipeline? Can you work locally?
Always use version control and break up logic into sources and transformations. Testing on Databricks itself is crucial since DLT doesn’t run locally yet. Create a separate testing environment for thorough testing. Tip: use Databricks Asset Bundles (currently in Public Preview) to easily deploy to various environments. For more info, see this blog post on software dev and DevOps for DLT by Databricks.
📈 What’s the performance difference between SQL and Python in DLT?
The performance difference is negligible. SQL and Python result in the same performance because both go through Spark SQL’s logical query planning. Python is great for metaprogramming (e.g., creating 10 tables in a loop), while SQL UDFs are efficient for certain operations (they’re often combinations of already vectorized SQL functions).
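As a rough sketch of the metaprogramming point, the snippet below defines one table per region in a loop. The source table `sales_raw`, the `region` column and the region values are hypothetical. The `dlt` module is only importable inside a DLT pipeline, so the import is guarded and the loop is skipped anywhere else.

```python
try:
    import dlt  # provided by the DLT runtime
    IN_DLT = hasattr(dlt, "table")
except ImportError:
    dlt = None
    IN_DLT = False

REGIONS = ["emea", "amer", "apac"]  # hypothetical values


def table_name(region: str) -> str:
    """Name of the generated table for one region."""
    return f"sales_{region}"


def define_region_table(region: str) -> None:
    """Register one DLT table for a region.

    Wrapping the definition in a function binds `region` per call; defining
    the table directly in the loop body would make every table see the last
    loop value when the pipeline evaluates the functions later.
    """
    @dlt.table(name=table_name(region), comment=f"Sales for region {region}")
    def _region_table():
        # `sales_raw` is an assumed upstream dataset in the same pipeline.
        return dlt.read("sales_raw").where(f"region = '{region}'")


if IN_DLT:
    for region in REGIONS:
        define_region_table(region)
```

The same pattern scales to 10 or more tables by extending the list, or by reading the list from configuration.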
💡 Any plans for a GUI for developing? A point-and-click interface?
If you desire a GUI, let your Databricks Representative know! That being said, while GUIs are great for demos, they can be challenging for real-life collaboration and version control.
🔄 Is it possible to run a table with all its downstream and upstream dependencies?
Currently, the Refresh Selected feature allows selecting tables to refresh in a DLT pipeline. However, direct dependency selection isn’t supported. [personal side note: dbt does offer this capability using the plus (+) sign: dbt run --select +my_table+]
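If you know your pipeline's dependency graph, you can work out the equivalent of `+my_table+` yourself and then pick those tables by hand. The sketch below is plain Python and does not call any DLT or dbt API; the table names and the `DEPENDS_ON` mapping are invented for illustration.

```python
from typing import Dict, List, Set

# Hypothetical pipeline: each table maps to the tables it reads from.
DEPENDS_ON: Dict[str, List[str]] = {
    "bronze_orders": [],
    "bronze_customers": [],
    "silver_orders": ["bronze_orders"],
    "silver_customers": ["bronze_customers"],
    "gold_revenue": ["silver_orders", "silver_customers"],
}


def _reachable(start: str, edges: Dict[str, List[str]]) -> Set[str]:
    """All nodes reachable from `start` by following `edges` (start excluded)."""
    seen: Set[str] = set()
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def with_dependencies(table: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """The table plus everything upstream and downstream of it."""
    # Invert the graph so downstream tables can be walked the same way.
    feeds_into: Dict[str, List[str]] = {}
    for child, parents in depends_on.items():
        for parent in parents:
            feeds_into.setdefault(parent, []).append(child)

    upstream = _reachable(table, depends_on)
    downstream = _reachable(table, feeds_into)
    return {table} | upstream | downstream


print(sorted(with_dependencies("silver_orders", DEPENDS_ON)))
```

Note that `silver_customers` is not selected for `silver_orders`: like dbt's plus operators, this picks the ancestors and descendants of the chosen table, not the other inputs of its descendants.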
🛠️ When would you use dbt or DLT?
You can use both! Use DLT for streaming tables or materialized views and add dbt for SQL queries on top of those. In general, choose based on your specific use case and requirements.