[{"content":"","date":"5 April 2026","externalUrl":null,"permalink":"/blog/","section":"Blog","summary":"","title":"Blog","type":"blog"},{"content":"","date":"5 April 2026","externalUrl":null,"permalink":"/categories/","section":"Categories","summary":"","title":"Categories","type":"categories"},{"content":" About Me # Data engineer with 10+ years of software development experience, now specializing in ETL pipeline architecture, analytics platform modernization, and MLOps. Expertise in transforming manual processes into automated, production-grade data solutions with a strong technical foundation in distributed systems.\n","date":"5 April 2026","externalUrl":null,"permalink":"/","section":"Damien GOEHRIG","summary":"\u003ch1 class=\"relative group\"\u003eAbout Me\n    \u003cdiv id=\"about-me\" class=\"anchor\"\u003e\u003c/div\u003e\n    \n    \u003cspan\n        class=\"absolute top-0 w-6 transition-opacity opacity-0 -start-6 not-prose group-hover:opacity-100 select-none\"\u003e\n        \u003ca class=\"text-primary-300 dark:text-neutral-700 !no-underline\" href=\"#about-me\" aria-label=\"Anchor\"\u003e#\u003c/a\u003e\n    \u003c/span\u003e\n    \n\u003c/h1\u003e\n\u003cp\u003eData engineer with 10+ years of software development experience, now specializing in ETL pipeline architecture, analytics platform modernization, and MLOps. 
Expertise in transforming manual processes into automated, production-grade data solutions with a strong technical foundation in distributed systems.\u003c/p\u003e","title":"Damien GOEHRIG","type":"page"},{"content":"","date":"5 April 2026","externalUrl":null,"permalink":"/categories/data-engineering/","section":"Categories","summary":"","title":"Data Engineering","type":"categories"},{"content":"","date":"5 April 2026","externalUrl":null,"permalink":"/tags/data-engineering/","section":"Tags","summary":"","title":"Data Engineering","type":"tags"},{"content":"","date":"5 April 2026","externalUrl":null,"permalink":"/tags/dbt/","section":"Tags","summary":"","title":"Dbt","type":"tags"},{"content":"Publishing a package on PyPI. It\u0026rsquo;s one of those things that looks intimidating from the outside, but turns out to be a matter of good timing and a precise enough problem to solve.\nThe Problem # When you work with dbt, you eventually have a DAG (directed acyclic graph) with dozens or hundreds of models depending on each other. The thing is, dbt doesn\u0026rsquo;t tell you when you break something downstream. You rename a column in a silver model, push your PR, the tests pass\u0026hellip; and three days later, someone realizes a gold dashboard is broken because it referenced that column.\nIt\u0026rsquo;s the kind of silent problem. No compilation error. No failing test. Just a downstream consumer left with missing data.\nDoesn\u0026rsquo;t Something Already Exist? # Before diving in, I did what everyone does: searched for whether someone had already solved the problem.\nThere are a few tools in the ecosystem that come close:\ndbt Core 1.5+ with model contracts: if you declare contract: enforced: true on a model, dbt detects breaking changes at run time. But that requires explicit configuration model by model. On an existing project, retrofitting everything isn\u0026rsquo;t realistic. 
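For reference, here is roughly what enforcing a contract on one model looks like in its YAML (model and column names are invented for illustration; `contract: enforced: true` is the actual dbt 1.5+ syntax, and contracts require an explicit data_type per column):

```yaml
models:
  - name: silver_orders        # hypothetical model name
    config:
      contract:
        enforced: true         # dbt now fails the build on schema drift
    columns:                   # every column must be listed, with a type
      - name: order_id
        data_type: number
      - name: status
        data_type: varchar
```

Multiply that by a few hundred models and the retrofitting objection becomes concrete.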
An LLM can refactor your project to add the contracts, but I don\u0026rsquo;t love this use of contracts. I see them more like an OpenAPI spec (developer view) than a breaking change validation tool (ops view). Recce: the most complete open source tool for dbt change validation. It uses SQLGlot to analyze breaking changes, it\u0026rsquo;s serious and active. But its workflow is designed to compare two connected environments (dev and prod) with database access. dbt-manifest-differ: compares two manifests, but to debug why dbt marked a node as state:modified. Not to detect column-level breaking changes. None of them addressed my exact need: compare two manifests offline, without a Snowflake connection, without preconfiguring models.\nThe reason this gap exists, I think, is a combination of factors. Teams with DAGs large enough to suffer from this problem are often on dbt Cloud, which has breaking change detection behind its paid tier. And in the data engineering culture, tooling connects to the warehouse by reflex. The idea of static analysis on local JSON files is counter-intuitive in this ecosystem.\nThere\u0026rsquo;s also an implicit prerequisite: for static analysis on the manifest to work, columns need to be documented in YAML files. Which may not be the case in many dbt projects.\nThe other motivation is cost. A classic dbt CI pipeline runs a full dbt build. That means Snowflake compute, warehouses spinning up, credits burning. Every PR, every push. On a data stack of reasonable size, that adds up fast. And for what? To validate a column rename that could have been detected without ever touching Snowflake.\nThe Idea # I wanted something simple: a tool that compares two versions of a dbt manifest (the base branch vs the PR) and tells you \u0026ldquo;warning, you removed/renamed column X in model Y, and models Z1, Z2, Z3 depend on it.\u0026rdquo;\nNo database connection required. Zero Snowflake compute. 100% offline. 
Just static analysis on the manifest.json files that dbt already generates. You run dbt parse (which is near-instant and doesn\u0026rsquo;t touch your database), compare the manifests, and you\u0026rsquo;re done.\nHow It Works # The principle is fairly direct:\nYou give it two folders: one with the manifest from the main branch, the other with the one from your PR It parses both manifests and extracts columns from each model It compares and identifies changes: deleted columns, renames, type changes For breaking changes, it traverses the DAG in BFS (breadth-first search) to find all impacted downstream models Optionally, it traces column-level lineage to eliminate false positives: if a downstream model doesn\u0026rsquo;t reference the modified column, no alert The distinction between breaking and non-breaking is simple:\nBreaking: deleted column, rename, type change Non-breaking: new column, new model dbt-guard diff --base target/base --current target/current --dialect snowflake It outputs a report in text, JSON, or as GitHub Actions annotations. The last format is handy: alerts appear directly on the code lines in the PR.\nColumn-Level Lineage # This is the feature that gave me the most trouble, and also the one that makes the tool genuinely useful.\nWithout column lineage, if you modify a column in an upstream model, all downstream models get flagged as potentially impacted. With hundreds of models, that generates a lot of noise.\nWith lineage, dbt-guard traces which downstream columns actually reference the modified column. If a downstream model does SELECT col_a, col_b and you modified col_c, no alert. It\u0026rsquo;s SQLGlot doing this by parsing the SQL and building the dependency tree.\nObviously this has its limits. SELECT * is the classic case that complicates things. When a model does a SELECT *, you can\u0026rsquo;t statically know which columns it actually consumes. And some complex SQL patterns can fool the parser. 
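The diff-plus-BFS core described under "How It Works" can be sketched in a few lines of plain Python. This is an illustrative reimplementation, not dbt-guard's actual code: `nodes`, `columns`, `data_type`, and `child_map` are real manifest.json keys, but the function names are invented for the example.

```python
from collections import deque

def model_columns(manifest):
    """Map model unique_id -> {column_name: data_type} from a dbt manifest."""
    return {
        uid: {name: col.get("data_type")
              for name, col in node.get("columns", {}).items()}
        for uid, node in manifest.get("nodes", {}).items()
        if node.get("resource_type") == "model"
    }

def breaking_changes(base_manifest, current_manifest):
    """List (model, column, kind) tuples for removed columns and type changes."""
    base, curr = model_columns(base_manifest), model_columns(current_manifest)
    changes = []
    for uid, cols in base.items():
        for name, dtype in cols.items():
            if name not in curr.get(uid, {}):
                changes.append((uid, name, "removed"))
            elif dtype and curr[uid][name] and dtype != curr[uid][name]:
                changes.append((uid, name, "type_changed"))
    return changes

def downstream(manifest, start):
    """BFS over child_map to collect every model impacted downstream of `start`."""
    children, seen, queue = manifest.get("child_map", {}), set(), deque([start])
    while queue:
        for child in children.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen
```

Column-level lineage then prunes that downstream set to the models that actually reference the changed column.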
But for the majority of cases, it works and significantly reduces noise.\nPublishing on PyPI # First Python package published. The process itself isn\u0026rsquo;t that mysterious: a properly configured pyproject.toml, some metadata, and pip install build twine \u0026amp;\u0026amp; python -m build \u0026amp;\u0026amp; twine upload dist/*. Still satisfying to see your package appear on pypi.org and be able to run pip install dbt-guard.\nI put the standard quality gates in place: pytest for tests, mypy for typing, ruff for linting, 80% coverage threshold. Tests use synthetic fixtures rather than real databases, consistent with the \u0026ldquo;no connection required\u0026rdquo; philosophy of the tool.\nWhat I Learned # A few lessons in no particular order:\nGraceful degradation matters. SQLGlot can\u0026rsquo;t parse every imaginable SQL pattern. Rather than crashing, dbt-guard falls back to the columns documented in the manifest when SQL parsing fails. Not perfect, but better than a blocking error.\nKeep dependencies minimal. Every dependency you add is a potential source of version conflicts in someone else\u0026rsquo;s environment. With just SQLGlot and Click, the chances of conflicts are low.\nSynthetic fixtures for tests. No need for a real Snowflake database to test a static analysis tool. Hand-crafted JSON manifests do the job and tests run in seconds.\nWhy Documentation as a Prerequisite Isn\u0026rsquo;t a Problem # One prerequisite for dbt-guard is that your columns are documented in YAML files. For someone coming from software development, this seems obvious: documentation is part of an API\u0026rsquo;s contract, not a nice-to-have. But in data engineering, it\u0026rsquo;s far from the norm.\nMy view: with LLMs, there\u0026rsquo;s no longer an excuse. Manually documenting hundreds of columns was painful. Today, an agent can read your SQL, infer business context and generate a first draft of YAML documentation in seconds.
I go into more detail on this in the article on dbt documentation as governance.\nAnd in CI, enforcement is automatic via dbt_meta_testing: if a PR adds a model or column without a description, it doesn\u0026rsquo;t pass. Documentation isn\u0026rsquo;t optional, it\u0026rsquo;s enforced the same way tests are.\nA dbt Project Is Testable Infrastructure, Offline # What convinced me to go with the static approach is a simple analogy: a dbt project is a representation of your data stack\u0026rsquo;s infrastructure. The DAG, columns, dependencies, all of this lives in JSON and YAML files. Like Terraform for your data.\nAnd infrastructure can be validated offline. terraform plan doesn\u0026rsquo;t touch your cloud. dbt parse doesn\u0026rsquo;t touch Snowflake. The resulting manifest is a complete description of what the project is supposed to do.\nTesting that manifest means testing infrastructure before deploying it. It doesn\u0026rsquo;t replace tests on real data (uniqueness constraints, null values, freshness), but it lets you be strict about pipeline structure without any compute cost. It\u0026rsquo;s an upstream filter, fast and free, before you ever touch the database.\nThe Cost Argument # I\u0026rsquo;ll come back to this because it\u0026rsquo;s an underestimated point. A \u0026ldquo;standard\u0026rdquo; dbt CI running dbt build --target ci involves:\nA Snowflake warehouse waking up Models materializing (even in CI mode) Tests running on real data Credits leaving on every PR With dbt-guard, breaking change detection costs exactly zero Snowflake credits. It runs on the CI runner itself, in seconds, with local JSON files. It doesn\u0026rsquo;t replace your dbt CI (you still need that to validate logic), but it catches an entire category of problems before ever touching your database. It\u0026rsquo;s a fast, free upstream filter.\nWhat\u0026rsquo;s Next? # The tool does what it\u0026rsquo;s supposed to do. 
It\u0026rsquo;s a guardrail that integrates into your CI and blocks (or warns) when a PR risks breaking downstream consumers. No more, no less.\nIt\u0026rsquo;s my first published Python package. It\u0026rsquo;s not revolutionary, it\u0026rsquo;s not a framework that will change the world. It\u0026rsquo;s a tool that solves a specific problem I had, that nobody else had solved in open source yet, and that might be useful to others.\nThe code is on GitHub, under the Apache 2.0 license. If you work with dbt and have ever had the pleasure of discovering a breaking change in production on a Friday night, it might be worth checking out.\n","date":"5 April 2026","externalUrl":null,"permalink":"/blog/dbt-guard-package-python/","section":"Blog","summary":"\u003cp\u003ePublishing a package on PyPI. It\u0026rsquo;s one of those things that looks intimidating from the outside, but turns out to be a matter of good timing and a precise enough problem to solve.\u003c/p\u003e","title":"dbt-guard: My First Python Package (and Why I Needed It)","type":"blog"},{"content":"","date":"5 April 2026","externalUrl":null,"permalink":"/categories/open-source/","section":"Categories","summary":"","title":"Open Source","type":"categories"},{"content":"","date":"5 April 2026","externalUrl":null,"permalink":"/tags/open-source/","section":"Tags","summary":"","title":"Open Source","type":"tags"},{"content":"","date":"5 April 2026","externalUrl":null,"permalink":"/tags/python/","section":"Tags","summary":"","title":"Python","type":"tags"},{"content":"","date":"5 April 2026","externalUrl":null,"permalink":"/tags/","section":"Tags","summary":"","title":"Tags","type":"tags"},{"content":"","date":"27 February 2026","externalUrl":null,"permalink":"/categories/data-quality/","section":"Categories","summary":"","title":"Data Quality","type":"categories"},{"content":"","date":"27 February 2026","externalUrl":null,"permalink":"/tags/data-quality/","section":"Tags","summary":"","title":"Data 
Quality","type":"tags"},{"content":"Tu connais cette sensation : un rapport qui sort des chiffres bizarres, un analyste qui te dit \u0026ldquo;les totaux matchent pas\u0026rdquo;, et tu passes ta journée à remonter la chaîne pour trouver où les données ont dérapé. Souvent, le problème aurait pu être détecté automatiquement si quelqu\u0026rsquo;un avait mis un test quelque part.\nLes tests déclaratifs dans dbt # dbt a un système de tests intégré directement dans les YAML de documentation. C\u0026rsquo;est le même fichier qui documente tes colonnes et qui déclare tes tests. L\u0026rsquo;idée est simple : tu décris tes attentes sur les données, et dbt les vérifie à chaque exécution.\nLes quatre tests natifs :\nnot_null : cette colonne ne devrait jamais être vide unique : pas de doublons sur cette colonne accepted_values : les seules valeurs possibles sont cette liste relationships : cette colonne référence une autre table (intégrité référentielle) C\u0026rsquo;est déclaratif. Tu n\u0026rsquo;écris pas de SQL de test, tu déclares des contraintes.\nAu-delà des tests de base # Les quatre tests de base couvrent une bonne partie des besoins, mais pas tout. Pour le reste, il y a les packages de tests et les tests custom.\nLes tests de combinaison. \u0026ldquo;Cette combinaison de colonnes doit être unique.\u0026rdquo; Par exemple, une commande ne devrait apparaître qu\u0026rsquo;une seule fois par date et par client. C\u0026rsquo;est pas un simple unique sur une colonne, c\u0026rsquo;est une contrainte composite. dbt-utils fournit unique_combination_of_columns pour ça.\nLes tests de distribution. \u0026ldquo;Cette colonne ne devrait pas avoir plus de X% de valeurs nulles.\u0026rdquo; Utile pour les colonnes qui peuvent être nulles mais ne devraient pas l\u0026rsquo;être trop souvent.\nLes tests de fraîcheur. 
\u0026ldquo;La donnée la plus récente dans cette source ne devrait pas avoir plus de 24 heures.\u0026rdquo; Techniquement c\u0026rsquo;est un mécanisme séparé dans dbt (dbt source freshness), mais ça se déclare au même endroit dans les YAML. Si ta source arrête d\u0026rsquo;envoyer des données et que personne ne le remarque pendant une semaine, t\u0026rsquo;as un problème.\nLes tests de cohérence. \u0026ldquo;Le sous-total + taxes + livraison devrait être égal au total de commande.\u0026rdquo; C\u0026rsquo;est le genre de test qui attrape les bugs d\u0026rsquo;arrondi et les incohérences de calcul avant qu\u0026rsquo;un client ou fournisseur te le signale.\nTests sur les sources : la première ligne de défense # Un pattern que j\u0026rsquo;apprécie particulièrement : tester les données source, pas juste les modèles transformés.\nQuand tes données arrivent dans Snowflake via un outil de réplication (genre Fivetran, Airbyte), tu n\u0026rsquo;as aucune garantie sur leur qualité. Le système source peut avoir des bugs. La réplication peut avoir des problèmes. Les types peuvent changer sans prévenir.\nEn mettant des tests directement sur les définitions de sources dans dbt, tu crées une première ligne de défense :\nEst-ce que les colonnes que tu attends sont toujours là ? Est-ce que les types sont corrects ? Est-ce que les IDs sont bien uniques ? Est-ce qu\u0026rsquo;il y a des données récentes ? Quand un test source fail, ça te dit \u0026ldquo;le problème vient d\u0026rsquo;en amont, pas de ta transformation.\u0026rdquo; C\u0026rsquo;est de l\u0026rsquo;information précieuse pour le debug.\nEn pratique, c\u0026rsquo;est souvent le meilleur moyen de découvrir qu\u0026rsquo;un collègue du côté dev a fait un changement sur l\u0026rsquo;un de ses services sans passer par la case \u0026ldquo;prévenir l\u0026rsquo;équipe data\u0026rdquo;. Une colonne renommée, un nouveau statut ajouté, un type qui change silencieusement en prod. 
Sans tests sur les sources, tu le découvres quand un dashboard est cassé. Avec, tu catches rapidement pourquoi la dynamic table et tout le lineage fail. Tu roules tes tests, ça te donne une première piste, et tu peux aller poser la bonne question à la bonne équipe avant de partir débugger dans le mauvais sens.\nLa CI comme filet de sécurité # Les tests ne servent à rien si personne ne les roule. La pipeline CI est là pour ça.\nChaque PR qui touche aux modèles dbt déclenche un cycle complet :\nBuild des modèles en environnement CI Exécution de tous les tests Validation de la documentation (complétude) Si tout passe, la PR peut être mergée Le point clé : la CI fail si un test fail. Pas de warning ignoré, pas de \u0026ldquo;on corrigera plus tard.\u0026rdquo; Si tes données ne passent pas les contraintes que tu as déclarées, le code ne va pas en production.\nCI sur Snowflake : quelques réglages pour ne pas saigner des crédits # Si ta CI build une stack éphémère complète à chaque PR, il y a quelques réglages qui font une vraie différence sur la facture.\nDimensionner les warehouses selon le volume de CI. Un XS suffit pour la plupart des builds CI, inutile de sur-dimensionner. Le vrai paramètre à ajuster, c\u0026rsquo;est combien de runs parallèles tu peux avoir en simultané sur une journée chargée.\nUtiliser des databases transient pour la CI. Une database transient dans Snowflake ne conserve pas de Fail-Safe (la rétention de données en cas de corruption ou suppression accidentelle qui est activée par défaut sur les tables standard). Pour de la donnée CI qui est de toute façon recréée à chaque run, payer pour le Fail-Safe n\u0026rsquo;a aucun sens. Déclarer la database cible de la CI comme transient coupe ce coût sans aucun impact fonctionnel.\nNettoyer proprement à la fin. Le step de cleanup de ta CI doit dropper la database entière, pas juste les tables créées pendant le run. 
Un pipeline qui plante à mi-chemin sans cleanup laisse des objets orphelins qui tournent, notamment les Dynamic Tables, qui continuent à se rafraîchir et à consommer des crédits jusqu\u0026rsquo;à ce que quelqu\u0026rsquo;un les supprime manuellement. Vérifier de temps en temps que des vieilles CI databases ne traînent pas est une bonne hygiène.\nOverrider le lag des Dynamic Tables en CI. Par défaut, une Dynamic Table déployée en CI va essayer de se rafraîchir selon son lag cible : toutes les heures, toutes les 5 minutes, selon ce qui est déclaré en prod. En CI, tu veux exactement le contraire : qu\u0026rsquo;elles ne se rafraîchissent jamais toutes seules. La solution est d\u0026rsquo;overrider le target_lag à une valeur longue (genre 8760 hours, soit un an) dans ton profil CI. La table est créée, les tests tournent sur le contenu initial, et aucun refresh automatique ne vient perturber ou prolonger l\u0026rsquo;exécution.\nUtiliser --defer avec un manifest de prod. C\u0026rsquo;est probablement l\u0026rsquo;optimisation la plus impactante. dbt a une option --defer qui, combinée avec le manifest de la branche principale, permet de ne builder que les modèles modifiés dans la PR. Pour les modèles non touchés, dbt les \u0026ldquo;proxie\u0026rdquo; vers la version prod existante plutôt que de les recréer from scratch. Une PR qui modifie 3 modèles dans un DAG de 200 ne build que ces 3 modèles et leurs dépendants directs, pas la stack entière. Le gain en temps et en crédits est considérable sur les projets de taille respectable.\nLes tests comme documentation vivante # Ce qui est élégant avec les tests déclaratifs dans les YAML, c\u0026rsquo;est qu\u0026rsquo;ils servent aussi de documentation. 
Quand tu vois :\n- name: status description: \u0026#34;Statut de la commande\u0026#34; tests: - not_null - accepted_values: values: [\u0026#39;PENDING\u0026#39;, \u0026#39;PROCESSING\u0026#39;, \u0026#39;SHIPPED\u0026#39;, \u0026#39;DELIVERED\u0026#39;, \u0026#39;CANCELLED\u0026#39;] Tu sais immédiatement trois choses :\nCe que la colonne contient (description) Qu\u0026rsquo;elle ne peut pas être vide (not_null) Quelles sont ses valeurs possibles (accepted_values) C\u0026rsquo;est de la documentation qui se vérifie automatiquement. Quand un nouveau statut de commande apparaît dans les données, le test fail, la documentation est mise à jour, et tout le monde est au courant.\nLes erreurs qui m\u0026rsquo;ont convaincu # Quelques exemples concrets de problèmes attrapés par des tests :\nLe type boolean fantôme. Une colonne qui devrait être boolean mais qui contient des NULL en plus de true/false. Le code source traite NULL comme false, mais ta transformation dbt ne fait pas forcément pareil. Un test accepted_values: [true, false] combiné à not_null clarifie l\u0026rsquo;intention.\nL\u0026rsquo;ID en double. Un système source qui, suite à un bug de migration, a dupliqué quelques milliers d\u0026rsquo;enregistrements. Sans test unique, ces doublons se propagent silencieusement dans toute la chaîne de transformation.\nCe que j\u0026rsquo;en retiens # L\u0026rsquo;approche qui a marché pour moi part d\u0026rsquo;une hiérarchie simple.\nEn premier : les contraintes YAML. not_null, unique, accepted_values, relationships. Ces tests ne sont pas séparables de la documentation : ils sont la documentation. Déclarer qu\u0026rsquo;une colonne status accepte ['PENDING', 'SHIPPED', 'DELIVERED'], c\u0026rsquo;est à la fois documenter le contrat et le vérifier à chaque run. Le coût est quasi nul (une ligne de YAML), et ça donne une couverture de base sur toutes les colonnes sans effort particulier. C\u0026rsquo;est le minimum non négociable.\nEnsuite seulement : les tests complexes. 
Tests de cohérence, de distribution, de mapping : ceux qui nécessitent du SQL custom ou des packages comme dbt-utils. Ceux-là sont précieux, mais ils ont un coût : il faut les relire.\nC\u0026rsquo;est là que j\u0026rsquo;ai appris à mes dépens : déléguer la génération de tests à un LLM sans passer par une relecture sérieuse, c\u0026rsquo;est se retrouver avec de la fausse couverture. Des tests qui s\u0026rsquo;exécutent, qui passent, et qui ne testent pas vraiment ce qu\u0026rsquo;ils prétendent tester. C\u0026rsquo;est pire que pas de tests du tout, parce que ça donne une confiance non méritée. J\u0026rsquo;ai relu des tests générés par LLM que je n\u0026rsquo;avais pas vérifiés au moment du merge, et certains étaient tout simplement à côté de la plaque : logique inversée, mauvaise table référencée, seuil arbitraire sans sens métier. Ça passe mais ça ne sert à rien.\nLa règle que j\u0026rsquo;applique maintenant : les contraintes YAML, toujours, systématiquement, vérifiées en CI. Les tests complexes, seulement quand j\u0026rsquo;ai le temps de les relire ligne par ligne avant de les merger. Une couverture de 30% de tests bien compris vaut mieux qu\u0026rsquo;une couverture de 80% de tests dont personne ne sait vraiment ce qu\u0026rsquo;ils vérifient.\n","date":"27 February 2026","externalUrl":null,"permalink":"/fr/blog/dbt-tests-contraintes-yml/","section":"Blog","summary":"\u003cp\u003eTu connais cette sensation : un rapport qui sort des chiffres bizarres, un analyste qui te dit \u0026ldquo;les totaux matchent pas\u0026rdquo;, et tu passes ta journée à remonter la chaîne pour trouver où les données ont dérapé. 
Souvent, le problème aurait pu être détecté automatiquement si quelqu\u0026rsquo;un avait mis un test quelque part.\u003c/p\u003e","title":"dbt : Les tests dans les YAML, ou comment arrêter de prier pour que les données soient correctes","type":"blog"},{"content":"You know the feeling: a report spitting out weird numbers, an analyst telling you \u0026ldquo;the totals don\u0026rsquo;t match,\u0026rdquo; and you spend your day tracing back up the chain to find where the data went wrong. Often, the problem could have been detected automatically if someone had put a test somewhere.\nDeclarative Tests in dbt # dbt has a testing system built directly into the documentation YAMLs. It\u0026rsquo;s the same file that documents your columns and declares your tests. The idea is simple: you describe your expectations about the data, and dbt verifies them at every execution.\nThe four native tests:\nnot_null: this column should never be empty unique: no duplicates on this column accepted_values: the only possible values are this list relationships: this column references another table (referential integrity) It\u0026rsquo;s declarative. You don\u0026rsquo;t write test SQL, you declare constraints.\nBeyond Basic Tests # The four basic tests cover a good portion of needs, but not everything. For the rest, there are test packages and custom tests.\nCombination tests. \u0026ldquo;This combination of columns must be unique.\u0026rdquo; For example, an order should only appear once per date and per customer. It\u0026rsquo;s not a simple unique on one column, it\u0026rsquo;s a composite constraint. dbt-utils provides unique_combination_of_columns for this.\nDistribution tests. \u0026ldquo;This column should not have more than X% null values.\u0026rdquo; Useful for columns that can be null but shouldn\u0026rsquo;t be null too often.\nFreshness tests. 
\u0026ldquo;The most recent data in this source should not be more than 24 hours old.\u0026rdquo; Technically this is a separate mechanism in dbt (dbt source freshness), but it\u0026rsquo;s declared in the same place in the YAMLs. If your source stops sending data and nobody notices for a week, you have a problem.\nConsistency tests. \u0026ldquo;Subtotal + taxes + shipping should equal the order total.\u0026rdquo; This is the kind of test that catches rounding bugs and calculation inconsistencies before a customer or supplier reports them to you.\nTests on Sources: The First Line of Defense # A pattern I particularly appreciate: testing source data, not just transformed models.\nWhen your data arrives in Snowflake via a replication tool (like Fivetran, Airbyte), you have no guarantee about its quality. The source system can have bugs. Replication can have issues. Types can change without warning.\nBy putting tests directly on source definitions in dbt, you create a first line of defense:\nAre the columns you expect still there? Are the types correct? Are IDs properly unique? Is there recent data? When a source test fails, it tells you \u0026ldquo;the problem comes from upstream, not from your transformation.\u0026rdquo; That\u0026rsquo;s valuable information for debugging.\nIn practice, this is often the best way to discover that a colleague on the dev side made a change to one of their services without going through the \u0026ldquo;notify the data team\u0026rdquo; step. A renamed column, a new status added, a type that silently changes in production. Without source tests, you discover it when a dashboard is broken. With them, you catch quickly why the dynamic table and the whole lineage are failing. You run your tests, they give you a first lead, and you can go ask the right question to the right team before debugging in the wrong direction.\nCI as a Safety Net # Tests are useless if nobody runs them. 
The CI pipeline is there for that.\nEvery PR touching dbt models triggers a complete cycle:\nBuild models in CI environment Execute all tests Validate documentation completeness If everything passes, the PR can be merged The key point: CI fails if a test fails. No ignored warnings, no \u0026ldquo;we\u0026rsquo;ll fix it later.\u0026rdquo; If your data doesn\u0026rsquo;t pass the constraints you declared, the code doesn\u0026rsquo;t go to production.\nCI on Snowflake: A Few Settings to Stop Bleeding Credits # If your CI builds a complete ephemeral stack on every PR, a few settings make a real difference on the bill.\nSize warehouses to CI volume. An XS is enough for most CI builds, no need to oversize. The real parameter to tune is how many parallel runs you can have simultaneously on a busy day.\nUse transient databases for CI. A transient database in Snowflake doesn\u0026rsquo;t retain Fail-Safe (the data retention in case of corruption or accidental deletion that\u0026rsquo;s enabled by default on standard tables). For CI data that gets recreated on every run anyway, paying for Fail-Safe makes no sense. Declaring the CI target database as transient cuts this cost with no functional impact.\nClean up properly at the end. Your CI cleanup step must drop the entire database, not just the tables created during the run. A pipeline that crashes midway without cleanup leaves orphaned objects running, especially Dynamic Tables which keep refreshing and consuming credits until someone manually deletes them. Periodically checking that old CI databases aren\u0026rsquo;t hanging around is good hygiene.\nOverride Dynamic Table lag in CI. By default, a Dynamic Table deployed in CI will try to refresh according to its target lag: every hour, every 5 minutes, whatever is declared for production. In CI, you want the exact opposite: for them to never refresh automatically. The solution is to override target_lag to a long value (like 8760 hours, meaning one year) in your CI profile. 
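One way to wire that override is an environment-variable-driven config in dbt_project.yml; the project and folder names below are invented, but `env_var`, `+materialized: dynamic_table`, and `+target_lag` are standard dbt / dbt-snowflake config:

```yaml
# dbt_project.yml -- target_lag is resolved at parse time.
# The CI job exports DBT_TARGET_LAG="8760 hours" so Dynamic Tables
# never self-refresh during a run; prod keeps the default.
models:
  my_project:                  # hypothetical project name
    marts:                     # hypothetical folder of Dynamic Table models
      +materialized: dynamic_table
      +target_lag: "{{ env_var('DBT_TARGET_LAG', '5 minutes') }}"
```

The default value keeps local and prod runs on the intended lag without any extra configuration.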
The table is created, tests run on the initial content, and no automatic refresh comes to disrupt or extend execution.\nUse --defer with a production manifest. This is probably the most impactful optimization. dbt has a --defer option that, combined with the main branch manifest, lets you only build the models modified in the PR. For unmodified models, dbt \u0026ldquo;proxies\u0026rdquo; them to the existing production version instead of recreating them from scratch. A PR that modifies 3 models in a 200-model DAG builds only those 3 models and their direct dependents, not the entire stack. The time and credit savings are considerable on larger projects.\nTests as Living Documentation # What\u0026rsquo;s elegant about declarative tests in YAMLs is that they also serve as documentation. When you see:\n- name: status description: \u0026#34;Order status\u0026#34; tests: - not_null - accepted_values: values: [\u0026#39;PENDING\u0026#39;, \u0026#39;PROCESSING\u0026#39;, \u0026#39;SHIPPED\u0026#39;, \u0026#39;DELIVERED\u0026#39;, \u0026#39;CANCELLED\u0026#39;] You immediately know three things:\nWhat the column contains (description) That it can\u0026rsquo;t be empty (not_null) What its possible values are (accepted_values) It\u0026rsquo;s documentation that verifies itself automatically. When a new order status appears in the data, the test fails, the documentation gets updated, and everyone knows about it.\nThe Errors That Convinced Me # A few concrete examples of problems caught by tests:\nThe phantom boolean type. A column that should be boolean but contains NULL in addition to true/false. The source code treats NULL as false, but your dbt transformation doesn\u0026rsquo;t necessarily do the same. An accepted_values: [true, false] test combined with not_null clarifies the intent.\nThe duplicate ID. A source system that, following a migration bug, duplicated a few thousand records. 
Without a unique test, these duplicates silently propagate through the entire transformation chain.\nWhat I Take Away # The approach that worked for me starts with a simple hierarchy.\nFirst: YAML constraints. not_null, unique, accepted_values, relationships. These tests aren\u0026rsquo;t separable from documentation: they are the documentation. Declaring that a status column accepts ['PENDING', 'SHIPPED', 'DELIVERED'] is both documenting the contract and verifying it at every run. The cost is near zero (one line of YAML), and it gives basic coverage across all columns with no particular effort. This is the non-negotiable minimum.\nThen and only then: complex tests. Consistency tests, distribution tests, mapping tests: those requiring custom SQL or packages like dbt-utils. These are valuable, but they have a cost: you have to re-read them.\nThat\u0026rsquo;s where I learned the hard way: delegating test generation to an LLM without serious review means ending up with false coverage. Tests that execute, that pass, and that don\u0026rsquo;t actually test what they claim to test. That\u0026rsquo;s worse than no tests at all, because it gives unearned confidence. I reviewed LLM-generated tests I hadn\u0026rsquo;t checked at merge time, and some were simply off the mark: inverted logic, wrong table referenced, arbitrary threshold with no business meaning. They pass but serve no purpose.\nThe rule I apply now: YAML constraints, always, systematically, enforced in CI. Complex tests, only when I have time to read them line by line before merging. 
30% coverage of well-understood tests is worth more than 80% coverage of tests nobody really knows what they\u0026rsquo;re checking.\n","date":"27 February 2026","externalUrl":null,"permalink":"/blog/dbt-tests-constraints-yml/","section":"Blog","summary":"\u003cp\u003eYou know the feeling: a report spitting out weird numbers, an analyst telling you \u0026ldquo;the totals don\u0026rsquo;t match,\u0026rdquo; and you spend your day tracing back up the chain to find where the data went wrong. Often, the problem could have been detected automatically if someone had put a test somewhere.\u003c/p\u003e","title":"dbt: Tests in YAML, or How to Stop Praying Your Data Is Correct","type":"blog"},{"content":"","date":"27 February 2026","externalUrl":null,"permalink":"/tags/testing/","section":"Tags","summary":"","title":"Testing","type":"tags"},{"content":"","date":"6 February 2026","externalUrl":null,"permalink":"/categories/ai/","section":"Categories","summary":"","title":"AI","type":"categories"},{"content":"","date":"6 February 2026","externalUrl":null,"permalink":"/tags/ai/","section":"Tags","summary":"","title":"AI","type":"tags"},{"content":"Documenter les colonnes d\u0026rsquo;une base source, c\u0026rsquo;est le genre de tâche que personne ne veut faire. T\u0026rsquo;as un système opérationnel avec des centaines de tables, des milliers de colonnes, et une documentation qui va de \u0026ldquo;inexistante\u0026rdquo; à \u0026ldquo;un commentaire de 2017 qui dit TODO: document this.\u0026rdquo;\nLe contexte # Quand tu travailles avec dbt et que tu définis tes sources, tu veux idéalement documenter chaque colonne. Pas juste son nom et son type, mais ce qu\u0026rsquo;elle représente réellement, ses particularités, ses valeurs possibles, ses relations avec d\u0026rsquo;autres tables.\nC\u0026rsquo;est la couche bronze (les données brutes telles qu\u0026rsquo;elles arrivent des systèmes sources) qui est la plus difficile à documenter. 
Contrairement aux couches silver et gold, où la transformation elle-même est une forme de documentation (le SQL dit ce que la donnée est censée être), la couche bronze hérite des conventions, des bugs et des décisions de design du système qui l\u0026rsquo;alimente. La connaissance ne vit pas dans dbt, elle vit dans la codebase applicative.\nC\u0026rsquo;est là aussi que tout repose. Si tu ne sais pas ce que signifie une colonne en bronze, tu ne peux pas documenter correctement sa transformation en silver, ni la métrique business qu\u0026rsquo;elle alimente en gold. La documentation se construit de bas en haut, et le bas, c\u0026rsquo;est le plus dur.\nLe problème, c\u0026rsquo;est que cette connaissance est souvent dispersée. Elle est dans le code applicatif qui écrit dans ces tables. Elle est dans la tête des développeurs backend. Elle est parfois dans un wiki que personne n\u0026rsquo;a mis à jour depuis 2019.\nEt personne n\u0026rsquo;a envie de passer 3 semaines à éplucher du code legacy pour comprendre ce que legacy_field_42 veut dire.\nL\u0026rsquo;idée : des agents LLM spécialisés # L\u0026rsquo;approche que j\u0026rsquo;ai expérimentée, c\u0026rsquo;est d\u0026rsquo;utiliser des agents LLM pour faire le gros du travail d\u0026rsquo;investigation. Pas un seul prompt géant qui essaie de tout comprendre d\u0026rsquo;un coup, mais une approche multi-agent où chaque agent a un rôle spécifique.\nLe principe :\nAgent explorateur : parcourt le schéma de la base source, identifie les tables et les colonnes, note les types, les FK apparentes, les patterns de nommage Agent analyste de code : prend le code applicatif qui interagit avec chaque table et analyse comment chaque colonne est utilisée : en lecture, en écriture, les validations appliquées, les transformations Agent documentaliste : synthétise les informations des deux premiers agents et produit une documentation structurée au format YAML de dbt Chaque agent travaille table par table, colonne par colonne. 
C\u0026rsquo;est méthodique et systématique.\nCe que les agents découvrent # Le plus intéressant, c\u0026rsquo;est ce que les agents trouvent que personne ne savait (ou avait oublié) :\nLes colonnes détournées. Une colonne notes qui en théorie contient du texte libre, mais qui en pratique stocke du JSON sérialisé avec une structure spécifique que le frontend parse.\nLes valeurs magiques. Un status qui vaut 0, 1, 2, 3, 4. Mais personne ne sait que 3 veut dire \u0026ldquo;en attente de validation manuelle\u0026rdquo; et 4 c\u0026rsquo;est \u0026ldquo;annulé automatiquement par le système.\u0026rdquo; L\u0026rsquo;agent qui analyse le code trouve les constantes et les conditions.\nLes contraintes implicites. Une colonne qui n\u0026rsquo;a pas de contrainte NOT NULL en base, mais que le code applicatif ne laisse jamais vide. Ou une colonne qui devrait être unique mais qui a des doublons à cause d\u0026rsquo;un bug corrigé il y a 3 ans.\nLes données sérialisées. Du JSON, du XML, des formats propriétaires dans un champ texte. L\u0026rsquo;agent identifie le format et documente la structure interne.\nLes relations non documentées. Des FK qui n\u0026rsquo;existent pas en base mais que le code utilise systématiquement. Des colonnes qui référencent d\u0026rsquo;autres tables via une convention de nommage que personne n\u0026rsquo;a formalisée.\nL\u0026rsquo;intégration dans dbt # Une fois les YAML générés et validés, ils s\u0026rsquo;intègrent directement dans le projet dbt comme définitions de sources. Avec persist_docs activé, les descriptions remontent dans Snowflake et les métadonnées de classification alimentent les politiques de gouvernance. 
Ce mécanisme est couvert en détail dans l\u0026rsquo;article sur les YAML comme gouvernance.\nCe qui compte ici : les agents transforment un exercice de documentation fastidieux en base concrète pour une gouvernance active, sans que ça soit un projet séparé.\nDu bronze aux couches supérieures # Une fois la couche bronze documentée, quelque chose change dans la façon dont on documente le reste.\nEn silver, chaque modèle dbt est une transformation explicite depuis des sources connues. Le SQL lui-même dit beaucoup : une colonne total_amount calculée par unit_price * quantity n\u0026rsquo;a pas besoin d\u0026rsquo;une longue description. Ce qui compte, c\u0026rsquo;est de documenter les décisions de nettoyage, les règles de déduplication, les cas limites. Et ça, un LLM peut l\u0026rsquo;inférer en lisant le SQL et la documentation bronze en parallèle.\nEn gold, les modèles sont souvent des agrégations business. Les colonnes correspondent à des métriques dont le sens est dans la logique métier, pas dans le code. C\u0026rsquo;est là que la documentation devient plus manuelle, mais au moins tu pars d\u0026rsquo;une base solide. Tu sais exactement ce que chaque champ upstream représente, ce qui rend la documentation des métriques dérivées beaucoup plus précise.\nL\u0026rsquo;effet de levier est réel : la couche bronze est la plus longue à documenter et la plus difficile à automatiser partiellement. Les couches supérieures bénéficient directement de ce travail de fondation. 
Chaque colonne bronze correctement décrite se propage dans le lignage et réduit le travail de documentation des couches qui en dépendent.\nC\u0026rsquo;est aussi ce qui rend la documentation bronze si rentable à faire en premier, malgré l\u0026rsquo;effort : c\u0026rsquo;est le seul endroit où la connaissance est enfouie dans une codebase externe, et donc le seul endroit où les agents LLM ont un vrai avantage sur un data engineer qui ne connaît pas ce code.\nLes limites # Soyons honnêtes sur ce qui marche moins bien :\nLe contexte métier. Un LLM peut comprendre que creation_date est une date de création. Il ne peut pas savoir que dans votre contexte, cette date a une signification contractuelle précise qui affecte d\u0026rsquo;autres calculs en aval. Le contexte métier fin, ça reste humain.\nLe code legacy illisible. Quand le code qui interagit avec une table est un fichier de 3000 lignes sans structure claire, même un LLM a du mal à en extraire une documentation cohérente.\nLa validation. Tout ce que produit un LLM doit être validé par quelqu\u0026rsquo;un qui connaît le domaine. Les agents font le gros du boulot, mais la validation, la correction et l\u0026rsquo;ajout de contexte métier restent essentiels et irremplaçables. Pis comme je le dis souvent, on est responsable de notre utilisation de l\u0026rsquo;IA, et ça inclut la validation de ce qu\u0026rsquo;elle produit.\nLe workflow complet # En pratique :\nTu donnes à tes agents le dump du schéma de la base source et le code applicatif Les agents produisent des fichiers YAML documentés, table par table Un humain review, corrige les erreurs, ajoute le contexte métier manquant Les YAML corrigés deviennent les définitions de sources dans dbt Les métadonnées sont poussées vers Snowflake via persist_docs Les classifications alimentent les politiques de gouvernance Le temps total ? Pour une base de quelques centaines de tables : quelques jours d\u0026rsquo;agents + quelques jours de review humaine. 
Sans les agents, c\u0026rsquo;est des semaines, voire des mois, de travail manuel que personne ne veut faire.\nUn dernier conseil pratique : quitte à faire ça, autant y aller franchement. Prompt de reverse engineering complet, mode deep thinking activé, review systématique table par table. Ça veut dire brûler quelques millions de tokens chez nos amis d\u0026rsquo;OpenAI, Anthropic ou Google, mais c\u0026rsquo;est un investissement ponctuel pour un actif qui dure. Lance ça de nuit. Le matin, t\u0026rsquo;as une première version de doc sur toute ta couche bronze, et tu n\u0026rsquo;as pas eu à te taper une seule ligne de legacy_field_42 à la main.\nLa leçon # La documentation de source, c\u0026rsquo;est un des meilleurs use cases pour les LLM dans le data engineering. C\u0026rsquo;est pas glamour, c\u0026rsquo;est pas du machine learning, c\u0026rsquo;est pas de la data science. C\u0026rsquo;est du travail de fond, fastidieux mais essentiel, que les LLM font bien parce que c\u0026rsquo;est systématique, que le contexte est dans le code, et que la sortie est structurée.\nEt contrairement à d\u0026rsquo;autres applications de LLM, ici la validation est simple : un data engineer ou un développeur backend peut vérifier la documentation produite en quelques minutes par table. Les erreurs sont faciles à repérer et à corriger.\nC\u0026rsquo;est pas magique. C\u0026rsquo;est juste un bon outil appliqué au bon problème.\n","date":"6 February 2026","externalUrl":null,"permalink":"/fr/blog/dbt-documenter-source-llm-multi-agent/","section":"Blog","summary":"\u003cp\u003eDocumenter les colonnes d\u0026rsquo;une base source, c\u0026rsquo;est le genre de tâche que personne ne veut faire. 
T\u0026rsquo;as un système opérationnel avec des centaines de tables, des milliers de colonnes, et une documentation qui va de \u0026ldquo;inexistante\u0026rdquo; à \u0026ldquo;un commentaire de 2017 qui dit \u003ccode\u003eTODO: document this\u003c/code\u003e.\u0026rdquo;\u003c/p\u003e","title":"Documenter une base de données source avec des LLM multi-agents","type":"blog"},{"content":"Documenting columns in a source database is the kind of task nobody wants to do. You have an operational system with hundreds of tables, thousands of columns, and documentation ranging from \u0026ldquo;nonexistent\u0026rdquo; to \u0026ldquo;a 2017 comment that says TODO: document this.\u0026rdquo;\nThe Context # When you work with dbt and define your sources, you ideally want to document every column. Not just its name and type, but what it actually represents, its quirks, its possible values, its relationships with other tables.\nThe bronze layer (raw data as it arrives from source systems) is the hardest to document. Unlike silver and gold layers, where the transformation itself is a form of documentation (the SQL says what the data is supposed to be), the bronze layer inherits the conventions, bugs and design decisions of the system that feeds it. The knowledge doesn\u0026rsquo;t live in dbt, it lives in the application codebase.\nThis is also where everything rests. If you don\u0026rsquo;t know what a bronze column means, you can\u0026rsquo;t correctly document its silver transformation, or the business metric it feeds at the gold layer. Documentation builds from the bottom up, and the bottom is the hardest part.\nThe problem is that this knowledge is often scattered. It\u0026rsquo;s in the application code that writes to these tables. It\u0026rsquo;s in the heads of backend developers. 
Sometimes it\u0026rsquo;s in a wiki nobody has updated since 2019.\nAnd nobody wants to spend 3 weeks sifting through legacy code to understand what legacy_field_42 means.\nThe Idea: Specialized LLM Agents # The approach I experimented with is using LLM agents to do the heavy lifting of investigation. Not a single giant prompt trying to understand everything at once, but a multi-agent approach where each agent has a specific role.\nThe principle:\nExplorer agent: traverses the source database schema, identifies tables and columns, notes types, apparent FKs, naming patterns Code analyst agent: takes the application code that interacts with each table and analyzes how each column is used: reads, writes, validations applied, transformations Documentarian agent: synthesizes information from the two previous agents and produces structured documentation in dbt YAML format Each agent works table by table, column by column. Methodical and systematic.\nWhat the Agents Discover # The most interesting part is what the agents find that nobody knew (or had forgotten):\nRepurposed columns. A notes column that theoretically contains free text, but in practice stores serialized JSON with a specific structure that the frontend parses.\nMagic values. A status that takes values 0, 1, 2, 3, 4. But nobody remembers that 3 means \u0026ldquo;pending manual validation\u0026rdquo; and 4 means \u0026ldquo;automatically cancelled by the system.\u0026rdquo; The agent analyzing the code finds the constants and conditions.\nImplicit constraints. A column with no NOT NULL constraint in the database, but which the application code never leaves empty. Or a column that should be unique but has duplicates due to a bug fixed 3 years ago.\nSerialized data. JSON, XML, proprietary formats inside a text field. The agent identifies the format and documents the internal structure.\nUndocumented relationships. FKs that don\u0026rsquo;t exist in the database but the code uses systematically. 
Columns referencing other tables via a naming convention nobody formalized.\nIntegration into dbt # Once the YAMLs are generated and validated, they integrate directly into the dbt project as source definitions. With persist_docs enabled, descriptions surface in Snowflake and classification metadata feeds governance policies. This mechanism is covered in detail in the article on YAMLs as governance.\nWhat matters here: the agents transform a tedious documentation exercise into a concrete foundation for active governance, without it being a separate project.\nFrom Bronze to Upper Layers # Once the bronze layer is documented, something changes in how you document the rest.\nIn silver, each dbt model is an explicit transformation from known sources. The SQL itself says a lot: a total_amount column calculated from unit_price * quantity doesn\u0026rsquo;t need a long description. What matters is documenting cleaning decisions, deduplication rules, edge cases. And an LLM can infer those by reading the SQL and the bronze documentation in parallel.\nIn gold, models are often business aggregations. Columns correspond to metrics whose meaning is in business logic, not in code. That\u0026rsquo;s where documentation becomes more manual, but at least you start from a solid foundation. You know exactly what each upstream field represents, which makes documenting derived metrics much more precise.\nThe leverage effect is real: the bronze layer takes the longest to document and is the hardest to automate, even partially. Upper layers benefit directly from this foundation work. 
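A hedged sketch of what that inheritance looks like in a dbt properties file (source, table, column names and code values are all invented for illustration): the bronze source YAML carries the knowledge the agents dug out of the application code, and the silver YAML only has to add what the transformation itself decides.

```yaml
version: 2

# Bronze: documented by the agents, reviewed by a human
sources:
  - name: erp
    tables:
      - name: orders
        columns:
          - name: status
            description: >
              Integer state code written by the app. Constants found in the
              application code: 3 = pending manual validation,
              4 = auto-cancelled by the system.

# Silver: the model's YAML documents decisions, not rediscovered meaning
models:
  - name: slv_orders
    columns:
      - name: order_status
        description: >
          status from source erp.orders decoded to labels
          ('PENDING_VALIDATION', 'AUTO_CANCELLED', ...);
          rows with unknown codes are quarantined.
```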
Every correctly described bronze column propagates through the lineage and reduces documentation work for the layers that depend on it.\nThis is also what makes bronze documentation so worthwhile to do first, despite the effort: it\u0026rsquo;s the only place where knowledge is buried in an external codebase, and therefore the only place where LLM agents have a real advantage over a data engineer who doesn\u0026rsquo;t know that code.\nThe Limits # Let\u0026rsquo;s be honest about what works less well:\nBusiness context. An LLM can understand that creation_date is a creation date. It can\u0026rsquo;t know that in your context, this date has a precise contractual meaning that affects downstream calculations. Fine-grained business context remains human.\nUnreadable legacy code. When the code interacting with a table is a 3000-line file with no clear structure, even an LLM struggles to extract coherent documentation from it.\nValidation. Everything an LLM produces must be validated by someone who knows the domain. Agents do the heavy lifting, but validation, correction and adding business context remain essential and irreplaceable. And as I often say, we\u0026rsquo;re responsible for how we use AI, and that includes validating what it produces.\nThe Complete Workflow # In practice:\nYou give your agents the source database schema dump and the application code The agents produce documented YAML files, table by table A human reviews, corrects errors, adds missing business context The corrected YAMLs become the source definitions in dbt Metadata is pushed to Snowflake via persist_docs Classifications feed governance policies Total time? For a database with a few hundred tables: a few days of agents plus a few days of human review. Without agents, that\u0026rsquo;s weeks or months of manual work that nobody wants to do.\nOne last practical tip: if you\u0026rsquo;re going to do this, go all in. 
Full reverse-engineering prompt, deep thinking mode enabled, systematic table-by-table review. This means burning a few million tokens with our friends at OpenAI, Anthropic or Google, but it\u0026rsquo;s a one-time investment for a long-lasting asset. Run it overnight. In the morning, you have a first version of documentation for your entire bronze layer, and you didn\u0026rsquo;t have to manually work through a single legacy_field_42 yourself.\nThe Lesson # Source documentation is one of the best use cases for LLMs in data engineering. It\u0026rsquo;s not glamorous, it\u0026rsquo;s not machine learning, it\u0026rsquo;s not data science. It\u0026rsquo;s foundational work, tedious but essential, that LLMs do well because it\u0026rsquo;s systematic, the context is in the code, and the output is structured.\nAnd unlike other LLM applications, here validation is straightforward: a data engineer or backend developer can verify the produced documentation in a few minutes per table. Errors are easy to spot and correct.\nIt\u0026rsquo;s not magic. It\u0026rsquo;s just a good tool applied to the right problem.\n","date":"6 February 2026","externalUrl":null,"permalink":"/blog/dbt-document-sources-llm-multi-agent/","section":"Blog","summary":"\u003cp\u003eDocumenting columns in a source database is the kind of task nobody wants to do. 
You have an operational system with hundreds of tables, thousands of columns, and documentation ranging from \u0026ldquo;nonexistent\u0026rdquo; to \u0026ldquo;a 2017 comment that says \u003ccode\u003eTODO: document this\u003c/code\u003e.\u0026rdquo;\u003c/p\u003e","title":"Documenting a Source Database with Multi-Agent LLMs","type":"blog"},{"content":"","date":"6 February 2026","externalUrl":null,"permalink":"/tags/llm/","section":"Tags","summary":"","title":"LLM","type":"tags"},{"content":"","date":"16 January 2026","externalUrl":null,"permalink":"/categories/data-governance/","section":"Categories","summary":"","title":"Data Governance","type":"categories"},{"content":"","date":"16 January 2026","externalUrl":null,"permalink":"/tags/data-governance/","section":"Tags","summary":"","title":"Data Governance","type":"tags"},{"content":"La documentation, c\u0026rsquo;est le truc que personne ne veut faire. Surtout en data. T\u0026rsquo;as des centaines de colonnes dans des dizaines de tables, et quelqu\u0026rsquo;un te demande \u0026ldquo;c\u0026rsquo;est quoi le champ status dans la table orders ?\u0026rdquo; Et la réponse honnête, c\u0026rsquo;est souvent \u0026ldquo;euh\u0026hellip; un enum je pense qui veut probablement dire X.\u0026rdquo;\nLe problème de la documentation data # Dans un projet dbt classique, la documentation est optionnelle. Tu peux écrire tes modèles SQL, les déployer, et ne jamais documenter une seule colonne. dbt ne t\u0026rsquo;oblige à rien.\nLe résultat, c\u0026rsquo;est prévisible : des schémas Snowflake avec des centaines de colonnes dont personne ne connaît la signification exacte. Des noms de colonnes hérités d\u0026rsquo;un système source vieux de 10 ans. Des colonnes qui s\u0026rsquo;appellent type ou status sans aucune indication de ce que ça veut dire.\nEt quand un analyste marketing veut comprendre les données, il doit soit déranger quelqu\u0026rsquo;un qui a la connaissance tribale, soit deviner. 
Les deux sont problématiques.\nLes YAML de dbt : plus qu\u0026rsquo;une formalité # dbt a un mécanisme de documentation intégré : les fichiers schema.yml (ou peu importe comment tu les nommes). Tu peux y décrire chaque modèle, chaque colonne, avec du texte libre. La plupart des équipes s\u0026rsquo;en servent peu ou pas.\nMais si tu prends le temps de bien structurer ces fichiers, ils deviennent bien plus qu\u0026rsquo;une documentation passive. Ils deviennent la source de vérité pour la gouvernance de tes données.\nL\u0026rsquo;idée, c\u0026rsquo;est d\u0026rsquo;utiliser le champ meta de chaque colonne pour stocker des métadonnées structurées :\nClassification : est-ce que cette colonne contient de l\u0026rsquo;information personnelle (PII), financière, confidentielle ? Catégorie sémantique : est-ce un email, un montant, une adresse, un identifiant ? Sensibilité : haute, moyenne, basse ? Obligations réglementaires : LPRPDE, GDPR, données consommateurs, e-commerce ? Rétention : combien de temps garder ces données ? Quand chaque colonne a ses métadonnées, tu passes d\u0026rsquo;une documentation passive à une gouvernance active.\npersist_docs : du YAML à Snowflake # Le truc qui fait la différence, c\u0026rsquo;est persist_docs. C\u0026rsquo;est une option dbt qui prend tes descriptions YAML et les pousse comme commentaires sur les objets Snowflake. Quand tu actives ça :\nmodels: mon_projet: +persist_docs: relation: true columns: true Chaque description de modèle et de colonne dans tes YAML devient un commentaire visible dans Snowflake. Pas besoin d\u0026rsquo;un outil externe. Quelqu\u0026rsquo;un qui navigue les données dans Snowsight voit directement les descriptions que t\u0026rsquo;as écrites dans dbt.\nEt si tu utilises Snowflake Horizon (leur plateforme de gouvernance), ces descriptions alimentent directement le catalogue de données. Ta documentation dbt EST ta documentation Snowflake. 
Une seule source de vérité.\nForcer la documentation en CI # Mon point de vue là-dessus est assez tranché : la documentation, c\u0026rsquo;est aussi obligatoire que le code lui-même. Pas un nice-to-have, pas quelque chose qu\u0026rsquo;on fera \u0026ldquo;quand on aura le temps\u0026rdquo;. Si tu déploies du code non testé en production, c\u0026rsquo;est un problème. Déployer un modèle non documenté devrait l\u0026rsquo;être tout autant.\nEt si tu automatises tout le reste (déploiements, tests, validations), pourquoi la documentation échapperait-elle à cette logique ? La CI est la réponse évidente. C\u0026rsquo;est le premier check que j\u0026rsquo;y mets : avant même de valider la logique des transformations, on vérifie que la documentation est là.\nConcrètement, ça donne une étape qui valide la complétude :\nChaque modèle Silver et Gold doit avoir une description Chaque colonne dans le SQL doit être présente dans le YAML Chaque colonne documentée doit avoir une description non vide Si une PR ajoute un modèle sans documentation, la CI fail. Point final. C\u0026rsquo;est la seule façon de maintenir la discipline sur le long terme. Le package dbt-meta-testing fait exactement ça : il expose des macros required_docs et required_tests que tu branches dans ta pipeline.\nLes tags de gouvernance : du YAML au masquage # Là où ça devient puissant, c\u0026rsquo;est quand tu combines les métadonnées YAML avec les fonctionnalités de gouvernance de Snowflake.\nTu déclares dans ton YAML qu\u0026rsquo;une colonne contient du PII. dbt peut appliquer un tag Snowflake correspondant quand il crée le modèle, via un post-hook qu\u0026rsquo;on écrit nous-mêmes, c\u0026rsquo;est pas du built-in. Et Snowflake, grâce à des politiques de masquage liées aux tags, masque automatiquement la valeur pour les utilisateurs qui n\u0026rsquo;ont pas le bon rôle.\nLe résultat : tu documentes tes colonnes dans le YAML, et le masquage de données se fait tout seul. 
Pas de logique de masquage dans le SQL, pas de vues spéciales, pas de maintenance. La gouvernance découle directement de la documentation.\nLes tests : la documentation qui se vérifie # Les YAML de dbt ne servent pas qu\u0026rsquo;à la documentation, ils servent aussi aux tests. Et c\u0026rsquo;est là que ça boucle : ta documentation devient vérifiable.\nTu documentes qu\u0026rsquo;un order_id est une clé primaire ? Mets un test unique et not_null. Tu documentes qu\u0026rsquo;un statut ne peut avoir que certaines valeurs ? Mets un test accepted_values. Tu documentes qu\u0026rsquo;une colonne référence une autre table ? Mets un test de relation.\nLes tests sont déclarés au même endroit que la documentation. Un seul fichier YAML qui dit : \u0026ldquo;cette colonne s\u0026rsquo;appelle X, elle contient Y, elle ne peut pas être nulle, et ses valeurs possibles sont Z.\u0026rdquo; La documentation et les tests sont la même chose. On revient là-dessus en détail dans l\u0026rsquo;article sur les tests dbt.\nQuand les tests passent, ta documentation est prouvée correcte. Quand un test fail, ta documentation ou tes données sont fausses. Dans les deux cas, tu dois investiguer.\nLes contacts de gouvernance # Un aspect souvent négligé : qui est responsable de quoi ? Dans les métadonnées de chaque modèle, tu peux déclarer un propriétaire, un steward, un approbateur. Ces métadonnées sont poussées vers Snowflake via un post-hook custom (encore une fois, c\u0026rsquo;est du bricolage maison, pas du dbt natif).\nQuand quelqu\u0026rsquo;un trouve un problème de données, il sait exactement qui contacter. Pas besoin de chercher dans un wiki ou de demander à la cantonade. L\u0026rsquo;information est attachée directement aux données elles-mêmes.\nLe coût réel : c\u0026rsquo;est moins que tu penses # Le reproche classique : \u0026ldquo;ça prend du temps de documenter chaque colonne.\u0026rdquo; C\u0026rsquo;est vrai. 
Mais compare avec l\u0026rsquo;alternative :\nDes heures perdues par les analystes à deviner ce que les colonnes veulent dire Des erreurs dans les rapports parce que quelqu\u0026rsquo;un a interprété un champ de travers Des audits qui prennent des semaines parce que personne sait quelles données sont sensibles Des incidents de sécurité parce qu\u0026rsquo;une colonne PII n\u0026rsquo;était pas identifiée comme telle Documenter une colonne prend 30 secondes. Ne pas la documenter peut coûter des heures, des jours, ou pire. Et avec les LLMs, le coût de départ a encore baissé : documenter une base source entière en quelques jours n\u0026rsquo;est plus une utopie.\nLe bonus : la documentation comme clé du talk to my data # La documentation sert deux audiences : tes collègues humains, et les LLMs. Et c\u0026rsquo;est là que ça devient intéressant.\nSnowflake Cortex Analyst est la réponse de Snowflake au \u0026ldquo;talk to my data\u0026rdquo; : poser une question en langage naturel et obtenir la requête SQL correcte en retour. Demander une KPI précise, la filtrer selon des critères, la comparer avec une autre métrique, sans écrire une ligne de SQL. Ça paraît magique. Et les gens pensent spontanément que c\u0026rsquo;est des mois de travail d\u0026rsquo;ingénierie pour y arriver.\nCe n\u0026rsquo;est pas le cas, si la documentation est propre.\nCortex Analyst fonctionne à partir d\u0026rsquo;un modèle sémantique : un fichier YAML qui décrit les tables, les colonnes, les métriques, les relations entre entités, et le vocabulaire métier associé. La structure de ce fichier est très proche de ce que dbt produit déjà dans ses fichiers de documentation. L\u0026rsquo;infrastructure existe. Les modèles existent. 
Si les descriptions de colonnes sont là, si les relations sont documentées, si les métriques clés sont définies, le gap pour Cortex Analyst est faible.\nSnowflake Labs a même publié dbt_semantic_view, un package qui génère des Semantic Views Snowflake directement depuis le modèle sémantique dbt. Ces Semantic Views sont nativement exploitables par Cortex Analyst. Le pipeline devient : dbt documente → dbt_semantic_view publie les vues sémantiques → Cortex Analyst répond aux questions en langage naturel.\nUn projet dbt bien documenté est à 2-4 semaines d\u0026rsquo;un talk to my data fonctionnel, pas à 6 mois. Le travail restant (formaliser quelques métriques, distinguer dimensions et mesures, ajouter des synonymes métier) est marginal comparé à ce qui est déjà en place. C\u0026rsquo;est un quick win qui ne se débloque qu\u0026rsquo;avec une condition : avoir traité la documentation comme une contrainte dès le début, pas comme une tâche à faire plus tard.\nCe que j\u0026rsquo;en retiens # Au final, ce qui a marché pour moi, c\u0026rsquo;est de traiter la documentation des données avec le même sérieux que le code de transformation. Pas comme un nice-to-have qu\u0026rsquo;on fera \u0026ldquo;quand on aura le temps.\u0026rdquo; Comme une partie intégrante du pipeline, vérifiée en CI, propagée automatiquement.\nLes YAML de dbt sont l\u0026rsquo;endroit idéal pour ça. Un seul fichier qui sert à la documentation humaine, aux tests automatisés, et à la gouvernance de données. Ça sonne fancy dit comme ça, mais en pratique c\u0026rsquo;est juste des fichiers YAML qui font leur job.\n","date":"16 January 2026","externalUrl":null,"permalink":"/fr/blog/dbt-documentation-gouvernance-yml/","section":"Blog","summary":"\u003cp\u003eLa documentation, c\u0026rsquo;est le truc que personne ne veut faire. Surtout en data. 
T\u0026rsquo;as des centaines de colonnes dans des dizaines de tables, et quelqu\u0026rsquo;un te demande \u0026ldquo;c\u0026rsquo;est quoi le champ \u003ccode\u003estatus\u003c/code\u003e dans la table \u003ccode\u003eorders\u003c/code\u003e ?\u0026rdquo; Et la réponse honnête, c\u0026rsquo;est souvent \u0026ldquo;euh\u0026hellip; un enum je pense qui veut probablement dire X.\u0026rdquo;\u003c/p\u003e","title":"dbt : Quand tes fichiers YAML deviennent ta gouvernance de données","type":"blog"},{"content":"Documentation is the thing nobody wants to do. Especially in data. You have hundreds of columns across dozens of tables, and someone asks \u0026ldquo;what\u0026rsquo;s the status field in the orders table?\u0026rdquo; And the honest answer is often \u0026ldquo;uh\u0026hellip; an enum I think that probably means X.\u0026rdquo;\nThe Data Documentation Problem # In a typical dbt project, documentation is optional. You can write your SQL models, deploy them, and never document a single column. dbt doesn\u0026rsquo;t force you to do anything.\nThe result is predictable: Snowflake schemas with hundreds of columns that nobody knows the exact meaning of. Column names inherited from a 10-year-old source system. Columns called type or status with no indication of what they mean.\nAnd when a marketing analyst wants to understand the data, they either have to bother someone who has the tribal knowledge, or guess. Both are problematic.\ndbt YAMLs: More Than a Formality # dbt has a built-in documentation mechanism: schema.yml files (or whatever you call them). You can describe each model, each column, with free text. Most teams use them little or not at all.\nBut if you take the time to structure these files properly, they become much more than passive documentation. 
They become the source of truth for your data governance.\nThe idea is to use the meta field of each column to store structured metadata:\nClassification: does this column contain personal (PII), financial, or confidential information? Semantic category: is it an email, an amount, an address, an identifier? Sensitivity: high, medium, low? Regulatory obligations: PIPEDA, GDPR, consumer data, e-commerce? Retention: how long to keep this data? When every column has its metadata, you move from passive documentation to active governance.\npersist_docs: From YAML to Snowflake # The thing that makes the difference is persist_docs. It\u0026rsquo;s a dbt option that takes your YAML descriptions and pushes them as comments on Snowflake objects. When you enable it:\nmodels: my_project: +persist_docs: relation: true columns: true Every model and column description in your YAMLs becomes a comment visible in Snowflake. No external tool needed. Someone browsing data in Snowsight sees the descriptions you wrote in dbt directly.\nAnd if you use Snowflake Horizon (their governance platform), these descriptions feed directly into the data catalog. Your dbt documentation IS your Snowflake documentation. A single source of truth.\nEnforcing Documentation in CI # My view on this is pretty firm: documentation is as mandatory as the code itself. Not a nice-to-have, not something we\u0026rsquo;ll do \u0026ldquo;when we have time.\u0026rdquo; If you deploy untested code to production, that\u0026rsquo;s a problem. Deploying an undocumented model should be just as much of one.\nAnd if you automate everything else (deployments, tests, validations), why would documentation escape this logic? CI is the obvious answer. 
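One way to wire that CI gate, sketched in Python. The inputs are simplified stand-ins: the model's selected columns would come from the dbt manifest, and the schema.yml would be parsed into a dict (e.g. with PyYAML). This is the idea behind packages like dbt-meta-testing, not their actual implementation.

```python
# Minimal sketch of a documentation-completeness gate for CI.
# `sql_columns` and `yaml_model` are illustrative stand-ins for data that
# would really come from the dbt manifest and a parsed schema.yml.

def doc_violations(model_name, sql_columns, yaml_model):
    """Return human-readable violations; an empty list means the check passes."""
    violations = []
    if not yaml_model.get("description", "").strip():
        violations.append(f"{model_name}: missing model description")
    documented = {c["name"]: c.get("description", "") for c in yaml_model.get("columns", [])}
    for col in sql_columns:
        if col not in documented:
            violations.append(f"{model_name}.{col}: column absent from YAML")
        elif not documented[col].strip():
            violations.append(f"{model_name}.{col}: empty description")
    return violations

yaml_model = {
    "description": "Orders enriched with customer data",
    "columns": [
        {"name": "order_id", "description": "Primary key"},
        {"name": "status", "description": ""},  # documented but empty
    ],
}
problems = doc_violations("orders_enriched",
                          ["order_id", "status", "customer_email"], yaml_model)
# Two violations: empty description on status, customer_email missing entirely.
assert len(problems) == 2
```

In CI, a non-empty list fails the job, which is all the enforcement needed.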
It\u0026rsquo;s the first check I put there: before even validating the transformation logic, we verify that the documentation exists.\nIn practice, this means a step that validates completeness:\nEvery Silver and Gold model must have a description Every column in the SQL must be present in the YAML Every documented column must have a non-empty description If a PR adds a model without documentation, CI fails. Full stop. It\u0026rsquo;s the only way to maintain discipline over the long term. The dbt-meta-testing package does exactly this: it exposes required_docs and required_tests macros that you wire into your pipeline.\nGovernance Tags: From YAML to Masking # This is where it gets powerful: combining YAML metadata with Snowflake\u0026rsquo;s governance features.\nYou declare in your YAML that a column contains PII. dbt can apply a corresponding Snowflake tag when it creates the model, via a post-hook you write yourself (not built-in, this is custom work). And Snowflake, through masking policies tied to tags, automatically masks the value for users who don\u0026rsquo;t have the right role.\nThe result: you document your columns in YAML, and data masking happens on its own. No masking logic in SQL, no special views, no maintenance. Governance flows directly from documentation.\nTests: The Documentation That Verifies Itself # dbt YAMLs aren\u0026rsquo;t just for documentation, they\u0026rsquo;re for tests too. And that\u0026rsquo;s where it loops back: your documentation becomes verifiable.\nYou document that an order_id is a primary key? Add a unique and not_null test. You document that a status can only have certain values? Add an accepted_values test. You document that a column references another table? Add a relationship test.\nTests are declared in the same place as documentation. 
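To see why documentation and tests collapse into one thing, take accepted_values: once compiled, the test is nothing but a predicate over the documented set. dbt renders it as SQL against the table; the same predicate in Python, over illustrative in-memory rows:

```python
# What an accepted_values test boils down to: rows whose value falls outside
# the documented set. The rows and the accepted set are illustrative data.
rows = [
    {"order_id": 1, "status": "shipped"},
    {"order_id": 2, "status": "pending"},
    {"order_id": 3, "status": "shiped"},  # typo sneaks in from the source system
]
ACCEPTED = {"pending", "shipped", "delivered", "cancelled"}

failures = [r for r in rows if r["status"] not in ACCEPTED]
# The test fails as soon as any row escapes the documented values.
assert [r["order_id"] for r in failures] == [3]
```

unique and not_null reduce to equally small predicates; declaring them in the YAML is what keeps the documentation honest.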
A single YAML file that says: \u0026ldquo;this column is called X, it contains Y, it can\u0026rsquo;t be null, and its possible values are Z.\u0026rdquo; Documentation and tests are the same thing. We go into this in detail in the article on dbt tests.\nWhen tests pass, your documentation is proven correct. When a test fails, your documentation or your data is wrong. Either way, you need to investigate.\nGovernance Contacts # An often overlooked aspect: who is responsible for what? In each model\u0026rsquo;s metadata, you can declare an owner, a steward, an approver. These metadata are pushed to Snowflake via a custom post-hook (again, this is handcrafted, not native dbt).\nWhen someone finds a data problem, they know exactly who to contact. No need to search a wiki or ask around. The information is attached directly to the data itself.\nThe Real Cost: Less Than You Think # The classic objection: \u0026ldquo;it takes time to document every column.\u0026rdquo; True. But compare it to the alternative:\nHours lost by analysts guessing what columns mean Errors in reports because someone misinterpreted a field Audits that take weeks because nobody knows which data is sensitive Security incidents because a PII column wasn\u0026rsquo;t identified as such Documenting a column takes 30 seconds. Not documenting it can cost hours, days, or worse. And with LLMs, the startup cost has dropped further: documenting an entire source database in a few days is no longer a pipe dream.\nThe Bonus: Documentation as the Key to Talking to Your Data # Documentation serves two audiences: your human colleagues, and LLMs. And that\u0026rsquo;s where it gets interesting.\nSnowflake Cortex Analyst is Snowflake\u0026rsquo;s answer to \u0026ldquo;talk to my data\u0026rdquo;: ask a question in natural language and get back the correct SQL query. Ask for a specific KPI, filter by criteria, compare with another metric, without writing a line of SQL. It sounds like magic. 
And people instinctively think it takes months of engineering work to get there.\nIt doesn\u0026rsquo;t, if the documentation is clean.\nCortex Analyst works from a semantic model: a YAML file describing tables, columns, metrics, relationships between entities, and associated business vocabulary. The structure of this file is very close to what dbt already produces in its documentation files. The infrastructure exists. The models exist. If column descriptions are there, if relationships are documented, if key metrics are defined, the gap to Cortex Analyst is small.\nSnowflake Labs has even published dbt_semantic_view, a package that generates Snowflake Semantic Views directly from the dbt semantic model. These Semantic Views are natively usable by Cortex Analyst. The pipeline becomes: dbt documents, dbt_semantic_view publishes the semantic views, Cortex Analyst answers questions in natural language.\nA well-documented dbt project is 2 to 4 weeks away from a working talk-to-your-data experience, not 6 months. The remaining work (formalizing a few metrics, distinguishing dimensions from measures, adding business synonyms) is marginal compared to what\u0026rsquo;s already in place. It\u0026rsquo;s a quick win that only unlocks under one condition: treating documentation as a constraint from the start, not a task to do later.\nWhat I Take Away # In the end, what worked for me is treating data documentation with the same seriousness as transformation code. Not as a nice-to-have we\u0026rsquo;ll do \u0026ldquo;when we have time.\u0026rdquo; As an integral part of the pipeline, verified in CI, propagated automatically.\ndbt YAMLs are the ideal place for this. A single file serving human documentation, automated tests, and data governance. 
That sounds fancy when you say it that way, but in practice it\u0026rsquo;s just YAML files doing their job.\n","date":"16 January 2026","externalUrl":null,"permalink":"/blog/dbt-documentation-governance-yml/","section":"Blog","summary":"\u003cp\u003eDocumentation is the thing nobody wants to do. Especially in data. You have hundreds of columns across dozens of tables, and someone asks \u0026ldquo;what\u0026rsquo;s the \u003ccode\u003estatus\u003c/code\u003e field in the \u003ccode\u003eorders\u003c/code\u003e table?\u0026rdquo; And the honest answer is often \u0026ldquo;uh\u0026hellip; an enum I think that probably means X.\u0026rdquo;\u003c/p\u003e","title":"dbt: When Your YAML Files Become Your Data Governance","type":"blog"},{"content":"","date":"16 January 2026","externalUrl":null,"permalink":"/tags/snowflake/","section":"Tags","summary":"","title":"Snowflake","type":"tags"},{"content":"Snowflake is fundamentally SQL-first. That\u0026rsquo;s its strength: everything is driven by SQL, from grants to object creation to transformations. Infrastructure, we\u0026rsquo;ve seen how to tame it with Terraform in the previous article. But data transformations fall into a blind spot. SQL scripts scattered everywhere, no tests, no serious versioning, one colleague who knows what order to run things in.\nThat\u0026rsquo;s where dbt comes in. And it\u0026rsquo;s no coincidence that both tools work so naturally together: Snowflake abstracts physical infrastructure, dbt abstracts transformation orchestration. Both are declarative, SQL-first, designed so that code is the source of truth. 
dbt isn\u0026rsquo;t a generic data transformation framework, it\u0026rsquo;s the tool that picks up where Snowflake stops: the organization, tests, documentation and reproducible deployment of all that SQL.\nThe Problem with SQL Transformation Scripts # Before dbt, the classic data pipeline looked like this: numbered SQL scripts, some scheduler, and a README explaining what order to run things. Or worse, stored procedures in Snowflake with business logic buried inside.\nThe problem is the same as with infrastructure SQL scripts: DDLs accumulate. You have create_orders_table.sql, then add_status_column.sql, then alter_orders_add_shipping_address.sql. After a while, nobody knows what the table is supposed to look like. You have to run all the scripts in order to reconstruct the real state, and you hope none of them are missing.\nAnd when you want to add a column to a 5-stage pipeline (staging, enrichment, aggregation, export, report), you have to open 5 files, add the column in each one, then coordinate the deployment in the right order in production. Without a framework, you manage this by hand. And you almost always forget a model somewhere.\nAs for tests: writing a non-null test on a column without a framework means writing a SELECT COUNT(*) WHERE col IS NULL and checking the result is 0. Doable once. Painful to maintain across 200 columns.\nWhat dbt Changes Fundamentally # dbt (data build tool) takes a simple idea: data transformations are code. And code is managed with the same practices as everything else: version control, tests, documentation, CI/CD.\nIn practice, you write models: .sql files that each define a table or view. A model references other models with {{ ref(\u0026#39;model_name\u0026#39;) }} syntax. 
dbt automatically resolves dependencies and builds a DAG (directed acyclic graph) of your transformations.\n-- models/silver/orders_enriched.sql select o.order_id, o.created_at, o.status, c.email as customer_email, p.name as product_name from {{ ref(\u0026#39;stg_orders\u0026#39;) }} o left join {{ ref(\u0026#39;stg_customers\u0026#39;) }} c on o.customer_id = c.customer_id left join {{ ref(\u0026#39;stg_products\u0026#39;) }} p on o.product_id = p.product_id dbt knows that orders_enriched depends on stg_orders, stg_customers and stg_products. It builds them in the right order, automatically. When you add a column to stg_orders, it\u0026rsquo;s available in all downstream models without you having to coordinate anything.\nThe Terraform Analogy # The parallel with Terraform isn\u0026rsquo;t superficial. Both tools share the same fundamental philosophy:\nDesired state, not steps. You don\u0026rsquo;t say \u0026ldquo;run this transformation after that one.\u0026rdquo; You describe what each table should contain, and dbt figures out how to get there.\nDeclarative. Your code describes the result, not the process. SELECT ... FROM ref('source') says \u0026ldquo;this table contains these columns calculated from this source,\u0026rdquo; not \u0026ldquo;take this table, join it with that, filter this.\u0026rdquo;\nReproducible. dbt build rebuilds everything from scratch, in the right order, every time. Like terraform apply rebuilds your infrastructure from HCL files.\nVersioned. Every change to a transformation is a commit. You can see who changed what, why, and roll back.\nThe analogy has one important limit though: dbt has no remote state. If someone deletes a table in Snowflake manually, dbt doesn\u0026rsquo;t know about it. The next dbt run will simply recreate it without flagging the drift. Terraform, on the other hand, would detect the gap between the real state and the desired state and propose correcting it. 
This isn\u0026rsquo;t a critical flaw, but worth keeping in mind: dbt is declarative about what it creates, not about what exists.\nExtensibility: What More Than Makes Up for It # That lack of remote state is one of the few areas where Terraform does better. But dbt compensates with something different: remarkable extensibility.\nThe dbt Hub aggregates hundreds of community packages. Additional tests, utility macros, integrations with specific sources. dbt-utils is probably in every serious dbt project: it adds dozens of macros and tests you wouldn\u0026rsquo;t want to write yourself.\nBut the real power is the Jinja macro system. Everything dbt does internally, you can do too. Which means features reserved for dbt Cloud are often reproducible in dbt Core with a few macros.\nBreaking change detection in schemas? We cover that in a later article, and it\u0026rsquo;s exactly this kind of feature you can implement yourself. Alerting on source freshness? Macros. Automatic documentation generation? Macros. dbt Core isn\u0026rsquo;t a stripped-down version of dbt Cloud, it\u0026rsquo;s a foundation on which you build what you need.\ndbt as Infrastructure, Not Just an ETL Runner # Many teams use dbt as an ETL runner: they materialize tables and run dbt run every hour in Airflow or Prefect. That\u0026rsquo;s a valid use case. But it\u0026rsquo;s far from exhausting what dbt can do.\nSnowflake has a concept that changes the equation: Dynamic Tables. Instead of materializing a table and refreshing it manually, you declare a Dynamic Table with a target lag (\u0026ldquo;this table must be no more than 1 hour stale\u0026rdquo;). Snowflake manages the refresh automatically, propagating changes through the DAG.\nCombined with dbt, this means you no longer need an external orchestrator to manage refreshes. You declare your models as Dynamic Tables in dbt, deploy, and Snowflake handles the rest. 
dbt becomes what it truly is: a tool for declaring data infrastructure, not a scheduler.\n# dbt_project.yml models: my_project: silver: +materialized: dynamic_table +target_lag: \u0026#39;1 hour\u0026#39; +snowflake_warehouse: TRANSFORMING_L Targets: One Environment per Context # One of dbt\u0026rsquo;s most practical concepts is the targets system. A target is a deployment configuration: which database, which schema, which warehouse to use.\nIn practice, each developer has a default dev target pointing to their own isolated database, with a small warehouse to avoid burning credits unnecessarily:\n# profiles.yml my_project: target: dev outputs: dev: database: DEV_JEAN schema: silver warehouse: DEV_XS prod: database: PROD schema: silver warehouse: TRANSFORMING_L When Jean runs dbt run, he deploys to DEV_JEAN.silver. He can\u0026rsquo;t accidentally overwrite production. Deploying to prod requires being explicit: dbt run --target prod. It\u0026rsquo;s a deliberate opt-in, not the default behavior.\nWhat\u0026rsquo;s elegant is that dbt lets you override the database, schema, and even the warehouse at multiple levels: project-wide, folder-level, individual model. You can have a specific model that always runs on a dedicated warehouse, regardless of the target. Configuration composes.\nPrecise Selection in the DAG # Once you have a DAG with 200 models, you don\u0026rsquo;t want to systematically rebuild everything. dbt has an expressive selection system for targeting exactly what you need.\nThe + syntax controls direction in the DAG:\n# Just the model dbt run --select orders_enriched # The model and everything upstream (its dependencies) dbt run --select +orders_enriched # The model and everything downstream (what depends on it) dbt run --select orders_enriched+ # The model, its dependencies AND its dependents dbt run --select +orders_enriched+ You can also select by tag. 
If you tag your retail models:\n-- models/silver/orders_enriched.sql {{ config(tags=[\u0026#39;retail\u0026#39;, \u0026#39;orders\u0026#39;]) }} You can then target the entire retail domain in one command:\ndbt run --select tag:retail Combined with state:modified, this gives very precise control over what runs in CI: only the modified models and their direct dependents.\nSources and Exposures: The Two Ends of the DAG # dbt manages not only intermediate transformations but also both ends of the pipeline.\nSources are tables you don\u0026rsquo;t control: your raw data arriving via Fivetran, Airbyte, or another replication tool. You declare them in YAML:\nsources: - name: shopify database: RAW schema: shopify tables: - name: orders - name: customers - name: products This lets you reference them with {{ source('shopify', 'orders') }} in your models, associate tests with them, and monitor their freshness. If Shopify data hasn\u0026rsquo;t been updated in 6 hours, dbt can alert you before your reports go stale.\nThese source descriptions also integrate with Snowflake Horizon: the metadata you declare in dbt surfaces in the Snowflake data catalog, visible to everyone with access to the instance.\nExposures are the other end: consumers of your data that dbt doesn\u0026rsquo;t manage. A Tableau dashboard, an API, a file exported to a partner. You declare them too:\nexposures: - name: tableau_revenue_dashboard type: dashboard owner: name: BI Team depends_on: - ref(\u0026#39;daily_revenue\u0026#39;) - ref(\u0026#39;top_products\u0026#39;) This completes the lineage: you can see in the DAG not only how data transforms, but also where it ends up. When you modify daily_revenue, dbt knows the Tableau dashboard depends on it and can alert you.\nComplementary Assets: Seeds and UDFs # Two elements worth mentioning to complete the picture.\nSeeds are CSV files that dbt manages as tables. Reference tables, mappings, static configurations. 
Instead of having a CSV file somewhere on a server or in a Google Sheet, you version it in the dbt repo and dbt creates the corresponding table in Snowflake.\nseeds/ product_categories.csv country_codes.csv shipping_carriers.csv Snowflake UDFs (User Defined Functions) aren\u0026rsquo;t natively managed by dbt, but you can deploy them via pre-hooks or macros. This is the limit of the infrastructure-as-code metaphor: certain Snowflake objects remain in Terraform territory.\nThe Manifest: The Compiled Plan of Your Stack # When you run dbt parse, dbt compiles your entire project and produces a manifest.json. This is the central artifact: a complete, machine-readable representation of everything your project knows how to do.\nThe manifest contains models, their declared columns, their dependencies, tests, sources, exposures. Everything.\nIt\u0026rsquo;s the equivalent of the Terraform state for your data. And like the Terraform state, it can be used for comparisons: what changed between the previous version and the current version? That\u0026rsquo;s exactly what we\u0026rsquo;ll leverage in a later article to detect breaking changes before they reach production.\nWhy This Changes Team Dynamics # Before dbt, transformations were often owned by one person. Someone knew what order to run the scripts, which tables depended on what, where the deduplication logic lived. The knowledge was tribal.\nWith dbt, any data engineer can open the repo, see the DAG, understand the dependencies, modify a model, and deploy to their dev environment without risk. The knowledge is in the code.\nIt\u0026rsquo;s the same gain we had with Terraform for infrastructure. We go from \u0026ldquo;ask the person who knows\u0026rdquo; to \u0026ldquo;read the code.\u0026rdquo;\nWhy dbt and Not SQLMesh? # When I started evaluating tools to manage my transformations declaratively, SQLMesh was on my list. 
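A minimal sketch of that manifest comparison. A real manifest.json nests models under a nodes key, with keys like model.my_project.x and richer column metadata; the shapes below are deliberately simplified to model → set of columns.

```python
# Sketch of breaking-change detection by diffing two manifest snapshots.
# The manifest shapes are simplified stand-ins for the real JSON structure.

def breaking_changes(old, new):
    """Models or columns that existed before and are gone now: downstream risk."""
    changes = []
    for model, old_cols in old.items():
        if model not in new:
            changes.append(f"model removed: {model}")
        else:
            for col in old_cols - new[model]:
                changes.append(f"column removed: {model}.{col}")
    return changes

old = {"orders_enriched": {"order_id", "status", "customer_email"}}
new = {"orders_enriched": {"order_id", "status"}}

# Renaming or dropping customer_email is exactly the silent break
# that would have hit the gold dashboard three days later.
assert breaking_changes(old, new) == ["column removed: orders_enriched.customer_email"]
```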
It\u0026rsquo;s a serious tool: open source, Apache 2.0, backwards-compatible with existing dbt projects, with interesting concepts like automatic breaking change detection and native incremental evaluation.\nOn paper, SQLMesh is technically more rigorous than dbt on several points. But I chose dbt for one simple reason: network effects.\ndbt is by far the most widely used tool in the data ecosystem. Which means:\nHundreds of packages on the dbt Hub: tests, macros, utilities already written An active community with answers to almost every problem you\u0026rsquo;ll encounter Native integrations in all stack tools: Fivetran, Hightouch, Metabase, Tableau, and essentially all data catalogs Data engineers who already know dbt when they join a team And since 2024, Snowflake has formalized this partnership with an even deeper integration: the dbt Snowflake Native App. You can now orchestrate and monitor your dbt jobs directly from Snowflake, without external infrastructure. dbt runs inside your Snowflake instance, not alongside it.\nSQLMesh would probably have done the job just as well. But when you choose infrastructure, you also choose an ecosystem. And the dbt ecosystem is unbeatable right now.\ndbt + Snowflake: The Duo for a Dev-Minded Data Engineer # What makes this combination particularly solid is that both tools share the same philosophy and complement each other without overlapping.\nTerraform declares infrastructure: databases, schemas, roles, permissions, masking policies. dbt declares transformations: models, tests, documentation, lineage. Both are code, versioned in git, deployed via CI pipelines, with environments isolated per developer.\nBut beyond the shared philosophy, the integration is concrete. Snowflake Dynamic Tables materialize natively in dbt. Column descriptions in your YAML surface in Snowflake Horizon via persist_docs. 
And since 2024, dbt can run directly in your Snowflake instance via the dbt Snowflake Native App, without external infrastructure to manage.\nThe complete stack looks like this: Terraform manages what\u0026rsquo;s above the data, dbt manages what\u0026rsquo;s inside it, and Snowflake is the foundation both rely on. Each layer has its responsibility, each layer is code. For someone coming from software development, this is exactly what a data stack should look like. Not SQL scripts in a shared folder, not tribal knowledge, not manual deployment on a Friday night.\nThe rigor we apply to application code has every reason to apply to data too. dbt is the tool that makes this possible, and Snowflake is the platform where that rigor makes complete sense.\n","date":"26 December 2025","externalUrl":null,"permalink":"/blog/dbt-data-infrastructure-as-code/","section":"Blog","summary":"\u003cp\u003eSnowflake is fundamentally SQL-first. That\u0026rsquo;s its strength: everything is driven by SQL, from grants to object creation to transformations. Infrastructure, we\u0026rsquo;ve seen how to tame it with Terraform in \u003ca\n  href=\"https://damiengoehrig.ca/blog/snowflake-terraform-infrastructure-as-code/\"\u003ethe previous article\u003c/a\u003e. But data transformations fall into a blind spot. 
SQL scripts scattered everywhere, no tests, no serious versioning, one colleague who knows what order to run things in.\u003c/p\u003e","title":"dbt: Treating Your Data Transformations Like Infrastructure","type":"blog"},{"content":"","date":"26 December 2025","externalUrl":null,"permalink":"/tags/sql/","section":"Tags","summary":"","title":"SQL","type":"tags"},{"content":"","date":"5 December 2025","externalUrl":null,"permalink":"/categories/infrastructure/","section":"Categories","summary":"","title":"Infrastructure","type":"categories"},{"content":"","date":"5 December 2025","externalUrl":null,"permalink":"/tags/infrastructure-as-code/","section":"Tags","summary":"","title":"Infrastructure as Code","type":"tags"},{"content":"There\u0026rsquo;s a moment in every data engineer\u0026rsquo;s life when you find yourself staring at a 300-line SQL file that creates roles, grants, warehouses, and you wonder how you got here. This is my story.\nSnowflake: All SQL, for Better and Worse # Snowflake has accomplished something few platforms manage: completely abstracting physical infrastructure while retaining granular control over everything that matters. No servers to maintain, no clusters to size. Just databases, schemas, warehouses, roles, and SQL commands to drive everything.\nSnowsight, its UI, is capable. You can create objects, manage access, visualize data. But like any tool that exists in both UI and programmatic form, the real power is on the code side. The UI gives you access to features. SQL gives you control. Programmatic manipulation gives you reproducibility.\nIt\u0026rsquo;s the classic analogy with command-line tools: they seem less accessible than a GUI, but they\u0026rsquo;re scriptable, versionable, automatable. Every action becomes reproducible, auditable, integrable into a pipeline. A GRANT executed in Snowsight disappears the moment you close the tab. 
The same GRANT declared in Terraform code is tracked, reviewable, reversible.\nThe problem is that Snowflake is so accessible in ad hoc SQL that you end up doing everything that way. A GRANT here, a new role through the console, a warehouse created in an emergency on a Tuesday night. Each of these actions is harmless on its own. Together, they become infrastructure that\u0026rsquo;s impossible to audit.\nThe SQL Scripts Era # When I first started building a Snowflake infrastructure, I did what everyone does: .sql files. One script to create databases, another for schemas, another for roles. Simple, direct, it works.\nExcept it works up to a point.\nFirst week: 5 SQL files, well organized. First month: 15 files, a few IF NOT EXISTS for idempotence. Third month: you realize you need to modify a role and you\u0026rsquo;re scanning 4 files to make sure you don\u0026rsquo;t miss a grant somewhere. And you end up wondering whether the current state of Snowflake actually matches what\u0026rsquo;s in your scripts, or if someone ran a GRANT manually in the console on a Tuesday night.\nThe Migrations Episode # Then I tried migrations. Like in application development: numbered files, each with an incremental change. 001_create_databases.sql, 002_add_marketing_role.sql, 003_fix_grant_on_silver.sql\u0026hellip;\nOn paper, it\u0026rsquo;s better. You have a history. You can trace the evolution. But in practice:\nIf someone makes a manual change, the migrations and reality silently diverge Rolling back a REVOKE or DROP ROLE isn\u0026rsquo;t like an ALTER TABLE, the cascade effects are unpredictable You have no idea about the current state without running a full audit And above all, you end up with 50 migration files and nobody knows what the infrastructure is supposed to be, just what it\u0026rsquo;s become The Obvious Answer: Terraform # Then one day I wondered if there was a Terraform provider for Snowflake. 
The answer: yes, and it\u0026rsquo;s maintained by Snowflake directly.\nIt seemed like the obvious fit. Terraform is exactly the right tool for this problem:\nDeclarative: you describe the desired state, not the steps to get there Plan before apply: terraform plan shows you exactly what will change before it changes Single source of truth: the Terraform code describes the desired state of the infrastructure, not an approximation, not a history of migrations History via git: every change is a commit, reviewable, reversible Idempotent: you can run it 10 times, it gives the same result Compared to SQL scripts: if someone makes a manual change in the Snowflake console, the next terraform plan shows it as drift. You see the difference between the real state and the desired state. With SQL scripts, you see nothing.\n(In reality, Terraform has its own irritants: state that gets corrupted, painful imports, breaking changes in the provider between versions. But these problems are manageable, and the gain in visibility more than compensates.)\nRoles: Thinking in Layers # The concept that helped me most is structuring roles in three levels. It\u0026rsquo;s a practice recommended by Snowflake in their access control documentation, documented by several Snowflake architects (example on the official blog).\nLevel 1: Snowflake system roles. ACCOUNTADMIN, SYSADMIN, SECURITYADMIN. You don\u0026rsquo;t create them, they already exist. But you configure Terraform to use the right role in the right place: SYSADMIN to create databases and warehouses, SECURITYADMIN for roles and grants. Principle of least privilege.\nLevel 2: Access roles. These are technical, granular roles that grant access to a specific schema at a specific level. Like SILVER_RO (read-only on the Silver schema), GOLD_RW (read-write on Gold), STAGING_FULL (full access to the staging schema). Naming convention matters: by reading the role name, you know exactly what it does.\nLevel 3: Functional roles. 
These are the business roles, the ones you assign to humans. An ANALYST role, an ENGINEER role, a REPORTING role. Each functional role aggregates multiple access roles. The Analyst role gets SILVER_RO + GOLD_RO. The engineer gets broader access.\nThe flow is simple: User → Functional role → Access roles → Permissions on schemas.\nThe advantage of this approach: when a new analyst joins, you assign them to the ANALYST functional role and they automatically have access to everything they need. No grant list to maintain manually.\nCascading Grants # One of the most satisfying aspects of this approach is grant cascading. In Snowflake, you can create a role hierarchy: one role can \u0026ldquo;contain\u0026rdquo; another.\nConcretely, if you have three access levels for a schema (read, write, create), you structure it as a cascade:\nThe Create role inherits from the Read-Write role The Read-Write role inherits from the Read-Only role You only need to grant privileges once at each level When you assign the Create role to someone, they automatically get write and read permissions through inheritance. No grant duplication, no extra maintenance.\nFuture Grants: Anticipating the Future # A critical pattern in Snowflake: future grants. When you create a role with SELECT on a schema, it applies to tables that exist now. But what about when dbt creates a new model tomorrow? Without future grants, nobody has access to it.\nTerraform lets you declare future grants: \u0026ldquo;all future objects in this schema will automatically inherit these permissions.\u0026rdquo; It\u0026rsquo;s the kind of detail that makes the difference between infrastructure that works on day 1 and infrastructure that still works 6 months later when the data team has added 50 models.\nUsers: Configuration, Not Code # Adding a user, in this approach, doesn\u0026rsquo;t require writing Terraform code. 
You fill in a configuration:\nTheir identity (name, email) Their team (marketing, finance, data engineering\u0026hellip;) Are they an admin? Do they need a development sandbox? From this configuration, Terraform automatically calculates and generates:\nThe user account Assignment to their team\u0026rsquo;s functional role Default warehouse Sandbox access if applicable Adding a new member means modifying a few lines of configuration, running terraform plan to verify, and terraform apply. No SQL to write, no grants to hunt for.\nThe Terraform / dbt Boundary # An important point: where does Terraform stop and where does dbt begin?\nMy view: Terraform manages everything above the schema. dbt manages everything below.\nTerraform handles:\nCreating databases and schemas Creating warehouses Managing roles and permissions Defining masking policies Configuring monitoring dbt handles:\nCreating tables and views within schemas Transforming data Applying governance tags to columns Documenting models Testing data quality Infrastructure (Terraform) evolves slowly: a new schema per month, a new role per quarter. Transformations (dbt) evolve every day: new models, business logic, corrections.\nThis separation means two teams can work in parallel without stepping on each other. The platform engineer manages Terraform, the data engineer manages dbt. Each in their own repo, with their own deployment cycle.\nData Masking: A Good Integration Example # A concrete example of how Terraform and dbt collaborate: data masking.\nTerraform creates tag-based masking policies: \u0026ldquo;if a column is tagged PII, mask the value except for roles that have the right to see it.\u0026rdquo; It also creates the tags and unmasking roles.\ndbt, on its side, applies tags to columns when it creates models: \u0026ldquo;this email column is PII, this amount column is FINANCIAL.\u0026rdquo;\nThe result: when a marketing user runs SELECT * FROM clients, emails are automatically masked. 
The user with the right role sees the real values. Nobody had to write masking logic in SQL. It\u0026rsquo;s handled by the combination of infrastructure + metadata.\nAudit and Iterate # One of the most underrated benefits of this approach: auditability.\nWhen a security audit asks \u0026ldquo;who has access to what?\u0026rdquo;, the answer is in the code. Not in a Snowflake console where you have to navigate 15 pages. Not in a manually maintained Excel document. In the code, versioned, with the complete history of who changed what and when.\nAnd when you want to add a new team, a new schema, or modify permissions, it\u0026rsquo;s a standard process: modify the config, open a PR, review, merge, apply. The same workflow as for application code.\nLLM-Assisted # One last point worth mentioning: once your infrastructure is declared as structured code, an LLM becomes a remarkably effective assistant. You can ask it to create a new functional role, add a user, modify permissions, and it produces valid Terraform that follows your existing conventions.\nWith ad hoc SQL scripts, this is much riskier. The AI doesn\u0026rsquo;t know the current state of the infrastructure. With Terraform, the desired state is in the code, and plan validates that the result is correct before anything is applied. The AI proposes, Terraform verifies.\nThe Real Gain # In the end, what changed isn\u0026rsquo;t the technology, it\u0026rsquo;s the confidence. You know that the state of your infrastructure matches the code. You know the permissions are correct. You know nobody made an undocumented change. And when someone asks you to add access or create a new schema, it\u0026rsquo;s 5 minutes of configuration instead of half an hour of SQL scripts with the fear of breaking something.\nSQL scripts are craftsmanship. Migrations are better-organized craftsmanship. Terraform isn\u0026rsquo;t perfect either: state can be fragile, plan errors are sometimes cryptic, and the Snowflake provider has its own bugs. 
But at least you know where you stand. And when something breaks, you know why.\n","date":"5 December 2025","externalUrl":null,"permalink":"/blog/snowflake-terraform-infrastructure-as-code/","section":"Blog","summary":"\u003cp\u003eThere\u0026rsquo;s a moment in every data engineer\u0026rsquo;s life when you find yourself staring at a 300-line SQL file that creates roles, grants, warehouses, and you wonder how you got here. This is my story.\u003c/p\u003e","title":"Snowflake + Terraform: Stop Managing Your Data Infrastructure in SQL","type":"blog"},{"content":"","date":"5 December 2025","externalUrl":null,"permalink":"/tags/terraform/","section":"Tags","summary":"","title":"Terraform","type":"tags"},{"content":"","date":"15 November 2025","externalUrl":null,"permalink":"/tags/aws/","section":"Tags","summary":"","title":"AWS","type":"tags"},{"content":"","date":"15 November 2025","externalUrl":null,"permalink":"/categories/backend/","section":"Categories","summary":"","title":"Backend","type":"categories"},{"content":"So I built this Planning Poker app. You know, that agile estimation thing where teams gather around and vote on story points? Yeah, I decided to make it web-based and real-time pokerplanning.net. And then\u0026hellip; well, let\u0026rsquo;s just say I got a little ambitious with the deployment setup.\nThe Idea # Planning Poker sessions are usually chaos in a conference room — someone\u0026rsquo;s writing on a whiteboard, someone else is yelling their estimate, and half the team\u0026rsquo;s zoning out. The problem\u0026rsquo;s simple: you need something that works instantly, doesn\u0026rsquo;t require sign-ups, and just\u0026hellip; works.\nSo I grabbed Go for the backend (fast, compiled, solid concurrency model), PocketBase as my all-in-one database and auth layer, and added htmx + Alpine.js on the frontend for that reactive feel without building a full React app. WebSockets for real-time updates. Simple.\nWhy PocketBase? 
(The WordPress Guy\u0026rsquo;s Take) # Here\u0026rsquo;s the thing — I spent years doing WordPress development. Loved it, hated it, the usual. When I transitioned into actual software development, I started doing the whole dance: pick an ORM, set up migrations, wire up authentication, build your router, handle database schema changes\u0026hellip; it\u0026rsquo;s exhausting.\nPocketBase felt like the Go equivalent of what WordPress was to me — a sensible all-in-one baseline. You get a database layer, authentication, an admin UI, migrations that actually work, and an HTTP router. It\u0026rsquo;s structured enough to move fast, but flexible enough to extend.\nI didn\u0026rsquo;t want to spend the next week messing around with Gorm, writing migration files, building auth middleware, and configuring a router. I wanted to ship something in a few days. PocketBase let me focus on the actual problem: making Planning Poker work in real-time.\nAnd if this ever turns into something I want to monetize as a SaaS, I can build on top of this foundation. The headless nature of PocketBase means I can eventually swap out the frontend, add a pricing layer, whatever. It\u0026rsquo;s a solid base.\nSQLite: The Unsung Database Hero # Real talk: I love Postgres. Seriously. But for this project? SQLite was the right call.\nMost people don\u0026rsquo;t realize this, but SQLite is the most deployed database engine in the world. Your phone probably has a dozen SQLite databases running right now. Android, iOS, Firefox, Chrome — all SQLite. It\u0026rsquo;s boring, reliable, and seriously underestimated.\nFor a Planning Poker app that doesn\u0026rsquo;t need horizontal scaling or complex multi-user transactions at massive scale, SQLite is exactly what you need. No separate database server to manage, no connection pooling headaches, no \u0026ldquo;is the DB down?\u0026rdquo; crisis calls at 3 AM. It lives in a file. 
You can back it up, version it, move it around.\nI made a deliberate choice this time: don\u0026rsquo;t over-engineer the database. The app doesn\u0026rsquo;t need the complexity of Postgres. With SQLite, it handles 20,000 concurrent WebSocket connections on a t3.micro without breaking a sweat. That\u0026rsquo;s plenty.\nWhat It Does # The core loop is pretty straightforward:\nCreate a room, no login needed People join with a name You set up a voting round with Fibonacci or custom values Everyone votes at the same time Reveal and discuss Repeat There\u0026rsquo;s role-based stuff too — you can be a voter or just a spectator. Room creators can lock things down however they want. Everything syncs in real-time via WebSocket, so when someone votes, everyone sees it instantly (or sees that they voted, at least — votes are hidden until reveal).\nThe state management is clean. Rounds have states: voting, revealed, completed. Participants track who\u0026rsquo;s where. Votes live in the database. Rooms auto-expire after 24 hours so you\u0026rsquo;re not storing dead data forever.\nPerformance? Yeah, I Checked # Here\u0026rsquo;s where it gets a bit absurd. On a t3.micro (1 vCPU, 1GB RAM), this thing can handle:\n2,000-3,000 concurrent rooms 20,000-30,000 WebSocket connections That\u0026rsquo;s\u0026hellip; a lot more than you\u0026rsquo;d ever need for Planning Poker. But I wasn\u0026rsquo;t about to ship something that couldn\u0026rsquo;t handle its own success, right?\nI built in async broadcasting with non-blocking message delivery, per-client send channels with buffering, fine-grained locking. There are monitoring endpoints so you can peek at real-time metrics. Slow clients get detected and cleaned up automatically. It\u0026rsquo;s way too complex for what it does, but it works.\nThe Deployment Rabbit Hole # And then I got to deployment.\nInstead of just throwing it on a server with SSH, I decided to go full enterprise mode. 
Here\u0026rsquo;s the setup:\nGitHub Actions watches for git tags Builds a multi-architecture Docker image Pushes to GitHub Container Registry Triggers an AWS Systems Manager Run Command EC2 instance pulls the image Docker Compose spins up the containers Health checks validate everything\u0026rsquo;s live Zero SSH keys exposed. No ports open except HTTP/HTTPS. Everything\u0026rsquo;s audited in CloudTrail. And yeah, I went full Terraform on the infrastructure — EC2, security groups, IAM roles, the whole thing.\nIs it overkill for a Planning Poker app? 100%. Could I have just SSHed into a box and run it? Yeah. But this way, deploying is literally just git tag v1.0.0 \u0026amp;\u0026amp; git push origin v1.0.0. Two minutes later it\u0026rsquo;s live. And you\u0026rsquo;ve got an audit trail, automatic rollback capabilities, and infrastructure-as-code. So really, I\u0026rsquo;m just being thorough.\nTech Stack (The Real One) # Backend: Go 1.25, PocketBase 0.30 (which is essentially Echo + SQLite + admin UI in one binary) Frontend: htmx 2.0 for AJAX/WebSocket, Alpine.js 3.14 for interactivity, Templ for templating Database: SQLite (bundled in PocketBase) Deployment: Docker, Docker Compose, Terraform, GitHub Actions, AWS SSM Monitoring: Built-in metrics endpoint, health checks Why This Stack? # PocketBase was the real discovery here. Everyone wants to build a backend, but PocketBase just gives you one. Database, migrations, admin UI, auth, all of it. I just had to wire up the WebSocket hub and business logic. That\u0026rsquo;s time not spent messing around with boilerplate.\nhtmx + Alpine is underrated for this kind of project. No build pipeline headaches, no JavaScript framework fatigue, just declarative HTML attributes that do what you\u0026rsquo;d expect. Progressive enhancement, hypermedia, all that good stuff. 
You write less code, it\u0026rsquo;s easier to follow, and your frontend doesn\u0026rsquo;t become a maintenance nightmare in six months.\nAnd Go\u0026rsquo;s goroutines made the WebSocket hub trivial. Broadcasting messages to thousands of connections? Goroutines with channels. Done. That\u0026rsquo;s the real advantage of Go for this use case — concurrency that doesn\u0026rsquo;t drive you crazy.\nThe Real Challenge # Honestly? Getting the state machine right. Rounds need to flow properly: voting → revealed → completed → new round. Participants need to stay synced. Connections drop and reconnect — you can\u0026rsquo;t lose someone\u0026rsquo;s vote because their WiFi dropped out.\nThat took more thought than the deployment did. The infrastructure was just\u0026hellip; doing what infrastructure does. The hard part was making sure the voting logic was solid.\nSo Is It Done? # Yeah, it works. You can run make dev and it spins up locally with live reload. There are integration tests. It\u0026rsquo;s got metrics, health checks, proper error handling.\nIs the deployment way too complex for a Planning Poker app that probably won\u0026rsquo;t ever break a sweat at scale? Absolutely. But hey, it\u0026rsquo;s there, it works, and now I\u0026rsquo;ve got a template for deploying Go apps to AWS without touching SSH. That\u0026rsquo;s worth something, right?\nThe real lesson though? Don\u0026rsquo;t over-engineer the database. SQLite did the job. PocketBase saved me from spending three days on boilerplate. Go\u0026rsquo;s concurrency primitives made the hard parts easy. 
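For the curious, the goroutines-with-channels approach boils down to a hub like this stripped-down sketch (not the app's actual code; names and buffer sizes are illustrative):

```go
package main

import "fmt"

// client holds a buffered send channel; a full buffer marks it as slow.
type client struct {
	id   string
	send chan string
}

// hub serializes register/unregister/broadcast through channels,
// so no mutex is needed around the clients map.
type hub struct {
	clients    map[*client]bool
	register   chan *client
	unregister chan *client
	broadcast  chan string
}

func newHub() *hub {
	return &hub{
		clients:    make(map[*client]bool),
		register:   make(chan *client),
		unregister: make(chan *client),
		broadcast:  make(chan string),
	}
}

func (h *hub) run() {
	for {
		select {
		case c := <-h.register:
			h.clients[c] = true
		case c := <-h.unregister:
			if h.clients[c] {
				delete(h.clients, c)
				close(c.send)
			}
		case msg := <-h.broadcast:
			for c := range h.clients {
				select {
				case c.send <- msg: // non-blocking delivery
				default: // slow client: drop it instead of stalling everyone
					delete(h.clients, c)
					close(c.send)
				}
			}
		}
	}
}

func main() {
	h := newHub()
	go h.run()

	a := &client{id: "a", send: make(chan string, 8)}
	b := &client{id: "b", send: make(chan string, 8)}
	h.register <- a
	h.register <- b

	h.broadcast <- "vote: 5"
	fmt.Println(<-a.send, "/", <-b.send) // prints "vote: 5 / vote: 5"
}
```

Each client gets its own buffered channel, so one stalled connection never blocks the broadcast loop; that is the non-blocking delivery and slow-client cleanup mentioned above.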
And sometimes the best infrastructure is the one that gets out of your way and just works.\nYou could actually use this to run Planning Poker sessions right now at pokerplanning.net.\n","date":"15 November 2025","externalUrl":null,"permalink":"/blog/planning-poker-app/","section":"Blog","summary":"\u003cp\u003eSo I built this Planning Poker app. You know, that agile estimation thing where teams gather around and vote on story points? Yeah, I decided to make it web-based and real-time \u003ca\n  href=\"https://pokerplanning.net/\"\n    target=\"_blank\"\n  \u003epokerplanning.net\u003c/a\u003e. And then\u0026hellip; well, let\u0026rsquo;s just say I got a little ambitious with the deployment setup.\u003c/p\u003e","title":"Building Planning Poker: A Real-time Collaboration App (That I Might Have Over-Engineered)","type":"blog"},{"content":"","date":"15 November 2025","externalUrl":null,"permalink":"/categories/development/","section":"Categories","summary":"","title":"Development","type":"categories"},{"content":"","date":"15 November 2025","externalUrl":null,"permalink":"/fr/categories/d%C3%A9veloppement/","section":"Categories","summary":"","title":"Développement","type":"categories"},{"content":"","date":"15 November 2025","externalUrl":null,"permalink":"/tags/devops/","section":"Tags","summary":"","title":"DevOps","type":"tags"},{"content":"","date":"15 November 2025","externalUrl":null,"permalink":"/tags/go/","section":"Tags","summary":"","title":"Go","type":"tags"},{"content":"","date":"15 November 2025","externalUrl":null,"permalink":"/tags/htmx/","section":"Tags","summary":"","title":"Htmx","type":"tags"},{"content":"","date":"15 November 2025","externalUrl":null,"permalink":"/tags/pocketbase/","section":"Tags","summary":"","title":"PocketBase","type":"tags"},{"content":"","date":"15 November 2025","externalUrl":null,"permalink":"/tags/websockets/","section":"Tags","summary":"","title":"WebSockets","type":"tags"},{"content":" Professional Experience # Groupe 
Alesco # Lead Data \u0026amp; AI Engineer | ETL Pipelines · MLOps | Analytics Platform # October 2025 - Present | Saint-Hubert, Quebec\nModernization and industrialization of data infrastructure to support analytics and artificial intelligence initiatives. Transformation of manual processes into automated pipelines and implementation of an accessible, high-performance analytics platform.\nTechnical Expertise:\nData pipeline architecture and automation (ETL/ELT) Analytics platform implementation and optimization (Snowflake) Machine learning and generative AI model operationalization Democratization of data access for business and analytics teams Full Stack Software Engineer | PHP/Symfony · AWS | Financial Solutions # February 2024 - October 2025 | Saint-Hubert, Quebec\nDesign and development of high-performance web applications within an agile team in the alternative financing sector. Active participation in architecture, development, and optimization of critical platforms for customer experience.\nTechnical Expertise:\nLAMP stack (Symfony, MySQL, Redis, Nginx) and AWS ecosystem Frontend development with React/Next.js Scalable and secure solution architecture Third-party API integration and performance optimization Mentorship and knowledge sharing within the team Optable # Backend Software Engineer | Go · gRPC · GCP | Data Collaboration Platform # September 2022 - January 2024 | Montreal, Quebec, Canada\nDevelopment and optimization of the backend for an innovative privacy-first data collaboration platform. 
Contributing to a solution that meets the highest industry standards in quality, security, and performance.\nTechnical Expertise:\nBackend development in Go with gRPC architecture Cloud infrastructure (GCP) and Infrastructure as Code (Terraform) Relational and analytical databases (Postgres, BigQuery) Frontend development with React.js Automated testing and rigorous code reviews Microservices architecture and high-performance APIs Vortex Solution # Backend Software Engineer | PHP · WordPress · Vue.js | Custom Web Applications # May 2020 - August 2022 | Montreal, Quebec, Canada\nDevelopment of custom web applications for modern institutional websites within a renowned web agency. Specialization in creating complex solutions tailored to specific client needs, with a focus on performance and user experience.\nTechnical Expertise:\nDevelopment of advanced search engines and SSO systems External API integration and authentication systems Performance optimization and large-scale data management Interactive module development with Vue.js Custom WordPress solution architecture PHP/JavaScript stack for scalable web applications Ekkip boutique sport # Full Stack Software Engineer \u0026amp; IT Manager | Web Development · Infrastructure · Digital Marketing # July 2017 - May 2020 | Montreal, Canada\nComplete management of the technology ecosystem for a dynamic sports retail company. 
Responsible for web development, IT infrastructure, and digital marketing strategies, combining technical expertise with creative vision to support company growth.\nTechnical and Strategic Expertise:\nDevelopment and maintenance of main website Creation of custom internal software to optimize operations Computer network management and security Web design and graphic design (web, print, advertising) Development and execution of digital marketing strategies Video content production and multimedia creation Technical support and IT infrastructure Early Career in France # Web Developer \u0026amp; Digital Marketing Specialist # August 2012 - July 2017 | Various locations in France\nFreelance and agency roles in web development, SEO, and digital marketing. Built foundational expertise in PHP/JavaScript development, WordPress, frontend integration, e-commerce, webmarketing, and multimedia production across multiple client projects and companies.\nEducation # University of Montreal # D.E.S.S. Fine Arts and Creative Technologies 2016 - 2017\nVR/AR development, interactive media, and creative technologies with focus on Unity, videomapping, and Arduino/Raspberry Pi integration.\nUniversity of Toulon, France # Licence Ingémedia (Bachelor) - Information \u0026amp; Communication 2014 - 2015\nDesign and multimedia communication.\nDUT Services et Réseaux de Communication (Associate Degree) 2011 - 2014\nDistributed systems architecture, software development, and project management.\n","externalUrl":null,"permalink":"/resume/","section":"Damien GOEHRIG","summary":"Data engineer with 10+ years of software development experience in Go and PHP, now specializing in ETL pipeline architecture, analytics platform modernization, and MLOps. Strong technical foundation in building scalable data infrastructure and distributed systems. 
Expertise in transforming manual processes into automated, production-grade data solutions.","title":"Damien GOEHRIG","type":"resume"},{"content":"","externalUrl":null,"permalink":"/stack/","section":"Stack","summary":"","title":"Stack","type":"stack"}]