The exchange of massive amounts of data is critical for the majority of business processes today, enabling innovative customer experiences at scale. But quickly getting clean, high-quality data where it needs to be, whether to an in-house system or to external partners, is a big challenge for data teams. Doing so in real time is even more complex. Moving data securely, reliably, and quickly requires good data governance, but what kind of frameworks are required to ensure data is well-governed through real-time distribution within the organization?
At Capital One, we set off on a tech transformation over a decade ago that required us to modernize our data ecosystem on the cloud. We have built (and will continue to evolve) a central, foundational data ecosystem that enables teams across the company to leverage and share well-governed data. Good governance has played a crucial role in modernizing our data ecosystem, and it is even more critical today.
The best practices outlined below can help companies enable their teams to leverage data in a well-governed fashion by focusing on implementing central data standards and platforms with built-in data governance.
Build a Central, Self-Service Portal
To ensure data remains well-governed throughout its lifecycle, start by building a central hub where data from all your separate repositories can be accessed in one place. From here, you can set up multiple pipelines with rules, restrictions and policies dictating data accessibility, data velocity (e.g., whether data is streamed or not), schema enforcement, data quality, and more. This self-service portal should allow your organization to virtualize all data sources into a single, unified data layer. This provides a bird’s-eye view of your data landscape, making it easier for users to access and use while implementing governance controls around data access, privacy, security and more. Having this centralized self-service portal is key to federating data out across the company.
Establish Quality-of-Service Governance
Whether data will be shared in real time or asynchronously, it’s important to ensure that all data adheres to the governance policies defined for its sensitivity and value. Even data that may not seem necessary to access in real time today could become critical in the future. From the outset, you should apply varying levels of governance and controls around access and security depending on the data. This means applying rigor around governance at the beginning of the data lifecycle, which might include robust data quality monitoring, lineage tracking, and security controls, depending on the value and sensitivity of the data. That way, any dataset can easily be surfaced and shared as requirements evolve, without costly refactoring later on.
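To make the idea of varying levels of governance concrete, here is a minimal Python sketch of how controls might be assigned when a dataset is first registered. The level names, the control names, and the thresholds are all hypothetical and chosen only for illustration; they are not a description of any particular platform.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class Dataset:
    name: str
    sensitivity: Level  # e.g. HIGH for personal or financial data
    value: Level        # business value assigned by the data owner


def required_controls(ds: Dataset) -> set[str]:
    """Return the (hypothetical) controls to apply at registration time."""
    # Baseline applied to every dataset, so it can be shared later
    # without refactoring.
    controls = {"schema_enforcement", "catalog_registration"}
    if ds.value >= Level.MEDIUM:
        controls |= {"data_quality_monitoring", "lineage_tracking"}
    if ds.sensitivity >= Level.MEDIUM:
        controls.add("role_based_access")
    if ds.sensitivity == Level.HIGH:
        controls |= {"encryption_at_rest", "access_audit_log"}
    return controls
```

The point of the sketch is that the decision is made once, at the start of the lifecycle, from attributes the data owner already knows.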
Publish Once, Publish Right
When data moves in milliseconds, strong governance ensures that it flows to the right places through the right rules at the right time. Establish rules about when and where data is published and to which applications it becomes available, and put monitoring and observability in place as well. Teams need confidence their data will be available for specific critical use cases exactly when they need it, whether that’s in real time or asynchronously. At Capital One, the use of real-time data helps detect fraud and enable fast, secure transactions, but batch data is still needed to power use cases and drive AI/ML at scale.
Make Data Traceable and Auditable
Transparency is critical when setting up a data governance structure. Teams need to be able to monitor and audit all data flows to ensure compliance with governance frameworks, identify potential issues, ensure data security, and improve overall efficiency.
This is where your centralized data hub comes back into play, providing granular publish and subscribe capabilities so the owners of the data can monitor which datasets get shared with which teams and under which parameters. You can set service level agreements (SLAs) around data freshness requirements. In addition, observability tooling enables data teams to monitor whether SLAs are being met across data pipelines.
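As a rough illustration of a data freshness SLA check, the Python sketch below flags subscriptions whose dataset has not been published recently enough. The `FreshnessSla` record, the dataset and team names, and the idea of a simple "last published" lookup are assumptions made for the example; a real observability tool would supply its own equivalents.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class FreshnessSla:
    dataset: str
    subscriber: str
    max_staleness: timedelta  # how old the latest publish may be


def breached_slas(
    slas: list[FreshnessSla],
    last_published: dict[str, datetime],
    now: datetime,
) -> list[FreshnessSla]:
    """Return SLAs whose dataset is stale or has never been published."""
    breached = []
    for sla in slas:
        published_at = last_published.get(sla.dataset)
        if published_at is None or now - published_at > sla.max_staleness:
            breached.append(sla)
    return breached


# Hypothetical subscriptions: one real-time, one daily batch.
slas = [
    FreshnessSla("card_transactions", "fraud_team", timedelta(seconds=5)),
    FreshnessSla("marketing_events", "analytics_team", timedelta(hours=24)),
]
now = datetime.now(timezone.utc)
last_published = {
    "card_transactions": now - timedelta(seconds=30),
    "marketing_events": now - timedelta(hours=2),
}
for sla in breached_slas(slas, last_published, now):
    print(f"SLA breached: {sla.dataset} for {sla.subscriber}")
```

Keeping the SLA attached to the subscription, rather than to the dataset alone, lets the same dataset serve a real-time consumer and a batch consumer with different freshness expectations.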
Invest in the Right Storage
To make wide-scale data sharing possible, companies need to invest heavily in the right storage and infrastructure. Most data lakes and warehouses also allow users to toggle levels of access and monitoring for specific datasets. Make sure to check on the level of controls and monitoring offered by your vendors of choice. Not all data needs to be stored in the highest performance (and highest cost) warehouses all the time; some data can be stored more economically in data lakes if it doesn’t need to be accessed and shared in real time. Even within the context of real-time data, there are mechanisms to trade off cost and performance. The key is to establish governance mechanisms that move data across storage tiers based on access requirements and use cases, guided by quality-of-service levels and SLAs that define latency, retention, and cost tolerance.
Another tip when balancing cost and performance is to ensure all data is tagged with good metadata, such as required retention periods, time since last access and usage patterns. This metadata allows us to automatically move data into different storage tiers — keeping some data in accelerated tiers, while archiving other data to cheaper storage. This multi-tier approach also ensures all data, no matter its current usability, is stored and findable for future use. You never know when data that seems unimportant today will become important tomorrow.
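A metadata-driven tiering rule of this kind can be sketched in a few lines of Python. The tier names, the metadata fields, and the thresholds (100 reads in 30 days, 180 idle days) are hypothetical values picked for the example, not recommendations.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DatasetMetadata:
    name: str
    last_accessed: date
    reads_last_30_days: int   # simple stand-in for "usage patterns"
    needs_real_time: bool     # set by the data owner or an SLA


def choose_tier(
    meta: DatasetMetadata,
    today: date,
    hot_reads: int = 100,
    archive_after_days: int = 180,
) -> str:
    """Pick a storage tier from metadata alone (illustrative thresholds)."""
    if meta.needs_real_time or meta.reads_last_30_days >= hot_reads:
        return "accelerated"
    idle_days = (today - meta.last_accessed).days
    if idle_days <= archive_after_days:
        return "standard"
    # Archived data stays stored and findable, just on cheaper storage.
    return "archive"
```

Because the rule depends only on tags, it can be re-run on a schedule so that a dataset moves back to a faster tier if usage picks up again.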
By taking a strategic approach to data governance upfront, an enterprise can unlock the full potential of its data at scale. Users can find, access, and use data quickly, securely and reliably to power real-time applications and critical decision-making. While implementing robust data governance requires a significant investment and tight cooperation between data, business, and leadership teams, the competitive advantages of being a truly data-driven organization make the effort worthwhile.
About the author: Marty Andolino, VP of Engineering, Enterprise Data Technology at Capital One. In his role, Marty leads a team responsible for data pipelines, data governance services, and external data sharing. Having been with Capital One for more than nine years, he has held various tech roles across retail, marketing, fraud, data, decisions, and architecture. He is passionate about building a positive customer experience, innovative technology solutions, and mentoring.
The post Want to Build a Data-Driven Business? Start with Good Governance appeared first on BigDATAwire.