Why Clinical Trial Data Cleaning Must Go Continuous (And How To Do It)
By John Oncea, Chief Editor, Clinical Tech Leader

Much like the Six Million Dollar Man, data has become better, stronger, and faster, driven by the proliferation of wearables, ePRO, eSource, and connected systems. But just because we have more data doesn’t mean it’s good data. Much of what is being mined is of poor quality, slowing submissions, weakening conclusions, and creating compliance risk.
That’s where data cleansing – or cleaning or scrubbing, no judgment here – enters the discussion. A structured, often iterative process of detecting and correcting (or removing) corrupt, inaccurate, incomplete, or incorrectly formatted data from a dataset, data cleansing is a critical component of data management that ensures data quality for analysis.
Truly effective data cleansing is a continuous process that spans collection, review, reconciliation, and database lock, and it must account for multiple data streams, not just traditional EDC data. Ensuring data is trustworthy means reviewing EDC entries, detecting inconsistencies, validating entries against the protocol, auditing data, and issuing queries to site staff for resolution, all under GCP guidelines and in preparation for statistical analysis.
Despite its importance, many organizations treat data cleansing as an annoying checkbox before the “real” analytics work begins. The reality is that, done right, data cleansing turns quality into an ongoing process with feedback loops, dashboards, and clear ownership rather than a one-off cleaning phase.
The Data Deluge Has Raised The Stakes
The volume, variety, and velocity of clinical trial data have grown dramatically over the past decade. Research tracking digital health technology use across clinical trials has identified more than 2,300 unique trials incorporating these tools, with wearables – particularly smartwatches and wrist-worn sensors – representing the most widely used category, according to the National Center for Biotechnology Information (NCBI).
Unlike traditional site visits that offer only snapshots of treatment efficacy, digital instruments can transform medical assessment into continuous or intermittent real-life tracking in a patient’s normal environment, with symptoms fluctuating week to week, day to day, and even within a single day.
That continuous stream of incoming data creates a cleaning and review challenge unlike anything the industry faced when paper CRFs were the norm. Not all data collected is equally useful, and quality-by-design thinking has become essential to focusing teams on what truly supports the study hypothesis.
Studies have reported data error rates ranging from as low as 0.14 percent with double data entry to over 6 percent with more manual methods. One trial demonstrated that introducing real-time validation dramatically reduced data-entry errors from 0.3 percent to 0.01 percent, according to NCBI. Those numbers have direct consequences for timelines, submissions, and patient safety.
What “Clean” Actually Means In A Modern Trial
The common error types that drive cleaning activity are familiar to anyone who has worked a database lock: data entry errors such as typos, incorrect values, and missing fields; logical inconsistencies involving impossible dates or mismatched visit timelines; duplicate or missing records; source-to-eCRF discrepancies; and integration inconsistencies between EDC, labs, ePRO, or EHR systems. Each carries clinical consequences beyond the data itself: slower review cycles, additional queries, protocol deviations, and risk to analysis integrity.
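To make those categories concrete, here is a minimal sketch of the kind of batch checks a data manager might run against a visit-level extract. The column names (subject_id, visit_num, visit_date, informed_consent_date) are illustrative assumptions, not a standard.

```python
import pandas as pd

def flag_logical_inconsistencies(visits: pd.DataFrame) -> pd.DataFrame:
    """Flag date typos, impossible timelines, and duplicate visit records.

    Assumes a visit-level extract with hypothetical columns:
    subject_id, visit_num, visit_date, informed_consent_date.
    """
    df = visits.copy()
    for col in ("visit_date", "informed_consent_date"):
        df[col] = pd.to_datetime(df[col], errors="coerce")  # typos become NaT

    # Data entry errors: missing or unparseable visit dates
    df["flag_missing_date"] = df["visit_date"].isna()

    # Logical inconsistency: a visit recorded before informed consent
    df["flag_before_consent"] = df["visit_date"] < df["informed_consent_date"]

    # Mismatched visit timeline: later visit numbers dated earlier than prior visits
    ordered = df.sort_values(["subject_id", "visit_num"])
    backwards = ordered.groupby("subject_id")["visit_date"].diff().dt.days.lt(0)
    df["flag_out_of_order"] = backwards.reindex(df.index)

    # Duplicate records: the same subject and visit number entered twice
    df["flag_duplicate"] = df.duplicated(subset=["subject_id", "visit_num"], keep=False)

    return df[df.filter(like="flag_").any(axis=1)]
```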
Regulatory authorities, including the FDA and the European Medicines Agency, highlight that data accuracy, integrity, and traceability are mandatory elements of Good Clinical Practice, making effective data cleaning a regulatory expectation rather than simply an operational activity.
At its core, effective data cleaning relies on explicit, human-readable rules; not smart code buried in opaque pipelines, but a shared library of business logic that explains what is being enforced and why. That is where data quality and governance align: when non-technical stakeholders can read, challenge, and evolve the rules rather than relying on tribal knowledge passed down through analyst handoffs.
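One way to keep that logic human-readable is to store the rules as plain data rather than code buried in a pipeline. The sketch below assumes a small registry of pandas query expressions paired with plain-English descriptions; the rule IDs, field names, and thresholds are illustrative, and the evaluation assumes a single wide extract containing the referenced fields.

```python
import pandas as pd

# A shared, human-readable rule library: each entry states what is enforced and why.
# Rule IDs, fields, and thresholds are illustrative assumptions, not a standard.
EDIT_CHECK_LIBRARY = [
    {
        "rule_id": "VS-001",
        "description": "Systolic blood pressure must be between 60 and 260 mmHg.",
        "expression": "sysbp < 60 or sysbp > 260",
        "severity": "query",
    },
    {
        "rule_id": "LB-014",
        "description": "An HbA1c result requires a collection date.",
        "expression": "hba1c.notnull() and hba1c_date.isnull()",
        "severity": "query",
    },
    {
        "rule_id": "AE-003",
        "description": "An adverse event marked serious must have an outcome recorded.",
        "expression": "ae_serious == 'Y' and ae_outcome.isnull()",
        "severity": "escalate",
    },
]

def run_rule_library(df: pd.DataFrame, library=EDIT_CHECK_LIBRARY) -> pd.DataFrame:
    """Evaluate every rule and return one row per violation, ready for query workflow."""
    findings = []
    for rule in library:
        hits = df.query(rule["expression"], engine="python")
        for idx in hits.index:
            findings.append({"row": idx, **rule})
    return pd.DataFrame(findings)
```

Because each expression sits next to its description, a medical monitor or data manager can read, challenge, and evolve the rules without digging through pipeline code.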
Risk-Based Cleaning Is Now A Regulatory Expectation
The FDA’s adoption of ICH E6(R3) incorporates flexible, risk-based approaches and embraces innovations in trial design, conduct, and technology, marking what the agency describes as a significant evolution in the global clinical trial landscape. The guideline calls for oversight proportionate to risk, moving away from one-size-fits-all monitoring in favor of centralized monitoring, targeted oversight, and adaptive approaches, while strengthening expectations for data governance, including audit trails, metadata, traceability, and secure system validation.
According to the Association of Clinical Research Professionals, ICH released E6(R3) in January 2025, with FDA adoption following in September 2025, reflecting modern approaches including decentralized trials, electronic data collection, and risk-based monitoring. The EMA’s adoption took effect in July 2025, and Health Canada followed in April 2026. These are not aspirational guidelines. They are the new operating environment.
Translating that into practice means adopting a tiered framework that maps directly to ICH E6(R3)’s risk-proportionality principles. Tier one covers critical data – primary endpoints and safety variables – and requires real-time edit checks, frequent cross-site review, and zero tolerance for unresolved discrepancies at interim analysis. Tier two covers important secondary endpoints and key supporting variables, warranting periodic review and targeted queries when trends emerge. Tier three covers non-critical data points that need minimal intervention unless patterns develop across sites or time windows. This model is not a shortcut; it is a deliberate allocation of finite review capacity toward the data that actually determines what a trial can claim.
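As a sketch of how that tiering can be captured as configuration rather than tribal knowledge, the example below maps illustrative variables to review policies. The cadences, tolerances, and variable names are assumptions, not values from any specific protocol.

```python
from dataclasses import dataclass

@dataclass
class TierPolicy:
    review_frequency: str            # how often centralized review runs
    realtime_edit_checks: bool       # fire automated checks at data entry
    max_missing_pct: float           # tolerated missingness at interim analysis
    unresolved_discrepancies_ok: bool

# Illustrative mapping of risk tiers to review policy, following E6(R3) proportionality.
TIER_POLICIES = {
    1: TierPolicy("continuous", True, 1.0, False),   # primary endpoints, safety variables
    2: TierPolicy("biweekly", False, 5.0, True),      # key secondary endpoints
    3: TierPolicy("monthly", False, 10.0, True),      # non-critical fields
}

# Illustrative assignment of study variables to tiers (hypothetical names).
VARIABLE_TIERS = {
    "hba1c": 1,
    "serious_adverse_event": 1,
    "fasting_glucose": 2,
    "concomitant_medication": 2,
    "smoking_history": 3,
}

def policy_for(variable: str) -> TierPolicy:
    """Look up the review policy that applies to a given variable."""
    return TIER_POLICIES[VARIABLE_TIERS.get(variable, 3)]
```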
Consider a Phase II study tracking HbA1c as a primary endpoint. A sponsor operating under this framework might require less than one percent missingness and zero unresolved discrepancies at each interim checkpoint. A decentralized trial collecting wearable heart rate data, meanwhile, might set timestamp alignment as a tier-one check – because a systematic misalignment between device clocks and EDC timestamps, if caught late, can invalidate endpoint derivation entirely. Catching it early via centralized anomaly review is the difference between a correction and a crisis.
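A minimal sketch of both checks follows: an interim missingness gate for the primary endpoint and a centralized device-clock alignment review. The column names and the 30-minute tolerance are assumptions for illustration.

```python
import pandas as pd

def hba1c_missingness_ok(edc: pd.DataFrame, threshold_pct: float = 1.0) -> bool:
    """Interim gate: primary-endpoint missingness must stay under the tolerance."""
    return float(edc["hba1c"].isna().mean() * 100) < threshold_pct

def check_device_clock_alignment(wearable: pd.DataFrame,
                                 edc_visits: pd.DataFrame,
                                 tolerance_minutes: float = 30.0) -> pd.DataFrame:
    """Flag subjects whose device timestamps drift systematically from EDC visit times.

    Assumes hypothetical columns: wearable[subject_id, visit_num, device_ts] and
    edc_visits[subject_id, visit_num, visit_ts], both already timezone-normalized.
    """
    merged = wearable.merge(edc_visits, on=["subject_id", "visit_num"], how="inner")
    merged["offset_min"] = (merged["device_ts"] - merged["visit_ts"]).dt.total_seconds() / 60.0

    # A consistent median offset per subject points to a device clock problem rather than
    # random jitter; that is the failure mode that can invalidate endpoint derivation.
    per_subject = merged.groupby("subject_id")["offset_min"].median().rename("median_offset_min")
    return per_subject[per_subject.abs() > tolerance_minutes].reset_index()
```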
Who Owns What – And Why That Question Matters
Teams often agree that data quality is everyone’s responsibility, which in practice means it belongs to no one. Clearer ownership changes that. Data management defines the business rules, oversees query workflows, and maintains the Data Management Plan. Clinical operations owns site performance, query resolution timelines, and training. Biostatistics identifies the critical variables and sets tolerance thresholds for missingness and discrepancy rates at interim and final analysis. Medical monitors assess the clinical relevance of flagged discrepancies and escalate anything that touches patient safety. When those lanes are defined before enrollment opens, the feedback loops that make cleansing effective actually function.
Automated checks, integrated systems, audit trails, and centralized dashboards reduce manual effort while improving traceability. Research into AI-assisted data review has demonstrated that human-AI collaboration can improve both the speed and accuracy of medical data review compared to traditional methods, with large-scale trials generating millions of data points that make manual review alone increasingly unrealistic, according to arXiv. More advanced approaches, including anomaly detection and cross-system reconciliation, help teams manage the volume and velocity of today’s trial data.
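Cross-system reconciliation, for instance, can be as simple as an outer join between extracts that flags values present in only one system or disagreeing beyond a tolerance. The sketch below assumes EDC and central-lab extracts share subject and visit keys; the column names and tolerance are illustrative.

```python
import pandas as pd

def reconcile_edc_vs_lab(edc: pd.DataFrame, lab: pd.DataFrame,
                         tolerance: float = 0.1) -> pd.DataFrame:
    """Return records where the EDC-transcribed value disagrees with the central lab,
    or where a result exists in one system but not the other.

    Assumes hypothetical columns subject_id, visit_num, hba1c in both extracts.
    """
    merged = edc.merge(
        lab, on=["subject_id", "visit_num"], how="outer",
        suffixes=("_edc", "_lab"), indicator=True,
    )
    missing_in_one = merged["_merge"] != "both"
    value_mismatch = (merged["hba1c_edc"] - merged["hba1c_lab"]).abs() > tolerance
    issues = merged[missing_in_one | value_mismatch]
    return issues[["subject_id", "visit_num", "hba1c_edc", "hba1c_lab", "_merge"]]
```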
But technology cannot substitute for strategy. You cannot fix poor data quality with more sophisticated models. You get elegant wrong answers. Progress happens when organizations begin measuring quality with concrete metrics and clear thresholds – query rate per subject or CRF page, median query resolution time targeting under five days, percentage of critical variables with zero outstanding queries at interim checkpoints, missing data rates for primary endpoints, and protocol deviation rates tied to data issues. Dashboards that surface those metrics across sites in real time give clinical operations the visibility to intervene before problems compound.
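As an illustration, those metrics can be computed directly from routine query and CRF extracts; the column names and the five-day target below are assumptions rather than a standard.

```python
import pandas as pd

def quality_snapshot(queries: pd.DataFrame, crf: pd.DataFrame,
                     critical_fields: list) -> dict:
    """Compute headline data-quality metrics for a dashboard refresh.

    Assumes hypothetical columns:
      queries: subject_id, site_id, field, opened_date, resolved_date (NaT if still open)
      crf: subject_id, site_id, hba1c (primary endpoint), plus other collected fields
    """
    resolved = queries.dropna(subset=["resolved_date"])
    resolution_days = (resolved["resolved_date"] - resolved["opened_date"]).dt.days

    open_critical = queries[
        queries["resolved_date"].isna() & queries["field"].isin(critical_fields)
    ]

    return {
        "query_rate_per_subject": len(queries) / crf["subject_id"].nunique(),
        "median_resolution_days": float(resolution_days.median()),
        "resolution_within_5_day_target": bool(resolution_days.median() <= 5),
        "pct_critical_fields_with_zero_open_queries": 100.0 * (
            1 - open_critical["field"].nunique() / max(len(critical_fields), 1)
        ),
        "primary_endpoint_missing_pct": 100.0 * float(crf["hba1c"].isna().mean()),
    }
```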
A Monday-Morning Playbook
Strong data cleansing programs are not built on good intentions alone. They are built on documented decisions made before the first patient is enrolled. A practical starting point for any data management lead looks something like this.
First, define the critical data and processes tied directly to primary endpoints and safety variables; these are the non-negotiables that will drive regulatory review.
Second, map every data source in the study, including EDC, ePRO, labs, wearables, and EHR feeds, and identify the reconciliation points between them.
Third, build a data review plan that specifies frequency, named owners, and escalation thresholds for each tier of the risk framework.
Fourth, implement real-time edit checks for high-risk fields only, because over-querying sites is itself a data quality problem that creates noise and erodes site trust; a minimal sketch of such a check appears after the seventh step.
Fifth, establish the quality metrics that define what “good” looks like for this specific protocol.
Sixth, configure centralized review dashboards that allow cross-site trend monitoring without requiring per-site manual aggregation.
Seventh, document every decision in a Data Management Plan aligned with ICH E6(R3) expectations, because that document is both the operating guide and the audit trail.
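As a sketch of the fourth step, the snippet below fires an entry-time check only for tier-one fields, so lower-risk fields never generate automated queries. The field names, ranges, and tier mapping are illustrative assumptions.

```python
from typing import Optional

# Illustrative point-of-entry validation limited to tier-one fields. Field names,
# ranges, and the tier mapping are assumptions made for the sake of the sketch.
TIER_ONE_RANGES = {
    "hba1c": (3.0, 20.0),           # percent
    "systolic_bp": (60.0, 260.0),   # mmHg
    "weight_kg": (25.0, 300.0),
}

def validate_entry(field: str, value) -> Optional[str]:
    """Return a query message for out-of-range tier-one values, otherwise None."""
    if field not in TIER_ONE_RANGES:
        return None  # not a high-risk field: do not add query noise
    try:
        numeric = float(value)
    except (TypeError, ValueError):
        return f"{field}: value '{value}' is not numeric; please confirm against source."
    low, high = TIER_ONE_RANGES[field]
    if not (low <= numeric <= high):
        return f"{field}: {numeric} is outside the expected range {low} to {high}; please verify."
    return None
```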
Clean Data Is The Foundation Of Scientific Credibility
In the end, data cleansing is not just about removing errors. It is about protecting the scientific value of the trial, safeguarding patient safety, and ensuring the final dataset can withstand regulatory scrutiny. Inaccurate or unreliable data can lead to incorrect conclusions, potential harm to patients, and regulatory non-compliance, while efficient cleaning processes provide a solid foundation for decision-making and contribute to overall trial success.
The sponsors and CROs that treat cleansing as a designed-in discipline – not a remediation phase – will lock databases faster, absorb fewer inspection findings, and produce evidence that regulators and clinicians can actually use. In a world where trial complexity and data volume are only increasing, that is not a competitive advantage. It is the baseline.