Data Cleansing - The Process of Data Cleansing

  • Data auditing: The data is audited using statistical and database methods to detect anomalies and contradictions; the audit ultimately indicates the characteristics of the anomalies and where they occur. Several commercial software packages let you specify constraints of various kinds (using a grammar that conforms to that of a standard programming language, e.g., JavaScript or Visual Basic) and then generate code that checks the data for violations of these constraints. This process is described below under "Workflow specification" and "Workflow execution." For users who lack access to high-end cleansing software, microcomputer database packages such as Microsoft Access or FileMaker Pro also let you perform such checks interactively, one constraint at a time, in many cases with little or no programming.
  • Workflow specification: Anomalies are detected and removed by a sequence of operations on the data known as the workflow. The workflow is specified after the data has been audited and is crucial to producing high-quality data. To specify a proper workflow, the causes of the anomalies and errors in the data have to be closely considered.
  • Workflow execution: In this stage, the workflow is executed once its specification is complete and its correctness has been verified. The implementation should be efficient even on large data sets, which inevitably involves a trade-off, because data-cleansing operations can be computationally expensive.
  • Post-processing and controlling: After executing the cleansing workflow, the results are inspected to verify correctness. Data that could not be corrected during execution of the workflow is manually corrected, if possible. The result is a new cycle in the data-cleansing process where the data is audited again to allow the specification of an additional workflow to further cleanse the data by automatic processing.
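
The four stages above can be sketched in a few lines of Python. This is only an illustrative sketch, not the behavior of any particular cleansing package: the `Constraint` class, the `audit` and `execute_workflow` functions, the two example constraints, and the sample records are all hypothetical names and data invented for this example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Record = dict  # one row of data, keyed by column name


@dataclass
class Constraint:
    """A named rule plus an optional automatic repair (hypothetical helper)."""
    name: str
    check: Callable[[Record], bool]  # True when the record satisfies the rule
    repair: Optional[Callable[[Record], Record]] = None


def audit(records, constraints):
    """Data auditing: list (row index, constraint name) for every violation."""
    return [(i, c.name)
            for i, rec in enumerate(records)
            for c in constraints
            if not c.check(rec)]


def execute_workflow(records, constraints):
    """Workflow execution: apply each available repair, in order, to violators."""
    cleaned = []
    for rec in records:
        rec = dict(rec)  # work on a copy so the input is left untouched
        for c in constraints:
            if not c.check(rec) and c.repair is not None:
                rec = c.repair(rec)
        cleaned.append(rec)
    return cleaned


# Workflow specification: rules and repairs chosen after studying the anomalies.
constraints = [
    Constraint(
        "age_is_int",
        check=lambda r: isinstance(r.get("age"), int),
        # Repair only when the value is a purely numeric string.
        repair=lambda r: {**r, "age": int(r["age"])}
        if str(r.get("age", "")).strip().isdigit() else r,
    ),
    Constraint(
        "age_in_range",
        check=lambda r: isinstance(r.get("age"), int) and 0 <= r["age"] <= 120,
        # No repair: an out-of-range age cannot be fixed automatically.
    ),
]

records = [
    {"name": "Ann", "age": 34},
    {"name": "Bob", "age": " 41 "},  # numeric string: repairable
    {"name": "Cy", "age": 430},      # out of range: needs manual correction
]

before = audit(records, constraints)
cleaned = execute_workflow(records, constraints)
# Post-processing and controlling: re-audit the cleaned data.
remaining = audit(cleaned, constraints)
```

In this sketch, whatever is still listed in `remaining` after the second audit is what would go to manual correction or into the specification of a further workflow, mirroring the cycle described in the last bullet. Running every check on every record is the simplest possible design; on large data sets a real implementation would likely need something more efficient.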
