Data cleaning, otherwise known as data cleansing or data scrubbing involves detecting and correcting corrupt, inaccurate, incomplete, and irrelevant records in a record set, table or database. The coarse data is then replaced, modified, or deleted. Partnering with experienced data cleansing companies help businesses maintain complete, accurate and up-to-data database. Microsoft Excel is the most popular software used to store valuable data. The stored data must be clean and reliable. By cleansing data, business enterprises can boost their customer acquisition efforts, improved decision making process, streamline business practices, save a lot of money, and ensure that employees are making the best use of their work hours. Clean and actionable data prevent the staff from contacting customers with out-of-date information or creating invalid vendor files in the system. Since the records are clean, it helps maximize the staff’s efficiency and productivity, and improves their response rates.
According to a survey conducted by Pharma Intelligence and sponsored by Oracle Health Sciences, lack of confidence in clinical trial data cause delay in drug development. Besides trial delays, 81 percent of respondents pointed out data governance issues as the biggest challenge in meeting regulatory compliance. The survey respondents said that the top three data issues were duplicate data/inconsistent data, data quality and data integrity/traceability. 51 percent cited data completeness, 45 percent said data quality and 43 percent, data cleaning are the top three operational challenges that were with the clinical trial data. Therefore, companies should focus more on improving quality of their data.
Some Techniques to Clean Data in Excel
The following are some of the techniques that can be used to clean data in Excel
-
- Remove Extra Spaces– Extra spaces between words can be removed with the help of trim function.
Syntax: =TRIM(Text)This function takes one single argument which could either be the text which you type manually or it could be the cell reference. Trim function removes all the leading spaces and trailing spaces and extra spaces between words except one single space that is allowed.
Select and treat all blank cells – If you want to fill blank cells with ‘0’ or ‘Not Available’ or simply want to highlight it, select the entire data set; press F5, this opens the “Go To” dialogue box and click on special button at the bottom left. This again opens the “Go To Special” dialogue box. Then click blank and click ok. This selects all the blank cells in your data set at the same time. Now you have these cells in grey and the first cell is in white as this is the active cell. Then start typing “0” or “Not Available” in all these cells and hit “Control + Enter”.
- Remove Extra Spaces– Extra spaces between words can be removed with the help of trim function.
-
- Convert Numbers Stored as Text into Numbers – When data is imported from text files or external databases, numbers get stored as text. Some people use an apostrophe before a number to make it text. This can lead to errors if you are using these cells in calculations. Here is how to convert these numbers stored as text back into numbers. Type 1 in any blank cell and then select the cell where you typed 1 and press Control + C. After that, select the cell or range which you want to convert to numbers. Select Paste – Paste Special or press Alt + E + S. At that time “Paste Special” dialogue box appears and select “Multiply” from the box. Click OK and this converts all the numbers in text format back to numbers.
-
- Remove Duplicates – This involves highlighting duplicate data and deleting it. To highlight duplicate data, select the data and go to Home, select Conditional Formatting, click Highlight Cells Rules and select Duplicate Values. Specify the formatting and all the duplicate values get highlighted. To remove duplicates, select the data and Go to Data and click Remove Duplicates. If the data has headers, ensure that the checkbox at the top right is checked. Select the Columns from which duplicates have to be removed and click OK. This removes duplicate values from the list.
-
- Highlight Errors – You can highlight Errors in Data in Excel by using Conditional Formatting and Go To Special.Using Conditional Formatting – Select the entire data set. Then go to Home and click Conditional Formatting and select New Rule. In New Formatting Rule Dialogue Box select ‘Format Only Cells that Contain’ and in the Rule Description, select Errors from the drop down.Set the format and click OK. This highlights any error value in the selected dataset.
- Using Go To Special – Select the entire data set and press F5, this opens the Go To dialogue box and click on Special Button at the bottom left. Select Formulas and uncheck all options except Errors. This selects all the cells that have an error in it.
- Change Text to Lower/Upper/Proper Case – When a workbook or data is imported from text files, sometimes, all the text could be in lower/upper case or a mix of both. However, it can be made consistent by using these three functions:LOWER() – Converts all text into Lower Case.
UPPER() – Converts all text into Upper Case.
PROPER() – Converts all Text into Proper Case.
-
- Parse Data Using Text to Column – When data is obtained from a database or imported from a text file, it may happen that all the text is cramped in one cell. It is possible to parse this text into multiple cells by using Text to Column functionality in Excel. For that, select the data/text you want to parse, Go to Data, click Text to Column, this opens the Text to Columns Wizard.Select the data type and click “Delimited” if your data in not equally spaced, and is separated by characters such as comma, hyphen, dot and click Next. Select Delimiter (the character that separates your data) and select pre-defined delimiter or anything else using the Other option and click Next. Then select the data format and the destination cell. If destination cell is not selected, the current cell is overwritten.
-
- Spell Check – In order to run a spell check for your data set, use the keyboard shortcut F7.
-
- Delete all Formatting – Select the data set and go to Home and click Clear and select Clear Formats.
-
- Use Find and Replace to Clean Data in Excel – Find and replace is crucial when it comes to data cleansing. It allows you to select and remove all zeros, change references in formulas, find and change formatting etc.
The above techniques will ensure clean or accurate data in your Excel sheets. Data cleansing tools such as OpenRefine, WinPure, Trifacta Wrangler, Data Ladder IBM Infosphere Quality Stage also can help keep your data clean. A practical way to enhance your data integrity and work with accurate data is to utilize data cleansing services provided by a reliable data cleansing company.