Data is one of the most important assets in the 21st century. Going forward, nearly every business transaction and engineering decision will be based on complex data sets.
In addition, it seems like everything in our society is producing a significant amount of data. So how do we manage all of it to make the best decisions possible? Controlling how data is captured and validated is a critical first step.
In this article, we discuss both data entry best practices and data validation best practices.
What is Data Entry?
Data entry is simply the act of capturing data. Traditionally, the term has referred to the manual capture of data. However, its meaning is evolving as automation becomes more common. Today, some data is captured automatically, but a great deal is still entered manually.
Data Entry Management
Automating the capture of data significantly decreases input errors. Automation also allows us to validate data as it's captured. However, even a small error in an automated capture system can cause significant problems in our data sets (more on that later). Therefore, we should design our automated systems to be as error free as possible.
In contrast, entering data manually is a time consuming process, which also makes it expensive, and it is prone to human error. Even though we have lots of automation tools at our fingertips, a lot of data input is still done manually.
Human error is a significant challenge for data capture and validation. For this reason, I'm a huge advocate for automation.
Data Entry Error and the Importance of Data Accuracy
Data entry error is a significant problem. The integrity of our data sets is vital for the accuracy of decision making based on that data. Without sufficient data accuracy, significant mistakes can be made. Ensuring that data is captured accurately is critical.
For this reason, it's important to implement best practices when capturing data. We'll cover best practices below.
What is Data Validation?
Data validation is the set of processes that ensure data is of the highest quality and retains its original meaning. In essence, its purpose is to prevent data entry error.
Quite simply, data validation is about making sure that data is correct and useful. This objective is accomplished by having validation rules and constraints when data is initially captured and stored. Data validation software can make this a lot easier.
Ultimately, data validation is intended to check for the correctness and meaningfulness of the data being captured.
Data Entry Best Practices
Let's discuss some practical data entry tips and tricks.
1. Use a Data Standard
One important practice for good data entry is implementing good data standards. Errors in data entry are greatly reduced when good standards are used. It's fairly easy to identify and implement data entry accuracy standards that make sense for your situation.
For example, here are a few questions that should be addressed by a standard:
- What types of data are going to be entered?
- How familiar are the operators with the data?
- Which special characters will be used for missing values?
- How will data be formatted?
- What is the best order for the data columns?
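The questions above can be answered in one place by writing the standard down as code. Here's a minimal sketch for a hypothetical customer-records data set; the column names, types, and sentinel value are illustrative assumptions, not a prescribed schema:

```python
# A data standard captured in code: column names, expected types, format
# notes, and the agreed sentinel for missing values (all hypothetical).
DATA_STANDARD = {
    "customer_id": {"type": int, "required": True},
    "signup_date": {"type": str, "format": "YYYY-MM-DD", "required": True},
    "height_cm":   {"type": float, "required": False},
}
MISSING_VALUE = "NA"                 # agreed marker for missing entries
COLUMN_ORDER = list(DATA_STANDARD)   # fixed column order for every file
```

Keeping the standard in a single structure like this means every operator and every script reads the same rules, instead of each person making their own assumptions.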
2. Implement Data Entry Rules
Next, creating a set of rules and restrictions on data entry is important. This practice prevents operators from making input mistakes and helps to minimize data entry error.
One example is verifying that the data type (string, integer, Boolean, etc.) is correct before data is accepted. Another example is verifying that the data that was entered matches certain format requirements. We've all seen this in practice: you enter an email address in a website form and get an error message because you forgot the @ sign.
Another example is that certain inputs may only accept integer values and not text. If an operator makes a mistake and tries to input the wrong type of data, the system will give a warning and prevent this from happening.
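Both checks can be sketched with a small validation function. This is a minimal illustration, not a production validator, and the email pattern is a deliberately simple assumption:

```python
import re

def validate_entry(value, expected_type, pattern=None):
    """Reject input with the wrong type or a failed format rule."""
    if not isinstance(value, expected_type):
        return False
    if pattern is not None and not re.fullmatch(pattern, value):
        return False
    return True

# An integer-only field rejects text...
validate_entry("abc", int)   # False
validate_entry(42, int)      # True

# ...and an email field rejects a value missing the @ sign.
EMAIL_RULE = r"[^@\s]+@[^@\s]+\.[^@\s]+"
validate_entry("user.example.com", str, EMAIL_RULE)  # False
validate_entry("user@example.com", str, EMAIL_RULE)  # True
```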
In Microsoft Excel and Access, formatting cells and inputs is easy.
3. Use Descriptive Column Names
Another good practice is using descriptive names for the columns in a data table. In other words, make sure you label your column headings properly. The headings should be descriptive enough so that everyone that uses the data knows exactly what the column is for.
Using good headings and labels becomes even more critical the larger and more complex the data set becomes.
Imagine having hundreds of columns. Now imagine if most of the columns have vague headings. In that situation, we greatly increase the probability that different operators will interpret the data set differently. Additionally, the data that is inputted by different operators can vary, contaminating the data set.
In essence, using clearly defined column names can prevent potential problems later down the road.
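As a small sketch, vague headings can be mapped to descriptive ones before any analysis begins. The column names here are hypothetical:

```python
# Map vague headings to descriptive ones (illustrative names only).
RENAME = {
    "val1": "height_cm",
    "val2": "weight_kg",
    "dt":   "measurement_date",
}

def rename_headers(headers):
    """Return headers with any vague name swapped for its descriptive form."""
    return [RENAME.get(h, h) for h in headers]

rename_headers(["id", "val1", "val2", "dt"])
# → ["id", "height_cm", "weight_kg", "measurement_date"]
```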
4. Use Error Detection
Implementing error detection early on for data entry can save a lot of time and money later on. As we accumulate more and more data that is not standardized and formatted properly, we also increase the amount of time and money it takes to clean up the mess.
Data entry accuracy is paramount to any data system, as you want to avoid data entry error as much as possible. Think of it this way: every time data is captured with an error, there's a direct future cost associated with it. Depending on the situation, this cost could be less than a penny or thousands of dollars.
With smaller data sets, the cost of errors during data capture is usually (though not always) minimal. As data sets become larger, however, the cost of errors can grow exponentially. It's typically a lot cheaper (and easier) to implement error detection early on.
Because of this reason, we should ensure that we have capabilities in place to detect errors before we even start data inputs. Otherwise, the errors that we have to deal with during the data processing and analysis phases may lead to enormous extra costs.
Creating automated data entry systems is a great way to improve data integrity. Automated systems allow rapid processing of large volumes of data. They can also save you a lot of money. Moreover, automation can prevent human error.
However, if not designed properly, automated systems may also cause significant data errors in our database.
One of the best approaches is to automate certain aspects of the data entry process, yet still use an operator to monitor and handle the more sensitive and complex data entry steps.
A well-designed data entry automation system allows one operator to do the work of 10+ with more data accuracy and data integrity. I've achieved this result numerous times.
Data Validation Best Practices
Typically, data validation takes place after data entry has occurred. However, a well designed data capture process will include validation during the data entry process. That way you can give instant feedback to the operator. Some of the best practices for data entry discussed above also included data validation concepts.
Data Validation vs Data Verification
Overall, some examples of data validation include data type validation, cross-reference verification, cleaning, reformatting, and reshaping. There's a lot of overlap between data validation and data verification, which causes confusion. Broadly, validation checks that data conforms to rules as it is entered, while verification checks that captured data matches its original source.
1. Manage Blank Fields
Null, or empty fields can cause a lot of problems in your data set. Therefore, handling empty fields is critical.
For example, set strict rules for missing values, as this will make your data much more robust and durable. A great practice is to mark missing values with NA or Null. This approach makes data analysis much easier, especially when dealing with numbers.
A common mistake is to use 0 for a blank field. This can cause significant issues if that field is expecting a number value. Imagine trying to do statistics on a data set that has lots of 0’s that were not really 0's. It can greatly change the results!
Using values like “0”, “-9” or “9999” for empty fields is a major mistake. Here’s why: Analysis tools and methods are going to treat these values as valid numeric values, which will skew the results.
Here's a quick example. Let's say we have a series of numbers in our data: [2, 5, 8, 0, 4, 2, 9, 5, 6, 3, 7, 5, 4, 8, 0]. If we calculate the mean of this series, we get 4.53. In essence, the two 0's are calculated into the mean.
However, what if the two 0's were actually empty fields (NA/Null) and 0 was input as a default? Now let's recalculate the mean on the same series of numbers: [2, 5, 8, Null, 4, 2, 9, 5, 6, 3, 7, 5, 4, 8, Null]. The correct mean value is 5.23 (not 4.53)! That's an error of roughly 15%.
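The arithmetic can be checked directly with Python's standard library. This sketch uses `None` to stand in for Null:

```python
from statistics import mean

with_zeros = [2, 5, 8, 0, 4, 2, 9, 5, 6, 3, 7, 5, 4, 8, 0]
with_nulls = [2, 5, 8, None, 4, 2, 9, 5, 6, 3, 7, 5, 4, 8, None]

# Treating the defaults as real zeros skews the mean downward.
wrong = mean(with_zeros)                              # ≈ 4.53
# Excluding the nulls recovers the correct mean.
right = mean(v for v in with_nulls if v is not None)  # ≈ 5.23
```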
This is just a simple example with a small set of data. Errors like that can magnify significantly. In addition, when dealing with a large data set, a tiny data input error can result in serious problems.
This same logic can be used for string and text files as well. How you handle empty data input is critical to the health of your data sets, so manage your blank fields properly.
2. Ensure Good Data Cleaning
Data cleaning is one of the most important parts of data analysis. The majority of captured data needs some sort of massaging. In addition, because data cleaning can be a complex, situation-dependent process, it can be hard to automate.
An example of data cleaning is detecting missing values in your data set and normalizing them. Overall, there are different methods to deal with missing numeric vs. string values.
For missing numeric values, a common practice is to fill missing data with the mean of its column. By filling the missing values with the mean, we keep the rest of the rows and do not lose any information. In contrast, another way of handling missing values is to remove the rows with the missing values completely.
However, removing rows has clear disadvantages. Here's why: this approach causes us to lose some of our data, which may be very important for our analysis. In data science and machine learning situations, it also leaves us with smaller training and test sets, and the removed rows may introduce significant biases into the results.
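Both approaches can be sketched in a few lines. The records below are hypothetical, with `None` standing in for a missing value:

```python
from statistics import mean

rows = [
    {"id": 1, "height_cm": 180.0},
    {"id": 2, "height_cm": None},   # missing value
    {"id": 3, "height_cm": 170.0},
]

# Option 1: fill missing values with the column mean (keeps every row).
observed = [r["height_cm"] for r in rows if r["height_cm"] is not None]
col_mean = mean(observed)           # 175.0
filled = [
    {**r, "height_cm": col_mean if r["height_cm"] is None else r["height_cm"]}
    for r in rows
]

# Option 2: drop rows with missing values (row 2 is lost entirely).
dropped = [r for r in rows if r["height_cm"] is not None]
```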
3. One Data Point Per Column
Another important practice is making your data tables follow the concept of First Normal Form (1NF).
Quite simply, First Normal Form is a term from relational database design. Capturing data and validating it appropriately becomes critical as we scale our data up.
The main takeaway of First Normal Form is that for each data column you have, only allow one data point to be in it. Otherwise, processing and analyzing data becomes a lot more challenging later on.
In addition, if 1NF is not practiced, creating data tables that are connected (relational databases) becomes very inefficient if not error prone.
Quick Example of 1NF
Let's walk through a quick example to illustrate the importance of First Normal Form. In this example, let's say you have a database for your sales leads. In one of the tables, you have a column for the phone # for each sales lead. Each row or cell should have a single phone #, right?
If you said yes, then you're practicing 1NF. When capturing multiple phone #'s for each contact, use additional columns. For example, create separate columns for home #, cell #, and work #.
If instead, you added 3 phone #'s in the same column, you greatly complicate data processing and reduce data integrity. This will cause lots of headaches and problems later on.
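A minimal sketch of the fix, assuming a hypothetical lead record with exactly three semicolon-separated numbers crammed into one field:

```python
# Non-1NF: multiple phone numbers in a single column (hypothetical data).
bad_row = {"lead": "Ada Lovelace", "phone": "555-0100; 555-0101; 555-0102"}

def normalize_phones(row):
    """Split a multi-valued phone field into one column per number (1NF)."""
    home, cell, work = (p.strip() for p in row["phone"].split(";"))
    return {"lead": row["lead"], "home_phone": home,
            "cell_phone": cell, "work_phone": work}

normalize_phones(bad_row)
# → {'lead': 'Ada Lovelace', 'home_phone': '555-0100',
#    'cell_phone': '555-0101', 'work_phone': '555-0102'}
```

In a real database you'd design the table this way from the start, so the split never has to happen.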
It's just smart to honor First Normal Form best practices in any database management system. Implementing Second and Third Normal Form depends on First Normal Form being done correctly. If we cannot ensure that First Normal Form is practiced, our data sets may end up being useless.
Setting rules during the data entry process and training operators ensures that your data will be of the highest quality. Again, breaking the rule of First Normal Form is much harder to fix later on. This issue is the most noticeable with string or text data.
4. Fix Problems When Data Sets Are Small
An important lesson that I've learned for data validation is ensuring that you fix problems when your data set is still small.
If a data set only has a few hundred rows, problems may not be that noticeable. However, once we start having thousands or millions of data rows, manual checks and validation becomes very expensive.
So fix the problems and implement best practices while your data sets are still small! Preferably fix the issues before data entry even begins. Testing is key.
5. Enforce Specific Units for Numbers
Using inconsistent units is a serious problem for data input. Data validation of units is important to ensure that accuracy is maintained. For example, if we capture the height of people in both inches and cm, we'll run into significant issues or errors during analysis.
This one seems obvious, but it's a very common mistake. Operators will make their own assumptions about data inputs, unless they have strict rules.
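One way to remove that ambiguity is to normalize every reading to a single unit before it's stored. A minimal sketch, using height as the hypothetical field:

```python
CM_PER_INCH = 2.54

def to_cm(value, unit):
    """Normalize a height reading to centimeters before it is stored."""
    if unit == "cm":
        return value
    if unit == "in":
        return value * CM_PER_INCH
    # Reject anything outside the agreed units rather than guessing.
    raise ValueError(f"unknown unit: {unit!r}")

to_cm(70, "in")    # ≈ 177.8
to_cm(178, "cm")   # 178
```

Raising an error on an unknown unit forces the problem to surface at entry time, instead of silently contaminating the data set.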
6. Use Standard Data Formatting
Managing stored data is another thing to consider. After entry and validation, save data in a format that any application can read.
Avoid using any proprietary format. ASCII or Unicode text in open, uncompressed formats is very useful.
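For example, a plain UTF-8 CSV file can be read by virtually any tool. A minimal sketch with hypothetical rows:

```python
import csv

rows = [["customer_id", "height_cm"],
        ["1", "180.0"],
        ["2", "NA"]]           # agreed sentinel for a missing value

# Plain UTF-8 CSV: open, uncompressed, and readable by any application.
with open("clean_data.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f).writerows(rows)
```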
7. Use Only One Database
After data validation is complete, store data in a single location consistently. Avoid using different tables and sheets during your data entry processes.
Otherwise, transferring data from one table or sheet to another, or combining separate tables may cause some unexpected errors or even result in losing critical data.
Additionally, data entry and validation processes on the front end should occur in an isolated way. By keeping it isolated, there’s no risk of contaminating or deleting data in your master table or database.
Also, do not allow simultaneous updates when data entry and validation are taking place. Otherwise, you may end up creating duplicates or creating other problems.
8. Use ACID on Your Database!
ACID is great for any database management system. Let me explain: using ACID simply means having Atomic, Consistent, Isolated, and Durable (ACID) database design and practices.
Design a database that uses these principles before you even start your data entry processes if possible.
- Atomic practices ensure that all updates in a transaction happen, or none happen at all.
- Consistent practices ensure that there are no violations of integrity rules; every transaction moves the database from one valid state to another.
- Isolated practices do not allow a second process to access data until the first process or update is done, so concurrent transactions never see each other's intermediate state.
- Durable practices ensure that once an update is committed, it survives crashes and power failures.
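Atomicity is easy to see with SQLite, which ships with Python and is fully ACID-compliant. This sketch simulates a fault halfway through a two-step transfer; the table and values are hypothetical:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('a', 100), ('b', 100)")
conn.commit()

# Atomicity: both updates commit together, or the rollback undoes both.
try:
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE name = 'a'")
    raise RuntimeError("power failure mid-transfer")   # simulated fault
    conn.execute("UPDATE accounts SET balance = balance + 50 WHERE name = 'b'")
    conn.commit()
except RuntimeError:
    conn.rollback()   # 'a' is back at 100; no half-finished transfer remains

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
# → {'a': 100, 'b': 100}
```

Without the transaction, the database would be left with 50 missing: debited from 'a' but never credited to 'b'.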
Conclusion
In this article, we discussed best practices for both data entry and data validation. While the list above is not all-inclusive, it provides a great summary of factors to consider.
Additionally, it's wise to consider the pros and cons of both automated and manual data entry systems. Every situation is different, and oftentimes a hybrid system is best.
Also think about the size of your data set, cost, and time. Allow the best practices above to help develop the best plan for your situation.