Five basic data hygiene checks most companies neglect

Data can be the beating heart of a company. Whether it informs operational performance or customer insights, it is an essential tool for better decision-making, regardless of industry.

By Steve Salvin
January 12, 2024
Posted in Artificial Intelligence, Technology

Yet data hygiene falls by the wayside for many companies, compromising data quality at scale. This is down to the sheer volume of information that companies hold, and the fact that humans are largely responsible for its capture and storage. According to research conducted by Gartner, 40% of company data is inaccurate and incomplete.

Inaccurate or incomplete data can at worst lead to flawed analysis, lost revenue and even regulatory penalties. The old adage holds true – garbage in means garbage out. That’s where data hygiene comes in. It’s a critical practice for business decisions which hinge on the accuracy of data and its resulting insights, and for successful adoption of AI.

Here are five critical checks that companies should regularly conduct to ensure their data is in optimal shape.

Review your data permissions

Visible information is sensitive information. Yet, given how much data companies accumulate, employers might be surprised to learn what information is sitting under the surface, waiting to be dug up by employees who have no business doing so. For workforces who have AI at their fingertips, anything that isn’t stored or permissioned correctly can now also be found much more easily.

To avoid data security risks, business leaders should check who has access to what data currently, and update the relevant permissions so that only staff members who are authorised to view different data sets can do so. Where information isn’t required by individuals, data encryption offers an additional layer of protection.

Automating this process offers businesses an up-to-date view of their data, and an easier way of managing information and permissions at scale across different systems and sources. If you’re working with limited, manual resources, sensible data management processes and robust, regular checks are vital.
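As a rough illustration of what an automated permissions check could look like, the sketch below compares current access grants against the roles authorised for each data set and flags any mismatch. The `Grant` type, the role names and the data set names are all hypothetical, and a real system would pull this information from its own directory or access-control service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Grant:
    """One access grant: a user who can currently view a data set (hypothetical shape)."""
    user: str
    dataset: str


def find_excess_grants(grants, user_roles, authorised_roles):
    """Return the grants held by users whose role is not authorised for that data set."""
    excess = []
    for grant in grants:
        role = user_roles.get(grant.user)  # None if the user has no known role
        allowed = authorised_roles.get(grant.dataset, set())
        if role not in allowed:
            excess.append(grant)
    return excess


# Illustrative data only
grants = [
    Grant("alice", "payroll"),
    Grant("bob", "payroll"),
    Grant("bob", "sales_leads"),
]
user_roles = {"alice": "hr", "bob": "sales"}
authorised_roles = {"payroll": {"hr", "finance"}, "sales_leads": {"sales"}}

flagged = find_excess_grants(grants, user_roles, authorised_roles)
for grant in flagged:
    print(f"Review access: {grant.user} can view {grant.dataset}")
```

Here only the sales user's access to the payroll data set is flagged for review. Data sets with no authorised roles listed are treated as closed to everyone, which errs on the side of caution.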

Make sure you have up-to-date permission to collect data from users and customers, too – before data you weren’t authorised to hold in the first place travels around the team. Under GDPR rules, businesses must determine and communicate how long they will hold data, and at what point they will ask contacts to refresh consents or update marketing preferences.
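A consent refresh check can be sketched in a few lines. The example below assumes each contact has a recorded consent date and that the business has chosen its own refresh interval; the 730-day figure is purely an assumption for illustration, not a period set by GDPR.

```python
from datetime import date, timedelta


def consents_due_for_refresh(consent_dates, today, max_age_days=730):
    """Return contacts whose consent was given before the refresh cutoff.

    max_age_days is a hypothetical company policy, not a legal requirement.
    """
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        contact for contact, given_on in consent_dates.items() if given_on < cutoff
    )


# Illustrative data only
consent_dates = {
    "a@example.com": date(2021, 3, 1),
    "b@example.com": date(2023, 11, 20),
}
due = consents_due_for_refresh(consent_dates, today=date(2024, 1, 12))
print(due)
```

Running a check like this on a schedule gives the team a list of contacts to re-approach before their data is used again.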

Check your data is properly labelled

Data labelling is the process of adding data (known as “metadata”) to summarise and organise existing data. Data which is grouped, categorised and described in a consistent way can more easily be found and made use of by employees who need it – even if it is stored outside of the systems and teams that individuals usually work in.

Automated data labelling can be particularly useful for enterprises that handle complex data at scale, across a number of different sources. This way, enterprises reduce their risk of developing siloed, ‘dark’ data, which cannot be found or used to derive insights for decision making. Effective, enterprise-wide data labelling, by contrast, ensures that teams can reap the full benefits that their data can offer.

If AI is being used in the business, data labelling also ensures that models can make effective use of relevant information – both for training purposes and to generate accurate insights. Using public AI models in particular isn’t advisable, due to the security risks involved. But if companies do choose to grant ChatGPT or other public models access to their data, they can at least fence off information which is classified as “sensitive” through proper data labelling.
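To show how labels can fence off information, here is a minimal sketch that filters a set of documents before anything is passed to an AI model. The document structure, the label names and the choice to hold back unlabelled documents are all assumptions made for the example.

```python
def shareable_with_model(documents, blocked_labels=frozenset({"sensitive"})):
    """Keep only documents that are labelled and carry no blocked label.

    Unlabelled documents are held back too, since nobody has yet
    confirmed what they contain.
    """
    shareable = []
    for doc in documents:
        labels = set(doc.get("labels", ()))
        if labels and not (labels & blocked_labels):
            shareable.append(doc)
    return shareable


# Illustrative data only
documents = [
    {"id": "handbook", "labels": ["hr", "public"]},
    {"id": "salaries", "labels": ["hr", "sensitive"]},
    {"id": "scan_0042", "labels": []},
]
allowed_ids = [doc["id"] for doc in shareable_with_model(documents)]
print(allowed_ids)
```

Only the handbook passes the filter. The salary file is blocked by its label, and the unlabelled scan is excluded until someone classifies it.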

Properly labelling data also enables companies to prove that their activities are compliant, find the information needed to handle data subject rights requests within tight timelines, and ensure that data is deleted beyond a set retention period. As regulation in this space evolves, it’s essential that these systems and processes are in place and operating effectively so that businesses can respond quickly.
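Once records carry a label, a retention check becomes a simple lookup. The sketch below assumes a hypothetical table of retention periods per label; the figures are invented for illustration and are not legal guidance.

```python
from datetime import date

# Hypothetical retention periods in days, keyed by label
RETENTION_DAYS = {"invoice": 2555, "cv": 180}


def past_retention(records, today):
    """Return the ids of records held longer than the period for their label."""
    overdue = []
    for record in records:
        limit = RETENTION_DAYS.get(record["label"])
        if limit is None:
            continue  # no period defined for this label: leave for manual review
        if (today - record["created"]).days > limit:
            overdue.append(record["id"])
    return overdue


# Illustrative data only
records = [
    {"id": "cv-17", "label": "cv", "created": date(2023, 1, 10)},
    {"id": "inv-203", "label": "invoice", "created": date(2020, 6, 1)},
    {"id": "misc-9", "label": "unclassified", "created": date(2015, 1, 1)},
]
overdue = past_retention(records, today=date(2024, 1, 12))
print(overdue)
```

Note that the unclassified record is skipped rather than deleted, which is exactly why consistent labelling matters: without a label, the check cannot say whether the record should still be held.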

Watch out for duplicate data

Duplicate data is perhaps more common than you’d expect. This can happen when employees copy a document to create a second version, with minor changes, or when different employees create new records for customers that already exist in the database. It can also happen when data is integrated from different source systems that bypass data validation rules.

Duplicate information puts businesses at risk of skewing or confusing the insights they draw from their data, and of fuelling new AI models with inaccurate or outdated information.

Individual staff members are also at risk of making errors because of duplicate data. For example, when multiple versions of the same company policy document exist, employees may accidentally refer to the outdated version. AI runs the same risk if it mistakenly refers to an old, duplicate version of an employee handbook.

In addition, duplicate information wastes cloud storage space, which could result in your company spending more than it needs to on storage and on managing its IT estate.
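Two simple duplicate checks can be sketched as follows. The first groups files whose contents are byte-for-byte identical by hashing them with SHA-256; the second groups customer records that share an email address once case and stray spaces are ignored. File names, record fields and the choice of email as the matching key are assumptions for the example.

```python
import hashlib
from collections import defaultdict


def group_exact_duplicates(files):
    """Group file names whose contents hash to the same SHA-256 digest."""
    by_digest = defaultdict(list)
    for name, content in files.items():
        by_digest[hashlib.sha256(content).hexdigest()].append(name)
    return [sorted(names) for names in by_digest.values() if len(names) > 1]


def group_duplicate_customers(customers):
    """Group customer ids that share an email after trimming and lowercasing."""
    by_email = defaultdict(list)
    for customer in customers:
        key = customer["email"].strip().lower()
        by_email[key].append(customer["id"])
    return [ids for ids in by_email.values() if len(ids) > 1]


# Illustrative data only
files = {
    "policy.docx": b"Leave policy v1",
    "policy (copy).docx": b"Leave policy v1",
    "policy_v2.docx": b"Leave policy v2",
}
customers = [
    {"id": 1, "email": "Jo@Example.com "},
    {"id": 2, "email": "jo@example.com"},
    {"id": 3, "email": "sam@example.com"},
]

duplicate_files = group_exact_duplicates(files)
duplicate_customers = group_duplicate_customers(customers)
print(duplicate_files)
print(duplicate_customers)
```

Hashing only catches exact copies. A second version of a document with minor changes, like `policy_v2.docx` here, produces a different digest and would need a similarity-based comparison, which is beyond this sketch.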

Check for ‘dark’ data

Dark data refers to data that’s been collected but then ‘lost’ or not stored properly, making it impossible to derive insights from it or use it to inform decision making. At Aiimi, we’ve found that 73% of the information stored by companies is never used again after the day it was created, on average.

There are three main issues when it comes to dark data. Firstly, it takes up storage space that ends up costing you more money. Secondly, it means you miss out on valuable insights. True value comes from making connections between known and unknown pieces of information in a specific context, providing insight that would otherwise have been missed. And thirdly, dark data poses a genuine risk to your organisation and your customers. If you don’t know that data exists, you can’t know what kind of sensitive content it contains, or that it’s been stored safely with the proper permissions. Automating data indexing and enrichment ensures that organisations get the full data picture – and aren’t left in the dark.
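A first pass at spotting dark data could look like the sketch below, which lists items that have not been opened since the day they were created. It assumes an index that already records a creation date and a last-accessed date for each item; the field names and paths are hypothetical, and building that index across many systems is the hard part in practice.

```python
from datetime import date


def never_reused(items):
    """Return paths of items whose last access is no later than their creation day."""
    return [item["path"] for item in items if item["last_accessed"] <= item["created"]]


# Illustrative data only
items = [
    {"path": "/finance/q3-forecast.xlsx", "created": date(2023, 9, 1), "last_accessed": date(2023, 12, 4)},
    {"path": "/scans/img_0042.pdf", "created": date(2022, 5, 17), "last_accessed": date(2022, 5, 17)},
]
dark = never_reused(items)
print(dark)
```

A list like this is only a starting point: each flagged item still needs to be reviewed for sensitive content, labelled, and either put to use or deleted.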

Consider automating these checks to get the best from data

The sheer scale of the ever-increasing datasets being housed by companies can make it difficult to maintain data hygiene. Data is stored in hundreds of different corporate systems which talk different ‘languages’ and aren’t connected. Data may be interconnected in some areas, but what about data that sits outside of the company’s emails, OneDrive and SharePoint systems? Even within single systems, data may be stored under different names or use different key terms, created by employees operating in different countries or even in different departments. This makes it incredibly difficult to manually maintain data quality and consistency.

The most secure, speedy and effective solution for companies that store a lot of data is to automate data hygiene upkeep. Specialist, AI-driven tools can automate the process of getting data permissions in check, labelling data, and sorting duplicate data and dark data – and ensure that once this data is in order, it stays in order. There should be no need for companies to hand over their information at any point during this process if the AI tools being used are, like Aiimi’s, designed to give companies more, not less, control over their data. And, at the end of the day, data control is what data hygiene is all about.

So there you have it: the five key data checks every company should be hot on if they want to maintain good data hygiene, yield more accurate company insights and stay compliant.

Steve Salvin

Steve Salvin is the founder and CEO of Aiimi, a leading British AI company which he has bootstrapped since its launch in 2013. Their tech helps teams find, make sense of and retain control over their data, and is used by various FTSE100 companies as well as the likes of the FCA, PwC, and the UK government. Having worked in tech since the 80s, Steve is a serial entrepreneur and is passionate about building AI that empowers users and gives them more control.