Dirty Data, a stumbling block for Big Data

Although Big Data and its strategies are on everyone’s lips, the truth of the matter is that the underlying work of managing and analysing large volumes of data has been going on for many years. What is new is the added complexity deriving from the first three “V”s of the five upon which Big Data stands: Volume, Velocity, Variety, Veracity and Value.

As a result of the interconnected environment in which we live, companies are adopting these new tools and strategies for processing large volumes of data in order to adapt their offering to the demands of their customers or to identify and exploit new business opportunities. However, they also face the huge challenge of dirty data: databases with incorrect, incomplete, inaccurate, out-of-date or duplicated information.

As a consequence, organisations find that customer segmentation based on unclean data may assign erroneous indicators to a contact, which limits the validity of that segmentation or invalidates it completely, thereby affecting the development of their activity.

The veracity of data, a question of trust

In this context, it is worth examining the source of the possible inaccuracies in the data. To keep things simple, we shall consider only customer data. First, there is the erroneous data that results from unintentional mistakes made by users when entering it. In this case there is little more to say: anyone can make a mistake.

Another possible source would be the intentional entry of incorrect data, for illegal purposes (in order to anonymously access information or resources) or for criminal purposes (by means of identity theft). Interesting, but outside of our scope.

There are also other reasons for the existence of Dirty Data, such as customers and potential customers deliberately hiding their identity in order to avoid being identified by a company and escape possible harassment, or to be included in, or excluded from, certain target segments. And one more, which usually goes unnoticed: the obsolescence of the data itself. Streets are renamed, post codes are modified, some municipalities merge while others split, and so on.

So, how to move forward with Big Data?

There are very efficient solutions that can deliver significant savings in processing time and campaign costs. These are solutions based on standardisation criteria and duplicate detection, which allow data (name, address, national identification card number, telephone number, current account, etc.) to be validated and anomalies to be flagged for later processing, as sketched below. In Big Data environments, they are essential for validating the data before analysis.
As for verification, in addition to direct interaction with the customer, there are also specific solutions, always within the limits imposed by legislation.
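
By way of illustration only, the following Python sketch shows what a simple rule-based standardisation, validation and deduplication pass over customer records might look like. The field names, formats and rules (five-digit post codes, nine-digit phone numbers, deduplication on name plus address) are assumptions chosen for the example and do not describe any particular vendor’s product.

```python
import re

# Illustrative rules only: the formats below are assumptions for the example.
POSTCODE_RE = re.compile(r"^\d{5}$")   # e.g. five-digit Spanish post codes
PHONE_RE = re.compile(r"^\d{9}$")      # nine-digit phone numbers

def standardise(record: dict) -> dict:
    """Normalise free-text fields so that equivalent values compare as equal."""
    return {
        "name": " ".join(record.get("name", "").upper().split()),
        "address": " ".join(record.get("address", "").upper().split()),
        "postcode": record.get("postcode", "").strip(),
        "phone": re.sub(r"\D", "", record.get("phone", "")),
    }

def validate(record: dict) -> list:
    """Return the list of anomalies found in a standardised record."""
    issues = []
    if not record["name"]:
        issues.append("missing name")
    if not POSTCODE_RE.match(record["postcode"]):
        issues.append("invalid postcode")
    if not PHONE_RE.match(record["phone"]):
        issues.append("invalid phone")
    return issues

def deduplicate(records: list) -> list:
    """Keep the first occurrence of each standardised (name, address) pair."""
    seen, unique = set(), []
    for raw in records:
        rec = standardise(raw)
        key = (rec["name"], rec["address"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

if __name__ == "__main__":
    customers = [
        {"name": " ana  garcía", "address": "c/ mayor 1", "postcode": "28001", "phone": "600 123 456"},
        {"name": "ANA GARCÍA", "address": "C/ Mayor 1", "postcode": "28001", "phone": "600123456"},
        {"name": "Luis Pérez", "address": "av. sol 3", "postcode": "280", "phone": "abc"},
    ]
    for rec in deduplicate(customers):
        print(rec, "->", validate(rec) or "ok")
```

Run against those three records, the pass collapses the two spellings of the same contact into one and flags the anomalies in the third for later processing, which is exactly the kind of cleaning that should happen before any analysis.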

As a way of progressing further towards validated data, and always starting from the standardisation and deduplication solutions mentioned above, advanced customer analytics and artificial intelligence must also be taken into account, with the emergence of algorithms specialised in this area.
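
As a minimal sketch of how algorithmic techniques can go beyond exact matching, the snippet below scores the similarity between already standardised records using the Python standard library, flagging likely duplicates that differ only in spelling or abbreviation. The fields compared and the 0.85 threshold are assumptions for the example, not a statement about any specific product or specialised algorithm.

```python
from difflib import SequenceMatcher
from itertools import combinations

def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()

def likely_duplicates(records: list, threshold: float = 0.85) -> list:
    """Return pairs of records whose name-plus-address similarity exceeds
    the threshold, i.e. candidates for manual or automated review."""
    pairs = []
    for r1, r2 in combinations(records, 2):
        score = similarity(r1["name"] + " " + r1["address"],
                           r2["name"] + " " + r2["address"])
        if score >= threshold:
            pairs.append((r1, r2, round(score, 2)))
    return pairs

if __name__ == "__main__":
    records = [
        {"name": "ANA GARCIA", "address": "CALLE MAYOR 1"},
        {"name": "ANA GARCIA", "address": "C MAYOR 1"},
        {"name": "LUIS PEREZ", "address": "AVENIDA DEL SOL 3"},
    ]
    for r1, r2, score in likely_duplicates(records):
        print(score, r1["address"], "~", r2["address"])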

In conclusion, and going back to the start, the only way to generate Value, the fifth “V” of Big Data, is to guarantee the fourth, Veracity. And to do so, we should prevent the first three, Volume, Velocity and Variety, from weighing us down or being used as an excuse for not delivering the last two.

The best way to stop the proliferation of Dirty Data is to opt for Data Quality, to prevent Big Data from becoming a Big Problem.

Mario Peñas, Key Account Manager at DEYDE