Big data, the true definition

Big data generally refers to data sets that are too large or complex for traditional data-processing application software to handle adequately. Big data challenges include capture, storage, analysis, curation, search, sharing, transfer, visualization, querying, updating, and information privacy. The term often refers simply to the use of predictive analytics or other advanced techniques to extract value from data, and seldom to a data set of a particular size.

There is no single, universally accepted definition of what constitutes ‘big data’.

Big data is commonly characterized as data sets whose size is beyond the ability of commonly used software tools to capture, manage, and process within a tolerable elapsed time. In a 2001 research report, Doug Laney, then an analyst at META Group (acquired by Gartner in 2005), framed the challenges of data growth along three dimensions that have since become known as the 3Vs:

  • Volume: Data is increasing at an unprecedented rate. This is especially true in the case of unstructured data, which is growing faster than any other type of data.
  • Variety: Data comes in many different formats, both structured and unstructured.
  • Velocity: The speed at which data is generated and processed is increasing. This is especially true in the case of real-time data, which must be processed as it is generated.

    These three Vs have since been extended with additional Vs, such as veracity, value, and variability.

    Big Data Size

    There is no specific size that big data sets need to meet in order to be termed ‘big data’. The size of a big data set is relative to the computing power and storage capacity available to the organization or individual working with the data. A large retailer, for example, may consider a data set of a few terabytes to be small, while a research scientist working on a personal computer may consider a data set of a few gigabytes to be big.

    Types of Big Data

    There are two main types of big data:

  • Structured data: This is data that is organized in a predefined manner and is easy to process. This type of data is usually stored in a relational database.
  • Unstructured data: This is data that is not organized in a predefined manner and is difficult to process. This type of data includes emails, videos, images, social media posts, and sensor data.

    The main difference is in how readily the data can be processed: structured data's predefined organization makes it straightforward to store and query, while unstructured data must first be interpreted or transformed before it can be analyzed.
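
    As a hypothetical illustration of the difference (Python, standard library only; the field names and text are made up), the sketch below reads a structured CSV record by field name and then extracts the same facts from an unstructured free-text note with ad-hoc parsing:

      import csv
      import io
      import re

      # Structured data: rows follow a predefined schema, so fields can be
      # addressed by name with no extra interpretation.
      structured = io.StringIO("customer_id,amount,date\n1001,59.90,2024-03-01\n")
      for row in csv.DictReader(structured):
          print(row["customer_id"], row["amount"])  # -> 1001 59.90

      # Unstructured data: the same facts buried in free text have no schema,
      # so extracting them requires ad-hoc parsing (here, a simple regex).
      note = "Customer 1001 called on 2024-03-01 to dispute a charge of 59.90."
      match = re.search(r"charge of (\d+\.\d+)", note)
      print(match.group(1) if match else "amount not found")  # -> 59.90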

    Characteristics of Big Data

    The main characteristic of big data is its large size. However, there are other characteristics that are often used to describe big data, including the following:

  • Volume: Big data sets are usually large in size.
  • Variety: Big data sets often contain a variety of data types, such as text, images, audio, and video.
  • Velocity: Big data sets are often generated and processed at high speeds.
  • Veracity: The data in big data sets is often of uncertain quality.
  • Value: Big data sets often contain valuable information that can be used to make decisions.
  • Variability: The structure, format, and flow of the data are often inconsistent, changing across sources and over time.

    Big Data Applications

    Big data sets are often used in a variety of applications, including the following:

    Predictive analytics: Big data sets are often used to build models that forecast future events (a brief sketch follows these examples).

    Marketing: Big data sets can be used to identify trends and patterns in customer behavior. This information can inform marketing decisions, such as which products to promote and how to target customers.

    Operations: Big data sets can be used to improve operational efficiency. For example, big data can be used to monitor manufacturing equipment to identify potential problems before they occur.

    Risk management: Big data sets can be used to identify and assess risk. For example, banks can use big data to assess the creditworthiness of loan applicants.
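
    As referenced above, here is a minimal sketch of the predictive-analytics idea, assuming scikit-learn is available; the data is synthetic and the example is illustrative rather than a production recipe:

      # Train a model on historical (here, synthetic) records, then score a new
      # record -- "predicting a future event" amounts to scoring unseen data.
      from sklearn.datasets import make_classification
      from sklearn.linear_model import LogisticRegression
      from sklearn.model_selection import train_test_split

      X, y = make_classification(n_samples=1000, n_features=5, random_state=0)
      X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

      model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
      print("held-out accuracy:", round(model.score(X_test, y_test), 3))
      print("predicted class for one unseen record:", model.predict(X_test[:1])[0])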

    Big Data Analytics

    Big data analytics is the process of analyzing big data sets to extract valuable information; it can be used to predict future events, identify trends and patterns in customer behavior, and improve operational efficiency. It is usually performed with a combination of techniques, ranging from traditional statistical analysis to machine learning, data mining, and text mining.

    There are a variety of big data analytics tools available, including the following:

  • Statistical analysis: Statistical analysis is a traditional data processing technique that can be used to analyze big data sets. Statistical analysis tools include regression analysis and time series analysis.
  • Machine learning: Machine learning is a branch of artificial intelligence in which models learn from data and improve with experience. Machine learning tools include decision trees and support vector machines.
  • Data mining: Data mining is a type of big data analytics that involves searching for hidden patterns and relationships in big data sets. Data mining tools include cluster analysis and association rules.
  • Text mining: Text mining is a type of big data analytics that involves extracting information from text data. Text mining tools include natural language processing libraries and text analysis software.
  • Graphical analysis: Graphical analysis is a type of big data analytics that involves visualizing data sets. Graphical analysis tools include scatter plots and histograms.

    These are just a few of the many big data analytics tools available; others include open-source, commercial, and cloud-based tools.
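
    To make one of the listed techniques concrete, the sketch below runs a simple cluster analysis (a common data mining step) on synthetic data; the use of scikit-learn and the made-up data are assumptions for illustration only:

      # Cluster analysis: group similar records together without labels,
      # a typical data mining step for finding hidden structure.
      from sklearn.cluster import KMeans
      from sklearn.datasets import make_blobs

      X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
      kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

      print("cluster sizes:", [int((kmeans.labels_ == k).sum()) for k in range(3)])
      print("cluster centers:\n", kmeans.cluster_centers_)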

    Big Data Privacy

    With big data comes big responsibility. When organizations collect and process large amounts of data, they have a responsibility to protect the privacy of the individuals whose data they are collecting and processing.

    There are a variety of ways to protect the privacy of individuals when working with big data, including the following:

    Anonymizing data: Anonymizing data is the process of removing personally identifiable information, such as names, addresses, and other personal details, from data sets.

    Encrypting data: Encrypting data is the process of transforming data into a form that cannot be read without the corresponding decryption key. This can be done using a variety of encryption algorithms.

    Tokenizing data: Tokenizing data is the process of replacing sensitive values with non-sensitive surrogate tokens, with the mapping back to the original values stored separately and securely. This protects individual privacy while still allowing the data to be processed and analyzed.

    Hashing data: Hashing data is the process of transforming data into a fixed-size value using a one-way function, so the original value cannot be directly recovered from the hash. This can be done using a variety of hashing algorithms.
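
    The sketch below shows, in plain Python, roughly what three of these techniques look like on a single made-up record: dropping direct identifiers (anonymization), replacing a value with a random surrogate (tokenization), and one-way hashing. Encryption is omitted because it normally relies on a dedicated cryptography library; all field names and values here are hypothetical.

      import hashlib
      import secrets

      record = {"name": "Jane Doe", "email": "jane@example.com", "purchase_total": 59.90}

      # Anonymization: drop fields that directly identify the individual.
      anonymized = {k: v for k, v in record.items() if k not in ("name", "email")}

      # Tokenization: replace a sensitive value with a random surrogate token and
      # keep the mapping in a separate, access-controlled store.
      token = secrets.token_hex(16)
      token_vault = {token: record["email"]}  # held separately in practice
      tokenized = {**anonymized, "email_token": token}

      # Hashing: a one-way transform; equal inputs give equal digests, so hashed
      # values can still be joined or counted without exposing the raw data.
      email_hash = hashlib.sha256(record["email"].encode("utf-8")).hexdigest()

      print(anonymized)
      print(tokenized)
      print(email_hash)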

    These are just a few of the ways to protect the privacy of individuals when working with big data. Many other techniques exist, and organizations should implement those best suited to their needs.

    Data Quality

    Organizations also have a responsibility to ensure that the data they collect and process is accurate and of high quality. Data quality is an important part of any big data effort.

    There are a variety of ways to improve the quality of data, including the following:

    Data cleansing: Data cleansing is the process of identifying and correcting inaccurate, incomplete, or inconsistent records in data sets. This can be done using a variety of data cleansing tools and techniques.

    Data standardization: Data standardization is the process of converting data to consistent representations, for example by using a common date format, consistent units of measurement, or a common interchange format such as XML or JSON.

    Data normalization: Data normalization is the process of organizing data so that redundancy and inconsistency are minimized, for example by removing duplicate records and ensuring that each fact is stored in only one place.
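
    A minimal data-quality sketch, assuming pandas is installed and using made-up column names, with one cleansing step, one standardization step, and one normalization-style step in the loose sense described above:

      import pandas as pd

      raw = pd.DataFrame({
          "customer": ["Alice ", "alice", "Bob", "Bob"],
          "country":  ["US", "us", "DE", "DE"],
          "amount":   [59.90, 59.90, 12.50, 12.50],
      })

      # Cleansing: drop rows that are exact duplicates.
      clean = raw.drop_duplicates()

      # Standardization: bring values into consistent representations
      # (trimmed whitespace, consistent letter case).
      clean = clean.assign(
          customer=clean["customer"].str.strip().str.title(),
          country=clean["country"].str.upper(),
      )

      # Normalization-style step: duplicates that only become visible after
      # standardization can now be removed as well.
      normalized = clean.drop_duplicates()

      print(normalized)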

    These are just a few of the ways to improve data quality. Many other techniques exist, and organizations should implement those best suited to their needs.

    There is no single definition of “big data,” but in general it refers to data sets that are too large or complex for traditional data processing and analysis tools. Big data often includes data that is unstructured, such as text, images, and videos. It can come from a variety of sources, including social media, sensors, and transactional data.
    Big data presents a number of challenges, including the need for new storage and processing tools and for new techniques to analyze and extract insight from large, complex data sets. It also presents significant opportunities, such as the ability to detect previously unseen patterns and correlations and to make better, more informed decisions.

    Article written by Franck Jr. Walter
    contact me at: franck [at] ketrium.com