
What is Tokenization?


Overview

Tokenization is a process by which PANs, PHI, PII, and other sensitive data elements are replaced by surrogate values, or tokens. Tokenization is really a form of encryption, but the two terms are typically used differently. Encryption usually means encoding human-readable data into incomprehensible text that can only be decoded with the right decryption key, while tokenization (or “masking”, or “obfuscation”) means some form of format-preserving data protection: converting sensitive values into non-sensitive replacement values – tokens – with the same length and format as the original data.

  • Tokens share some characteristics with the original data elements, such as character set, length, etc.
  • Each data element is mapped to a unique token.
  • Tokens are deterministic: repeatedly generating a token for a given value yields the same token.
  • A tokenized database can be searched by tokenizing the query terms and searching for those (see the sketch after this list).
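
Because tokens are deterministic, an application can match or search protected records without ever detokenizing them. The sketch below is a minimal Python illustration, not a production scheme, and the key handling is purely hypothetical: it derives a digit-only token of the same length from a keyed HMAC, stores only tokens, and answers a query by tokenizing the search term first.

    import hashlib
    import hmac

    SECRET_KEY = b"demo-key"  # hypothetical key; a real system would use managed key material

    def tokenize(value: str) -> str:
        """Derive a deterministic, digit-only token with the same length as the input."""
        digest = hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()
        # Map digest characters to digits so the token keeps the original length and
        # character set (sufficient for values up to 64 characters).
        return "".join(str(int(digest[i], 16) % 10) for i in range(len(value)))

    # A "tokenized database": only tokens are stored, never the original card numbers.
    orders_by_token = {
        tokenize("4111111111111111"): "order-1001",
        tokenize("5500005555555559"): "order-1002",
    }

    # Searching: tokenize the query term, then look up the token.
    print(orders_by_token.get(tokenize("4111111111111111")))  # -> order-1001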

As a form of encryption, tokenization is a key data privacy protection strategy for any business. This page provides a very high-level view of what tokenization is and how it works.

Tokenization

Where did tokenization come from?

Digital tokenization was first created by TrustCommerce in 2001 to help a client protect customer credit card information. Merchants were storing cardholder data on their own servers, which meant that anyone who had access to their servers could potentially view or take advantage of those customer credit card numbers.

TrustCommerce developed a system that replaced primary account numbers (PANs) with a randomized number called a token. This allowed merchants to store and reference tokens when accepting payments. TrustCommerce converted the tokens back to PANs and processed the payments using the original PANs. This isolated the risk to TrustCommerce, since merchants no longer had any actual PANs stored in their systems.

As security concerns and regulatory requirements grew, such first-generation tokenization proved the technology’s value, and other vendors offered similar solutions. However, problems with this approach soon became clear.


What types of tokenization are available?

There are two types of tokenization: reversible and irreversible.

Reversible tokens can be detokenized – converted back to their original values. In privacy terminology, this is called pseudonymization. Such tokens may be further subdivided into cryptographic and non-cryptographic, although this distinction is artificial, since any tokenization really is a form of encryption.

Cryptographic tokenization generates tokens using strong cryptography; the cleartext data element(s) are not stored anywhere – just the cryptographic key. NIST-standard FF1-mode AES is an example of cryptographic tokenization.
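
FF1 itself is a NIST-specified mode of AES and should be used via a vetted implementation. The sketch below is only a toy Feistel-style construction over even-length digit strings, meant to illustrate the defining property of cryptographic tokenization: tokens are computed, and reversed, from a key alone, with no vault or lookup table. It is not FF1 and is not suitable for real use.

    import hashlib
    import hmac

    def _round_value(key: bytes, round_no: int, half: str, out_len: int) -> int:
        """Keyed round function: a pseudo-random number derived from one half of the digits."""
        digest = hmac.new(key, f"{round_no}:{half}".encode(), hashlib.sha256).hexdigest()
        return int(digest, 16) % (10 ** out_len)

    def tokenize_digits(key: bytes, digits: str, rounds: int = 10) -> str:
        """Toy format-preserving encryption of an even-length digit string (NOT FF1)."""
        half = len(digits) // 2
        left, right = digits[:half], digits[half:]
        for r in range(rounds):
            f = _round_value(key, r, right, half)
            left, right = right, str((int(left) + f) % 10 ** half).zfill(half)
        return left + right

    def detokenize_digits(key: bytes, token: str, rounds: int = 10) -> str:
        """Reverse the rounds using only the key; nothing was stored per token."""
        half = len(token) // 2
        left, right = token[:half], token[half:]
        for r in reversed(range(rounds)):
            f = _round_value(key, r, left, half)
            left, right = str((int(right) - f) % 10 ** half).zfill(half), left
        return left + right

    key = b"demo-key"                       # hypothetical key material
    token = tokenize_digits(key, "4111111111111111")
    assert detokenize_digits(key, token) == "4111111111111111"
    print(token)                            # 16 digits, same format as the input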

Non-cryptographic tokenization originally meant that tokens were created by randomly generating a value and storing the cleartext and corresponding token in a database, like the original TrustCommerce offering. This approach is conceptually simple, but every tokenization or detokenization request requires a round trip to the token server, adding overhead, complexity, and risk. It also does not scale well. Consider a request to tokenize a value: the server must first perform a database lookup to see if it already has a token for that value. If it does, it returns that token. If not, it must generate a new random value, then do another database lookup to make sure that value has not already been assigned to a different cleartext. If it has, the server must generate another random value, check that one, and so forth. As the number of tokens grows, these database lookups take longer; worse, the likelihood of collisions rises as the token space fills up, so more and more retries are needed. Such implementations also typically use multiple token servers for load balancing, reliability, and failover. These must perform real-time database synchronization to ensure reliability and consistency, adding further complexity and overhead.
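
A vault-based token server can be sketched in a few lines of Python; below, in-memory dictionaries stand in for the real, synchronized database, and the collision-retry loop is the part whose cost grows as the vault fills up.

    import secrets

    # In-memory stand-ins for the token vault; a real deployment uses a synchronized database.
    value_to_token = {}
    token_to_value = {}

    def tokenize(value: str) -> str:
        existing = value_to_token.get(value)        # lookup 1: already tokenized?
        if existing is not None:
            return existing
        while True:
            candidate = "".join(secrets.choice("0123456789") for _ in value)
            if candidate not in token_to_value:     # lookup 2: collision check
                value_to_token[value] = candidate
                token_to_value[candidate] = value
                return candidate
            # Collision: candidate already maps to a different cleartext; try again.

    def detokenize(token: str) -> str:
        return token_to_value[token]

    token = tokenize("4111111111111111")
    assert detokenize(token) == "4111111111111111"
    assert tokenize("4111111111111111") == token    # same value always yields the same token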

Modern non-cryptographic tokenization focuses on “stateless” or “vaultless” approaches, which use randomly generated metadata that is securely combined to build tokens. Such systems can operate disconnected from each other and scale essentially without limit, since, unlike database-backed tokenization, they require no synchronization beyond distributing the original metadata.
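
The details of commercial vaultless schemes vary; the deliberately simplified sketch below only illustrates the architectural point. A set of random substitution tables is generated once; that static metadata is all that independent token engines need to share, so tokenization and detokenization can run anywhere, with no per-token state and no synchronization. A real scheme combines much larger random tables with the input in far more sophisticated, secure ways.

    import secrets

    # One-time setup: a random permutation of the digits 0-9 for each character position.
    # This static metadata can be copied to every token engine; no further synchronization needed.
    _rng = secrets.SystemRandom()
    TABLES = []
    for _ in range(19):                       # long enough for the longest PAN
        digits = list("0123456789")
        _rng.shuffle(digits)
        TABLES.append(digits)

    def tokenize(value: str) -> str:
        # Substitute each digit through the table for its position.
        return "".join(TABLES[i][int(d)] for i, d in enumerate(value))

    def detokenize(token: str) -> str:
        # Invert each substitution; the tables are permutations, so this is exact.
        return "".join(str(TABLES[i].index(d)) for i, d in enumerate(token))

    pan = "4111111111111111"
    token = tokenize(pan)
    assert detokenize(token) == pan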

Irreversible tokens cannot be converted back to their original values. In privacy terminology, this is called anonymization. Such tokens are created through a one-way function, allowing anonymized data elements to be used for third-party analytics, as production-like data in lower (test and development) environments, and so on.
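
For example, an analytics extract can anonymize identifiers with a keyed one-way hash: the same identifier always maps to the same token, so records still join across datasets, but there is no detokenize operation and no mapping table to reverse. The key name and values below are purely illustrative.

    import hashlib
    import hmac

    ANONYMIZATION_KEY = b"per-extract-key"   # hypothetical; can be discarded once the extract is produced

    def anonymize(value: str) -> str:
        # One-way token: nothing is stored that could map it back to the original value.
        return hmac.new(ANONYMIZATION_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

    # The same customer appears in two extracts; the anonymized IDs still join correctly.
    orders  = {anonymize("cust-1001"): ["order-1", "order-2"]}
    tickets = {anonymize("cust-1001"): ["ticket-9"]}
    shared  = anonymize("cust-1001")
    print(orders[shared], tickets[shared])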


Tokenization benefits

Tokenization requires minimal changes to add strong data protection to existing applications. Traditional encryption solutions enlarge the data, requiring significant changes to database and program data schemas as well as additional storage; encrypted values also fail format validation checks, requiring further code analysis and updates. Tokens use the same data formats, require no additional storage, and can pass validation checks.
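
Card numbers, for instance, are commonly validated with the Luhn check digit, a check that a ciphertext blob fails immediately but a same-format digit token can satisfy. The sketch below shows the check itself plus a purely illustrative random token generator that appends a valid check digit.

    import secrets

    def luhn_checksum(digits: str) -> int:
        total = 0
        for i, ch in enumerate(reversed(digits)):
            n = int(ch)
            if i % 2 == 1:        # double every second digit from the right
                n *= 2
                if n > 9:
                    n -= 9
            total += n
        return total % 10

    def luhn_valid(number: str) -> bool:
        return luhn_checksum(number) == 0

    def random_luhn_token(length: int = 16) -> str:
        # Random digit body plus a computed Luhn check digit, so the token validates like a PAN.
        body = "".join(secrets.choice("0123456789") for _ in range(length - 1))
        check = (10 - luhn_checksum(body + "0")) % 10
        return body + str(check)

    token = random_luhn_token()
    print(luhn_valid(token))                  # True: the token passes the same check as a real PAN
    print(luhn_valid("4111111111111111"))     # True: well-known test card number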

As applications share data, tokenization is also much easier to add than encryption, since data exchange processes are unchanged. In fact, many intermediate data uses – between ingestion and final disposition – can typically work with the token without ever detokenizing it. This improves security: data can be protected as soon as it is acquired and kept protected throughout most of its lifecycle.

Within the limits of security requirements, tokens can retain partial cleartext values, such as the leading and trailing digits of a credit card number. This allows required functions—such as card routing and “last four” verification or printing on customer receipts—to be performed using the token, without having to convert it back to the actual value.
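
A minimal sketch of this idea, assuming a hypothetical key and a keyed-hash digit replacement for the middle of the number: the leading six digits (used for routing) and the last four are carried through unchanged, and only the middle digits are replaced.

    import hashlib
    import hmac

    KEY = b"demo-key"   # hypothetical key material

    def tokenize_pan_partial(pan: str, keep_leading: int = 6, keep_trailing: int = 4) -> str:
        head = pan[:keep_leading]
        middle = pan[keep_leading:len(pan) - keep_trailing]
        tail = pan[len(pan) - keep_trailing:]
        # Deterministically replace only the middle digits.
        digest = hmac.new(KEY, middle.encode(), hashlib.sha256).hexdigest()
        masked = "".join(str(int(digest[i], 16) % 10) for i in range(len(middle)))
        return head + masked + tail

    token = tokenize_pan_partial("4111111111111111")
    print(token[:6], token[-4:])   # leading six and last four survive for routing and receipts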

This ability to use tokens directly improves both performance and security: performance, because no detokenization overhead is incurred when none is required; and security, because the cleartext is never recovered, leaving less attack surface.


What is tokenization used for?

Tokenization is used to secure many different types of sensitive data, including:

  • Payment card data
  • U.S. Social Security numbers and other national identification numbers
  • Telephone numbers
  • Passport numbers
  • Driver’s license numbers
  • Email addresses
  • Bank account numbers
  • Names, addresses, birth dates

As data breaches rise and data security becomes increasingly important, organizations find tokenization appealing because it is easier to add to existing applications than traditional encryption.

PCI DSS compliance

Safeguarding payment card data is one of the most common use cases for tokenization, in part because of routing requirements for different card types as well as “last four” validation of card numbers. Tokenization for card data got an early boost from requirements set by the Payment Card Industry Security Standards Council (PCI SSC). The Payment Card Industry Data Security Standard (PCI DSS) imposes strict cybersecurity requirements on businesses that handle payment card data. While PCI DSS permits securing payment card data with encryption, merchants may also use tokenization to meet compliance requirements. Since payment data flows are complex, performance-sensitive, and well defined, tokenization is much easier to add than encryption.


Secure sensitive data with tokenization

Tokenization is becoming an increasingly popular way to protect data, and can play a vital role in a data privacy protection solution. OpenText™ Cybersecurity is here to help secure sensitive business data using OpenText™ Voltage™ SecureData, which provides a variety of tokenization methods to fit every need.

Voltage SecureData and other cyber resilience solutions can augment human intelligence with artificial intelligence to strengthen any enterprise’s data security posture. Not only does this provide intelligent encryption and a smarter authentication process, but it also enables easy detection of new and unknown threats through contextual threat insights.


Related products

OpenText™ Voltage™ SecureData

Protect high-value data while keeping it usable for hybrid IT

OpenText™ Data Discovery, Protection, and Compliance

Understand and secure data to reduce risk, support compliance, and govern data access
