Pandas: k-Anonymity with groupby
Created: 26.01.2019
In this article, I demonstrate the k-anonymity approach using the pandas groupby functionality. This approach ensures that no combination of attribute values in a select statement can single out fewer than k individuals (or rows). Of course, key attributes are excluded, as they are unique by nature.
Depending on your use case, you probably want to know either which k-anonymity your data currently has, or you want to make sure that your data has at least a certain k-anonymity. Suppose we have the following dataframe.
import pandas as pd
columns = ['Age', 'Zip', 'Contact']
data = [
    [18, 'AB', 'mail'],
    [18, 'AB', 'phone'],
    [18, 'AB', 'mail'],
    [18, 'AB', 'phone'],
    [19, 'AB', 'mail'],
    [19, 'AB', 'mail'],
    [18, 'CD', 'phone'],
    [18, 'CD', 'mail'],
    [18, 'CD', 'phone'],
    [19, 'CD', 'mail'],
]
df = pd.DataFrame(data, columns=columns)
Note: Make sure that the indexes of the dataframe are unique, especially when you concat/merge dataframes. This is very important, because drop works on index labels and removes every row that shares a label. If you want to concatenate several dataframes, use:
df = pd.concat([df, df2]).reset_index(drop=True)
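To see why unique indexes matter, here is a small sketch. The frames `a` and `b` are made up for illustration: without `reset_index`, both keep their labels 0 and 1, so dropping label 0 removes two rows instead of one.

```python
import pandas as pd

# Two hypothetical frames, each with the default index 0, 1
a = pd.DataFrame({'Age': [18, 19], 'Zip': ['AB', 'AB']})
b = pd.DataFrame({'Age': [18, 19], 'Zip': ['CD', 'CD']})

# Plain concat keeps the original labels, so 0 and 1 each appear twice
dup = pd.concat([a, b])
rows_left_dup = len(dup.drop(0))    # label 0 matches two rows

# After reset_index the labels are 0..3 and drop removes exactly one row
uniq = pd.concat([a, b]).reset_index(drop=True)
rows_left_uniq = len(uniq.drop(0))

print(rows_left_dup, rows_left_uniq)
```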
Grouping by person-specific attributes
Perform the grouping operation on the attributes containing person-specific data. In this case, these are 'Age' and 'Zip'.
grouped = df.groupby(['Age', 'Zip'])
Remember, if you do not add a selection after the groupby (which is often done in examples), the result is a DataFrameGroupBy object, and each of its groups is a DataFrame that shares the same structure as the original DataFrame. Check this article for more information.
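A quick way to inspect what the grouping returns is to look at the types involved. The sketch below uses a shortened version of the example data so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame(
    [[18, 'AB', 'mail'], [18, 'AB', 'phone'], [19, 'CD', 'mail']],
    columns=['Age', 'Zip', 'Contact'],
)
grouped = df.groupby(['Age', 'Zip'])

# The groupby object itself is lazy; nothing is aggregated yet
print(type(grouped).__name__)

# A single group, selected by its (Age, Zip) key, is a regular DataFrame
group = grouped.get_group((18, 'AB'))
print(type(group).__name__, len(group))

# ngroups is the number of distinct (Age, Zip) combinations
print(grouped.ngroups)
```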
Computing k-anonymity
Now there are several options to detect groups with a size smaller than k. I will use a rather naive approach by iterating over the groups as I want to use that approach later on as well.
k = 2
for name, group in grouped:
    # a group with fewer than k rows violates k-anonymity
    if group.shape[0] < k:
        print(group)
Of course, you can also do something like this, if your DataFrame is small:
# aggregation function
grouped.agg('count')
# return group size as Series
grouped.size()
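Building on `size()`, the k that the data currently satisfies is the smallest group size. The following sketch rebuilds the example DataFrame so that it runs on its own; the names `sizes`, `current_k` and `violations` are chosen for illustration.

```python
import pandas as pd

columns = ['Age', 'Zip', 'Contact']
data = [
    [18, 'AB', 'mail'], [18, 'AB', 'phone'], [18, 'AB', 'mail'],
    [18, 'AB', 'phone'], [19, 'AB', 'mail'], [19, 'AB', 'mail'],
    [18, 'CD', 'phone'], [18, 'CD', 'mail'], [18, 'CD', 'phone'],
    [19, 'CD', 'mail'],
]
df = pd.DataFrame(data, columns=columns)

# Number of rows per (Age, Zip) combination
sizes = df.groupby(['Age', 'Zip']).size()

# The data is k-anonymous for every k up to the smallest group size
current_k = int(sizes.min())

# Combinations that violate a desired k (here: 2)
k = 2
violations = sizes[sizes < k]

print(current_k)
print(violations)
```

For the example data, the smallest group, (19, 'CD'), contains a single row, so the data is only 1-anonymous.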
If your DataFrame is rather large and you only want to display the groups that are below your given threshold k, you can use the code from above or some of these alternatives:
# keep only the rows whose group has fewer than k members
partial_df = grouped.filter(lambda x: len(x) < k)
# in current pandas versions, filter returns the matching rows of the original DataFrame with all columns, so no merge is needed
partial_df
Ensuring k-anonymity
In this case, you want to make sure that your data complies with a certain k. Let's assume you want to have 2-anonymity.
For this purpose, you can either modify the attributes 'Age' and 'Zip' by making them more coarse-grained, or delete the rows that do not comply.
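The first option, making the attributes more coarse-grained, could look like the sketch below. Replacing the exact age with an age range is an assumption made for illustration; which attribute to coarsen, and how far, depends on your data.

```python
import pandas as pd

columns = ['Age', 'Zip', 'Contact']
data = [
    [18, 'AB', 'mail'], [18, 'AB', 'phone'], [18, 'AB', 'mail'],
    [18, 'AB', 'phone'], [19, 'AB', 'mail'], [19, 'AB', 'mail'],
    [18, 'CD', 'phone'], [18, 'CD', 'mail'], [18, 'CD', 'phone'],
    [19, 'CD', 'mail'],
]
df = pd.DataFrame(data, columns=columns)

# Hypothetical generalization: replace the exact age with a two-year range
generalized = df.copy()
generalized['Age'] = generalized['Age'].map({18: '18-19', 19: '18-19'})

# Recompute the group sizes on the coarser attributes
sizes = generalized.groupby(['Age', 'Zip']).size()
print(sizes.min())
```

No rows are lost this way, but the age becomes less precise. The remainder of this section follows the second option and deletes the non-compliant rows.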
k = 2
for name, group in grouped:
    # drop every group with fewer than k rows
    if group.shape[0] < k:
        df.drop(group.index, inplace=True)
df
This code is probably not the best in terms of performance, as iteration is costly. You could already store the indexes of the relevant groups in the first iteration, when you count the individuals that do not comply.
After executing the code, the row with index 9 is removed, as its group size was only 1.
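As a possible alternative to the loop, the group size can be attached to every row with `transform('size')` and used as a boolean mask. This is a sketch of one vectorized variant, not a benchmarked replacement; `enforce_k_anonymity` is a helper name introduced here.

```python
import pandas as pd


def enforce_k_anonymity(df, quasi_identifiers, k):
    """Return a copy of df without rows whose group has fewer than k members."""
    # Size of the group each row belongs to, aligned with df's index
    group_sizes = (
        df.groupby(quasi_identifiers)[quasi_identifiers[0]].transform('size')
    )
    return df[group_sizes >= k].copy()


columns = ['Age', 'Zip', 'Contact']
data = [
    [18, 'AB', 'mail'], [18, 'AB', 'phone'], [18, 'AB', 'mail'],
    [18, 'AB', 'phone'], [19, 'AB', 'mail'], [19, 'AB', 'mail'],
    [18, 'CD', 'phone'], [18, 'CD', 'mail'], [18, 'CD', 'phone'],
    [19, 'CD', 'mail'],
]
df = pd.DataFrame(data, columns=columns)

result = enforce_k_anonymity(df, ['Age', 'Zip'], k=2)
print(result)
```

Unlike the loop with `inplace=True`, this variant leaves the original DataFrame untouched and returns a new one.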