aGrUM/pyAgrum 0.22.9 and dataframe

Posted on Tue 05 April 2022 in News

Sampling and learning with pandas dataframes

During APUD22, it was brought to our attention that it would be nice to be able to use pandas dataframes directly for learning or sampling. Without waiting, we set to work to bring you this feature with the release of version 0.22.9. In this short article, we will see how to use these new features. First things first, let's import pandas and pyAgrum:

import pandas as pd
import pyAgrum as gum

To illustrate the new sampling feature, it is first necessary to load a Bayesian network (BN). For this example, we will use the Asia network, which you can find and download by following this link:

bn_ref = gum.loadBN("asia.bif")

Now that the BN is loaded, it is ready for sampling using the generateSample method:

data = gum.generateSample(bn_ref, 10_000)[0]  # keep only the dataframe from the (dataframe, likelihood) pair

As we haven't specified an output file, the method returns a tuple (dataframe, likelihood). Before version 0.22.9, we would have needed to create a CSV file and then load it into a pandas dataframe using pd.read_csv.
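
As a quick, optional check (not part of the original workflow), the returned dataframe can be inspected directly with the usual pandas tools:

print(data.shape)   # one row per sample, one column per variable of the network
print(data.head())  # first few samples, with the drawn label for each variable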

Now we can manipulate the data using this dataframe. For example, let's say we only want to select the variables smoking, lung_cancer and positive_XraY:

data = pd.DataFrame(data, columns=["smoking", "lung_cancer", "positive_XraY"])

Moreover, let's say we would like to learn the structure on the first 2000 rows of the dataset:

train_data = data[:2000]

This being done, it leads us to the second novelty: you no longer need to create a temporary CSV file, and you can learn the BN directly from the dataframe (!):

s_learner = gum.BNLearner(train_data)  # creates a learner by passing the dataframe
s_learner.useGreedyHillClimbing()      # sets a local-search algorithm for the structural learning
s_learner.useScoreBIC()                # sets the BIC score as the metric
structure = s_learner.learnBN()        # learns the structure
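
As an optional sanity check (a small sketch not in the original post), the arcs found by greedy hill climbing can be printed by name using the standard pyAgrum accessors:

for tail, head in structure.arcs():
    print(structure.variable(tail).name(), "->", structure.variable(head).name())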

Now that we have learned the structure, we only have to learn the parameters of the BN. Let's be adventurous and say that the learning sample contains missing values. To do so, let's introduce these missing values into the pandas dataframe:

incomplete_data = data[2000:].copy()
incomplete_data.loc[incomplete_data.index[:500], "smoking"] = "?"  # '?' instead of NaN

Note that we used ? to mark the missing values. From these data, we learn the parameters of the BN using the EM algorithm:

bn_learned = gum.BayesNet(structure)                   # initializes the BN with the learned structure
p_learner = gum.BNLearner(incomplete_data, structure)  # creates a learner to estimate the parameters
p_learner.useEM(1e-10)                                 # sets EM (1e-10 is the convergence threshold) to learn the parameters
p_learner.fitParameters(bn_learned)                    # learns the parameters of bn_learned
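
As a small usage sketch (not in the original post), the learned BN can then be queried like any other pyAgrum network, for instance with exact inference:

ie = gum.LazyPropagation(bn_learned)  # exact inference engine on the learned BN
ie.makeInference()
print(ie.posterior("lung_cancer"))    # marginal posterior of lung_cancer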

Finally, we have learned a full BN (structure + parameters) from the data we generated. Whether for sampling, learning the structure or learning the parameters, we have used pandas dataframes to store and pass the data, whereas before we would have had to go through CSV files. Of course, the old method is still valid, but since it is common to load data into a pandas dataframe to do a pre-analysis, this new method will surely be more interesting for some of you.
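
For reference, here is a sketch of the CSV-based workflow that remains available (the file name is arbitrary and only used for illustration):

data.to_csv("asia_sample.csv", index=False)     # write the dataframe to a CSV file, as before 0.22.9
csv_learner = gum.BNLearner("asia_sample.csv")  # BNLearner also accepts a path to a CSV file
csv_learner.useGreedyHillClimbing()
bn_from_csv = csv_learner.learnBN()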