aGrUM/pyAgrum 0.22.9 and dataframe
Posted on Tue 05 April 2022 in News
Sampling and learning with pandas dataframes
During APUD22, it was brought to our attention that it would be nice to be able to use pandas dataframes directly for learning or sampling. Without waiting, we set to work to bring you this feature with the release of version 0.22.9. In this short article, we will see how to use these new features. First things first, let's import pandas and pyAgrum:
    import pandas as pd
    import pyAgrum as gum
To illustrate the new sampling feature, it is first necessary to load a Bayesian network (BN). For this example, we will use the Asia network that you can find and download following this link:
    bn_ref = gum.loadBN("asia.bif")
Now that the BN is loaded, it is ready for sampling with the generateSample function:
    data = gum.generateSample(bn_ref, 10000)[0]
Since we did not specify an output file, the function returns a tuple (dataframe, likelihood), and the [0] keeps only the dataframe. Before version 0.22.9, we would have had to create a CSV file and then load it into a pandas dataframe using pd.read_csv.
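For comparison, the CSV round trip that this release makes unnecessary can be sketched with pandas alone. The toy dataframe and its yes/no labels are made up for illustration (they are not actual Asia samples), and an in-memory buffer stands in for the temporary file:

```python
import io

import pandas as pd

# Hypothetical stand-in for sampled data (not real Asia samples).
toy = pd.DataFrame(
    {
        "smoking": ["yes", "no", "no", "yes"],
        "lung_cancer": ["no", "no", "yes", "yes"],
    }
)

# Old workflow: write the samples to CSV, then read them back.
buffer = io.StringIO()  # plays the role of the temporary CSV file
toy.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded)
```

The reloaded table holds the same rows and columns as the original, so the only thing the detour adds is the intermediate file.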
Now we can manipulate the data using this dataframe. For example, let's say we only want to select the variables smoking, lung_cancer and positive_XraY:
    data = pd.DataFrame(data, columns=["smoking", "lung_cancer", "positive_XraY"])
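The same column selection can also be written with plain bracket indexing, which some readers may find more familiar. A minimal sketch on a made-up three-row dataframe (the values are hypothetical; only the column names come from the Asia network):

```python
import pandas as pd

# Hypothetical values; only the column names are taken from the Asia network.
toy = pd.DataFrame(
    {
        "smoking": ["0", "1", "1"],
        "lung_cancer": ["0", "0", "1"],
        "positive_XraY": ["0", "1", "1"],
        "dyspnoea": ["1", "0", "1"],
    }
)
wanted = ["smoking", "lung_cancer", "positive_XraY"]

by_constructor = pd.DataFrame(toy, columns=wanted)  # form used in this article
by_indexing = toy[wanted]                           # bracket-indexing form

print(by_constructor.equals(by_indexing))
```

One practical difference: the bracket form raises a KeyError when a column name is misspelled, whereas the constructor form silently creates a column filled with NaN. With a name like positive_XraY, that check can be handy.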
Moreover, let's say we would like to learn the structure from the first 2000 rows of the data set:
    train_data = data[:2000]
This leads us to the second novelty: you no longer need to create a temporary CSV file, and you can learn the BN directly from the dataframe (!):
    s_learner = gum.BNLearner(train_data)  # creates a learner by passing the dataframe
    s_learner.useGreedyHillClimbing()      # sets a local-search algorithm for the structural learning
    s_learner.useScoreBIC()                # sets BIC score as the metric
    structure = s_learner.learnBN()        # learning the structure
Now that we have learned the structure, we only have to learn the parameters of the BN. Let's be adventurous and say that the learning sample contains missing values. To do so, let's introduce these missing values into the pandas dataframe:
    incomplete_data = data[2000:].copy()
    # a single .loc assignment on the first 500 rows avoids chained indexing
    incomplete_data.loc[incomplete_data.index[:500], "smoking"] = "?"  # instead of NaN
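Because incomplete_data keeps the row labels of the original dataframe (they start at 2000, not 0), it is worth checking that the placeholder landed where intended. A small sketch on hypothetical data, using a single .loc assignment instead of chained indexing, which recent pandas versions warn about:

```python
import pandas as pd

# Hypothetical six-row data set standing in for the sampled dataframe.
toy = pd.DataFrame(
    {
        "smoking": ["0", "1", "1", "0", "1", "0"],
        "lung_cancer": ["0", "0", "1", "0", "1", "0"],
    }
)

tail = toy[2:].copy()          # row labels are 2..5, not 0..3
first_rows = tail.index[:2]    # labels of the first two rows of the slice
tail.loc[first_rows, "smoking"] = "?"

n_missing = int((tail["smoking"] == "?").sum())
print(tail)
print("missing values in smoking:", n_missing)
```

Thanks to the copy, the original toy dataframe is left untouched, and only the selected rows of the slice carry the ? marker.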
Note that we used ? to specify the missing values. From these data we learn the parameterization of the BN using the EM algorithm:
    bn_learned = gum.BayesNet(structure)                   # initializes the BN with the learned structure
    p_learner = gum.BNLearner(incomplete_data, structure)  # creates a learner to learn the parameters
    p_learner.useEM(1e-10)                                 # sets EM to learn the parameters
    p_learner.fitParameters(bn_learned)                    # learns the parameters
We have now learned a full BN (structure + parameters) from the data we generated. Whether for sampling, learning the structure or learning the parameters, we used pandas dataframes to store and pass the data, whereas before we would have had to use CSV files. Of course, the old method is still valid, but since it is common to load data into a pandas dataframe for a preliminary analysis, this new method will surely be more interesting for some of you.