PRM

prm module

The PRM module specifies attributes, entities, relationships and dependencies. The information in the PRM is heavily interlinked because different methods need to access the same data from different starting points. For example, accessing all attributes of a given entity is easiest from that entity instance via entity.attributes, whereas iterating over all attributes in the PRM is easiest via prm.attributes. All attributes, entities, relationships and dependencies are instantiated only once and then referenced. The method xml_prm.parser.parsePRM() initializes all instance variables.
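
The shared-reference design can be illustrated with a minimal sketch. The classes below are simplified stand-ins, not the actual ProbReM classes, and keying the model-wide dictionary by the full attribute name is an assumption. The only point is that one attribute object is reachable both through its entity and through the model-wide dictionary.

```python
# Simplified stand-ins (hypothetical, not the ProbReM implementation).
class Attribute:
    def __init__(self, name, erClass):
        self.name = name
        self.erClass = erClass
        self.fullname = "%s.%s" % (erClass.name, name)


class Entity:
    def __init__(self, name):
        self.name = name
        self.attributes = []   # references to Attribute instances


entities = {}     # model-wide lookup, keyed by entity name (assumption)
attributes = {}   # model-wide lookup, keyed by full attribute name (assumption)

student = Entity("Student")
entities[student.name] = student

gpa = Attribute("gpa", student)   # instantiated exactly once
student.attributes.append(gpa)    # referenced from the entity
attributes[gpa.fullname] = gpa    # referenced from the model

# Both access paths lead to the same object.
same_object = attributes["Student.gpa"] is student.attributes[0]
```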

prm.prm.attributes

Dictionary of all Attribute instances

prm.prm.datainterface

Path to a compatible datainterface XML specification

prm.prm.dependencies

Dictionary of all Dependency instances

prm.prm.entities

Dictionary of all Entity instances

prm.prm.name

Name of the Probabilistic Relational Model

prm.prm.relationships

Dictionary of all Relationship instances

prm.prm.topoSortAttributes

List of attributes that are topologically sorted using prm.attribute.topologicalSort()

attribute module

All attributes must implement the class Attribute, which defines a set of methods that need to be implemented. Currently ProbReM supports a set of discrete variables. Some attribute types are not probabilistic and serve another purpose, e.g. as a foreign key.

All attributes are instantiated by calling the attributeFactory().

Inheritance diagram of prm.attribute: the base class prm.attribute.Attribute is inherited by ExistAttribute, NotProbabilisticAttribute, BinaryAttribute, IntegerAttribute, EnumeratedAttribute and ForeignAttribute.

class prm.attribute.Attribute(name, erClass, hidden=False)[source]

An ‘abstract’ class that defines an attribute (variable) of an entity or relationship class.

CPD

The Conditional Probability Distribution of an attribute, of type CPD

ID

Every attribute has a unique identifier that can be used when hashing. At this point the fullname is used. For performance, a numerical value derived from the name could be used instead.

cardinality

The cardinality is the size of the domain. This value has to be assigned by the specific attribute class which is being instantiated.

dependenciesChild

A list of the Dependency instances that the given attribute is a child of

dependenciesParent

A list of the Dependency instances that the given attribute is a parent of

domain

The domain is a list of all possible values the attribute can take. This value has to be assigned by the specific attribute class which is being instantiated.

erClass

Every attribute is attached to an entity or relationship class. erClass = Entity or Relationship Object

fullname[source]

The full name of an attribute is either ‘Entity_name.Attribute_name’ or ‘Relationship_name.Attribute_name’

hasParents[source]

Returns True if the number of parents is not zero

hidden

Boolean. If True then there is not a corresponding data field (latent variable)

indexing

The dictionary indexing serves to access the index of the domain values. indexing stores { key= domain value : value= index of domain value}.

  • domain = [4,5,6] -> indexing = {4:0,5:1,6:2}
  • domain = [‘A’,‘B’,‘C’] -> indexing = {‘A’:0,‘B’:1,‘C’:2}
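
A minimal sketch of how such an indexing dictionary can be derived from a domain list. The helper name build_indexing is hypothetical and not part of ProbReM.

```python
def build_indexing(domain):
    """Map each domain value to its position in the domain list."""
    return {value: index for index, value in enumerate(domain)}


int_indexing = build_indexing([4, 5, 6])
str_indexing = build_indexing(["A", "B", "C"])
```

Looking up a value in the dictionary then yields the column position used for that value, e.g. int_indexing[5] for the value 5.
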
indexingValue(value)[source]

Returns the index of the domain list given an attribute value. This is a function because different attribute classes can compute this index differently.

Parameters:value – Value that is in the domain
name

The name of the attribute has to be unique among the attributes of the same class, e.g. two attributes in different entities could have the same name.

parents

A list of the parents, all of type Attribute

probabilistic[source]

Returns True if the attribute is probabilistic. Override this function in all non-probabilistic attributes, e.g. NotProbabilisticAttribute or ForeignAttribute

type[source]

The type of an attribute, e.g. BinaryAttribute, IntegerAttribute, EnumeratedAttribute, NotProbabilisticAttribute

class prm.attribute.BinaryAttribute(name, er, hidden=False)[source]

A Binary Attribute can only take on two different values

indexingValue(value)[source]

Returns the index of the domain list given an attribute value. For a binary value it is faster to just return the value since it corresponds to the index.

Parameters:value – 0 or 1
class prm.attribute.EnumeratedAttribute(name, er, attrValues, hidden=False)[source]

An Enumerated Attribute can take the values stored in domain. Note that there are no constraints on what is passed in attrValues. When working with strings, performance will be lower because many string operations have to be executed.

cardinality

Size of domain

domain

List of domain values

indexingValue(value)[source]

Returns the index of the domain list given an attribute value.

Parameters:value – Value that is in the domain
class prm.attribute.ExistAttribute(name, er, hidden=False)[source]

An Exist Attribute is a binary variable used when making inference in uncertain relationships. Reference uncertainty implies that we don’t know which objects of two associated entities are connected through the relationship. Thus we assume a full relationship, meaning that there is an object for every possible combination of entity objects. If the exist attribute of a relationship object is 1, then the object is considered to be present in the data. Relationships are usually sparse, i.e. only a few connections exist. There is no need to store all possible connections, neither in the data nor in the model. The relationship type is usually even more restricted (e.g. 1:n), allowing for an efficient representation.

indexingValue(value)[source]

Returns the index of the domain list given an attribute value. For a binary value it is faster to just return the value since it corresponds to the index.

Parameters:value – 0 or 1
class prm.attribute.ForeignAttribute(name, target, er, hidden=False)[source]

A Foreign Attribute is part of the primary key of a relationship class. The foreign attribute points to the primary key of an entity class, which is stored in target.

CPD

The CPD is shared with the target attribute. As the target is often the primary key, the CPD would be None.

fullname[source]

Overridden from class Attribute. The full name of a foreign attribute is ‘Relationship_name.Target_name’.

probabilistic[source]

The foreign attribute itself can’t be probabilistic

target

The target is an attribute of an entity class that is connected to the relationship class the foreign attribute is part of. Often this is the primary key.

class prm.attribute.IntegerAttribute(name, er, attrRange, hidden=False)[source]

An Integer Attribute can take values in a certain interval

cardinality

Size of domain

domain

List of integer values

indexingValue(value)[source]

Returns the index of the domain list given an attribute value.

Since the domain of an Integer Attribute is an interval, it is faster to just subtract.

Parameters:value – Int value that is in the domain
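
The subtraction shortcut can be sketched as follows. The class IntegerDomain is a hypothetical stand-in, and treating both interval bounds as inclusive is an assumption, since the format of attrRange is not described here.

```python
class IntegerDomain:
    """Hypothetical stand-in for an integer attribute with a contiguous domain."""

    def __init__(self, low, high):
        # Assumption: both bounds are inclusive.
        self.low = low
        self.domain = list(range(low, high + 1))
        self.cardinality = len(self.domain)

    def indexingValue(self, value):
        # No dictionary lookup needed: the offset from the lower bound is the index.
        return value - self.low


age = IntegerDomain(18, 22)
index_of_20 = age.indexingValue(20)
```
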
class prm.attribute.NotProbabilisticAttribute(name, er, hidden=False)[source]

An Attribute that is not probabilistic, which means that it will not have a local distribution and that it can’t be part of any probabilistic dependency. It is required for slotchains that use the non probabilistic primary keys.

probabilistic[source]

Overridden from the Attribute class; always returns False

prm.attribute.attributeFactory(name, er, type, attrDef, probabilistic=True, hidden=False)[source]

Returns an instance of the attribute of type type.

prm.attribute.topologicalSort(attributes)[source]

Returns a list of attributes that are topologically sorted. A topological sort or topological ordering of a directed acyclic graph (DAG) is a linear ordering of its nodes in which each node comes before all nodes to which it has outbound edges. Every DAG has one or more topological sorts <http://en.wikipedia.org/wiki/Topological_sorting>.
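
As an illustration, a topological ordering over attributes and their parents can be computed with Kahn's algorithm. This is a sketch under assumptions, not the ProbReM implementation: the Attr class is a hypothetical stand-in exposing only name and parents, and every parent is assumed to be contained in the input list.

```python
from collections import deque


class Attr:
    """Hypothetical stand-in exposing only a name and a list of parents."""

    def __init__(self, name):
        self.name = name
        self.parents = []


def topological_sort(attributes):
    """Order attributes so that every parent precedes its children (Kahn's algorithm)."""
    remaining = {a: len(a.parents) for a in attributes}   # unprocessed parents per node
    children = {a: [] for a in attributes}
    for a in attributes:
        for p in a.parents:
            children[p].append(a)

    ready = deque(a for a in attributes if remaining[a] == 0)
    ordered = []
    while ready:
        a = ready.popleft()
        ordered.append(a)
        for c in children[a]:
            remaining[c] -= 1
            if remaining[c] == 0:
                ready.append(c)

    if len(ordered) != len(attributes):
        raise ValueError("dependency graph contains a cycle")
    return ordered


difficulty, intelligence, grade = Attr("difficulty"), Attr("intelligence"), Attr("grade")
grade.parents = [difficulty, intelligence]
order = [a.name for a in topological_sort([grade, intelligence, difficulty])]
```

In this example grade is placed last because both of its parents have to come first.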

relationalschema module

class prm.relationalschema.ERClass[source]

This abstract class serves as a container for the objects that contain attributes. These objects are either Entity classes or Relationship classes; each can contain Attribute classes which themselves have to know which container object they belong to. The Entity/Relationship classes inherit the ERClass class. Therefore, an attribute can find the type of its container object by calling self.erClass.type()

Inheritance diagram of prm.relationalschema

isEntity()[source]

Returns True if the type is Entity

isRelationship()[source]

Returns True if the type is Relationship

isUncertainRelationship()[source]

Returns True if the type is UncertainRelationship

type()[source]

The type of an ERClass, either Entity or Relationship

class prm.relationalschema.Entity(name)[source]

Represents an entity class in the relational schema.

attributes

List that contains the Attribute references of the entity class

name

Unique name

pk

The primary key is a list of Attribute objects of the entity. The pk is created automatically as a NotProbabilisticAttribute if not specified otherwise. It is stored as a list with just one item.

pk_string

String representation of primary key

relationships

List that contains the Relationship references that are connected to the entity.

class prm.relationalschema.Relationship(name)[source]

A relationship class relates two entity classes (implicitly using their primary keys as identifiers). Note a possible source of confusion: Relationship refers to the Entity-Relationship model and is not to be confused with the probabilistic Dependency, which is conceptually also a relationship.

attributes

A dictionary that contains the attribute references of the relationship class {key : Attribute name, value : Attribute}

entities

List of Entities connected to the relationship

foreign

Dictionary representation of self.pk where the key is an entity and the value is a list of foreign attributes that belong to that entity, e.g. {key = Entity : value = [ForeignAttribute, ...]}

name

Unique name

pk

The primary key of a relationship class is usually specified by the set of foreign keys of connected entities. A relationship class has a primary key that consists of a list of ForeignAttribute instances whose targets are attributes of the connecting entities (usually their primary key attributes).

pk_string

List of string representation of self.pk

class prm.relationalschema.UncertainRelationship(name, nTok, k)[source]

Reference uncertainty introduces uncertainty about the structure of the data itself, e.g. the entries of a relationship table of an ER diagram, and thus the state space of the Markov Chain increases considerably. We associate a binary exist variable with every possible entry in uncertain relationship tables. As the number of exist attributes grows exponentially with the size of the tables, inference becomes intractable. We avoid the explosion of the state space by introducing a constraint attribute that enforces certain structural properties, e.g. a 1:n relationship. However, this results in complex probabilistic dependencies among the exist objects. A more involved Metropolis-Hastings algorithm is required that samples exist objects using an appropriate proposal distribution. A proposal is an assignment to all exist objects associated with one constraint object, which allows us to introduce probabilistic dependencies that would not be allowed in a traditional PRM.

existAttribute

The exist attribute of type ExistAttribute.

k

The value of k in the n:k relationship type. This parameter serves as a fixed-parameter tractability approach; for more information see the documentation.

kEnitity

Reference to the Entity that is on the k-side of the relationship

nEnitity

Reference to the Entity that is on the n-side of the relationship

nTok

Boolean. The type of a relationship is n:k (=True) or k:n (=False). If a type is specified, it is assumed to describe the relation between the first two primary keys in pk.

dependency module

class prm.dependency.Dependency(name, parent, child, constraint, aggregator, attributes)[source]

A dependency represents a probabilistic dependency between two Attribute classes, the child and the parent attribute.

aggregator

Aggregation is necessary when a dependency is of type 1:n or m:n as there will be multiple parent objects mapping to a child object’s CPD that has only one parameter for this parent attribute. Aggregation can be any function f(pa1,pa2,...) = pa_{aggr} , see data.aggregation
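
Two possible aggregation functions are sketched below for illustration. The function names are hypothetical, and the functions actually provided by data.aggregation are not shown here.

```python
from collections import Counter


def aggregate_mode(parent_values):
    """Collapse multiple parent values into the most frequent one."""
    return Counter(parent_values).most_common(1)[0][0]


def aggregate_max(parent_values):
    """Collapse multiple parent values into the largest one."""
    return max(parent_values)


# Several parent objects map to one child object; the CPD expects a single value.
parent_values = [1, 1, 3, 2]
mode_value = aggregate_mode(parent_values)
max_value = aggregate_max(parent_values)
```

Whatever function is chosen, its result has to lie in the domain of the parent attribute so that it can be used to index the child's CPD.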

child

The child Attribute instance is the dependent variable.

computeSlotChain()[source]

The SlotChain is computed via a depth first search algorithm. As there can’t be loops in the relational schema, we can return the first path that we encounter.

Note that when the model doesn’t load, it is usually because of the infinite loop that only quits when a slot chain was found. So far that always resulted from an error in the specification and not in the code...

Another disadvantage is that there could be multiple paths in the same schema. In fact you could define a different dependency for each different path. This method uses the first path that is found as the slotchain.
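
A depth-first search that returns the first path it finds can be sketched as follows. This is not the ProbReM code: the schema is represented as a plain adjacency dictionary, the Student, takes and Course names are invented for the example, and the sketch keeps a visited set so that it terminates even when no path exists.

```python
def find_slotchain(schema, start, goal):
    """Return the first path from start to goal found by depth-first search, or None."""
    stack = [(start, [start])]
    visited = set()
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        # Push neighbours in reverse so that they are explored in listed order.
        for neighbour in reversed(schema[node]):
            if neighbour not in visited:
                stack.append((neighbour, path + [neighbour]))
    return None


# Entities and relationships alternate along every path in this toy schema.
schema = {
    "Professor": ["advisor"],
    "advisor": ["Professor", "Student"],
    "Student": ["advisor", "takes"],
    "takes": ["Student", "Course"],
    "Course": ["takes"],
}
chain = find_slotchain(schema, "Professor", "Course")
```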

configureConstraint(attributes)[source]

If a constraint has been defined in the specification, e.g. in the following form:

self.constraint = "...,e1.a1=r1.a2,r1.a3,e2.a4,..."

where e1, e2 are of type Entity, r1 of type Relationship and a1, a2, a3, a4 are of type Attribute. From this string slotchain, slotchain_string and slotchain_attr_string can be extracted. In case no constraint has been specified, computeSlotChain() is called to compute a traditional slotchain.

Parameters:attributes – All Attribute instances in the model
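
The extraction step can be sketched as simple string parsing. This is a rough illustration under assumptions: the helper parse_constraint and the example constraint string are hypothetical, and the real parser may treat terms without an equals sign differently.

```python
def parse_constraint(constraint):
    """Split a constraint string into its terms and the ordered ER class names."""
    attr_strings = [term.strip() for term in constraint.split(",") if term.strip()]
    class_names = []
    for term in attr_strings:
        for side in term.split("="):
            # The part before the dot names the entity or relationship class.
            er_name = side.split(".")[0].strip()
            if er_name not in class_names:
                class_names.append(er_name)
    return attr_strings, class_names


constraint = ("Professor.professor_id=advisor.professor_id,"
              "advisor.student_id=Student.student_id")
attr_strings, class_names = parse_constraint(constraint)
```

Here attr_strings plays the role of slotchain_attr_string and class_names the role of slotchain_string.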
constraint

The constraint of a dependency defines how the attribute objects in the relational skeleton are connected. Introduced by Heckerman et al. in the DAPER model, the concept of a constraint is a generalized version of the slotchain introduced by Getoor et al.

name

Unique name of the dependency

parent

The parent is an Attribute instance

slotchain

Even though the probabilistic dependency uses the constraint when specifying a PRM model, the constraint often results in the traditional slotchain: the ‘path’ through the relational schema that links the parent and child attribute via a list of entities and relationships, connected by foreign keys. The elements in the list slotchain alternate between entities and relationships, [..., Entity, Relationship, Entity, ...]

slotchainToString()[source]
Returns:String representation of slotchain
slotchain_attr_string

List of the string representations of the attributes that define the slotchain, e.g. Professor.professor_id=advisor.professor_id

slotchain_erclass_exclusive

Special Dictionary representation of the slotchain. The key is an Entity, and the value is basically self.slotchain_attr_string without all entries that contain the key entity {key = ERClass : value = list of string constraints }.

slotchain_string

List containing the string representation (e.g. Professor, advisor) of the slotchain entities/relationships

class prm.dependency.UncertainDependency(name, parent, child, constraint, aggregator, attributes)[source]

Reference uncertainty introduces uncertainty about the structure of the data itself, e.g. the entries of a relationship table of an ER diagram, and thus the state space of the Markov Chain increases considerably. We associate a binary exist variable with every possible entry in uncertain relationship tables. As the number of exist attributes grows exponentially with the size of the tables, inference becomes intractable. We avoid the explosion of the state space by introducing a constraint attribute that enforces certain structural properties, e.g. a 1:n relationship. However, this results in complex probabilistic dependencies among the exist objects. A more involved Metropolis-Hastings algorithm is required that samples exist objects using an appropriate proposal distribution. A proposal is an assignment to all exist objects associated with one constraint object, which allows us to introduce probabilistic dependencies that would not be allowed in a traditional PRM.

kAttribute

Reference to the Attribute, i.e. a foreign key in an entity instance, that is on the k-side of the relationship. It is either the parent or the child.

nAttribute

Reference to the Attribute, i.e. a foreign key in an entity instance, that is on the n-side of the relationship. It is either the parent or the child.

nIsParent

Is True if self.nAttribute and self.parent refer to the same attribute instance

uncertainRelationship

Reference to the UncertainRelationship instance that the dependency is associated with

localdistribution module

The model parameters in a ProbReM project are the conditional probability distributions (CPDs) defined for each probabilistic attribute in the model. The terms CPD and local distribution are used interchangeably.

Inheritance diagram of prm.localdistribution

class prm.localdistribution.CPD(attr)[source]

A conditional probability distribution CPD is defined for an attribute. This is an abstract version of a CPD that defines a set of methods all CPD implementations must provide.

attr

The Attribute that the CPD is associated with

logLikelihood(fullAssignment)[source]
Parameters:fullAssignment – List of values ordered as [attributeValue,parentValue1,parentValue2,....]
Returns:Loglikelihood of fullAssignment
sample(paAssignment)[source]
Parameters:paAssignment – List of parent values
Returns:Randomly drawn sample of the CPD given the paAssignment
save()[source]

Saves the CPD to disk

class prm.localdistribution.CPDTabular(attr)[source]

The tabular representation of a CPD for discrete variables. A matrix of dimensions m x n, where

  • m is the number of possible parent assignments \prod_{pa \in Parents} |V(pa)|
  • n is the cardinality of the attribute domain |V(attr)|

This matrix grows exponentially with the number of parents and is thus not suited for large V-structures.
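
The dimensions can be illustrated with a short sketch. The cardinalities below are example values, and the uniform rows are only a placeholder for learned parameters.

```python
import numpy as np

parent_cardinalities = [2, 3]   # |V(pa)| for each parent (example values)
attr_cardinality = 4            # |V(attr)|

m = int(np.prod(parent_cardinalities))   # number of possible parent assignments
n = attr_cardinality

# One row per parent assignment, one column per attribute value.
cpdMatrix = np.full((m, n), 1.0 / n)
```

Adding one more parent with cardinality c multiplies the number of rows by c, which is the exponential growth mentioned above.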

computeCumulativeDist()[source]

Calculates the cumulative distribution of the tabular CPD by incrementally summing the columns

computeLogDists()[source]

Calculates the log probability distribution cpdLogMatrix and cumulative log probability distribution cumLogMatrix
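
A minimal numpy sketch of both computations, assuming each row of cpdMatrix is a distribution over the attribute values and contains no zero entries (the logarithm of zero would be negative infinity). The example matrix is made up.

```python
import numpy as np

cpdMatrix = np.array([[0.2, 0.5, 0.3],
                      [0.1, 0.1, 0.8]])

# Running sum across the columns of each row; the last column is 1.
cumMatrix = np.cumsum(cpdMatrix, axis=1)

# Element-wise logarithms of both matrices.
cpdLogMatrix = np.log(cpdMatrix)
cumLogMatrix = np.log(cumMatrix)
```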

conditionalDist(gbnV)[source]

Returns the conditional probability distribution of the gbnV given its parent values.

Parameters:gbnV – GBN instance
Returns:A 1 x |attr.domain| numpy.array probability distribution
cpdLogMatrix

Log values of cpdMatrix

cpdMatrix

The CPD matrix of type numpy.array. The rows represent different parent assignments, the columns of a row define the distribution over the attribute.

cpdMatrixDim

Dimension of cpdMatrix

cumLogMatrix

Log values of cumMatrix

cumMatrix

Cumulative cpdMatrix. Computed by computeCumulativeDist()

indexColumn(attrValue)[source]

See indexingCPD()

indexRow(parentAssignment)[source]

See indexingCPD()

indexingCPD(currentRow)[source]

Returns the row and column indices for a full assignment to the attribute attr. indexRow is the index of the row of the cpd matrix that corresponds to the assignment of the parent attributes. The parents attribute values are ordered the same way as in attr.parents. indexColumn is the index of the column that corresponds to the assignment of the attribute value itself.

Parameters:currentRow – List containing a full assignment, [attributeValue,parentValue1,parentValue2,....]
Returns:Tuple [indexRow,indexColumn]
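
One way to compute such indices is a mixed-radix encoding of the parent assignment. The sketch below is an assumption about the scheme, not the ProbReM code: it uses a row-major ordering, hypothetical helper names, and expects values that have already been converted to domain indices (e.g. via indexingValue).

```python
def index_multipliers(parent_cardinalities):
    """Multipliers for a row-major (mixed-radix) encoding of parent assignments."""
    multipliers = [1] * len(parent_cardinalities)
    for i in range(len(parent_cardinalities) - 2, -1, -1):
        multipliers[i] = multipliers[i + 1] * parent_cardinalities[i + 1]
    return multipliers


def indexing_cpd(full_assignment_indices, multipliers):
    """Return (row, column) for [attrIndex, parentIndex1, parentIndex2, ...]."""
    column = full_assignment_indices[0]
    row = sum(i * m for i, m in zip(full_assignment_indices[1:], multipliers))
    return row, column


# Two parents with cardinalities 2 and 3 give 6 rows.
multipliers = index_multipliers([2, 3])
row, column = indexing_cpd([1, 1, 2], multipliers)
```
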
initCPD()[source]

Computes the number of possible parent assignments and the index multipliers needed to compute the row index of a given parent assignment, see indexingCPD().

logLikelihood(fullAssignment)[source]
Parameters:fullAssignment – List of values ordered as [attributeValue,parentValue1,parentValue2,....]
Returns:Loglikelihood of fullAssignment using cpdLogMatrix
reverseIndexRow(index)[source]

Computes the parent assignment given a row index of cpdMatrix

Parameters:index – Row index of cpdMatrix
Returns:Parent assignment associated with index
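
Decoding a row index back into a parent assignment can be sketched with repeated division. The helper name is hypothetical, and a row-major ordering of parent assignments is assumed. The result is a list of domain indices, not domain values.

```python
def reverse_index_row(index, parent_cardinalities):
    """Decode a row index into a list of parent value indices (row-major order)."""
    assignment = []
    for cardinality in reversed(parent_cardinalities):
        index, digit = divmod(index, cardinality)
        assignment.append(digit)
    return assignment[::-1]


decoded = reverse_index_row(5, [2, 3])
all_rows = [reverse_index_row(i, [2, 3]) for i in range(6)]
```
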
sample(paAssignment)[source]

Samples a random value using cumMatrix

Parameters:paAssignment – List of parent values
Returns:Randomly drawn sample of the CPD given the paAssignment
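
Sampling from a cumulative row is commonly done by inverse transform sampling. The following is a sketch of that technique with a hypothetical helper and a made-up matrix, not necessarily how sample() is implemented.

```python
import numpy as np


def sample_column(cumMatrix, row, rng):
    """Draw a column index from the distribution stored in one row of cumMatrix."""
    u = rng.random()   # uniform in [0, 1)
    # First column whose cumulative probability exceeds u.
    column = int(np.searchsorted(cumMatrix[row], u, side="right"))
    # Guard against floating-point round-off in the last cumulative entry.
    return min(column, cumMatrix.shape[1] - 1)


cumMatrix = np.cumsum(np.array([[0.2, 0.5, 0.3],
                                [0.0, 0.0, 1.0]]), axis=1)
rng = np.random.default_rng(0)

# Row 1 puts all probability mass on the last column.
draws = [sample_column(cumMatrix, 1, rng) for _ in range(20)]
mixed = [sample_column(cumMatrix, 0, rng) for _ in range(20)]
```

The returned column index still has to be mapped back to a domain value, e.g. via the attribute's domain list.
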
save(relPath='./localdistributions')[source]

Saves cpdMatrix to disk using numpy.save and outputs the XML specification that can be added to the PRM specification.

Parameters:relPath – Relative path to the local distribution files, starting from the directory where the model is instantiated from.
class prm.localdistribution.CPDTree[source]

Future implementation for a CPD based on a decision tree. No need so far.