Search¶
Methods for searching for conjunctions in a binary (formal) search context.
Overview¶
|
Formal context, i.e., a binary relation between a set of objects and a set of attributes, i.e., Boolean functions that are defined on the objects. |
|
Searches the prefix-tree of core queries given a certain |
|
Simple best-in greedy search for conjunctive query. |
Dictionary of available search methods. |
Details¶
-
class
realkd.search.
Context
(attributes, objects, sort_attributes=True)¶ Formal context, i.e., a binary relation between a set of objects and a set of attributes, i.e., Boolean functions that are defined on the objects.
A formal context provides a search context (search space) over conjunctions that can be formed from the individual attributes.
-
complete_closure
(gen_idx, bit_extension, part_closure)¶ - Parameters
gen_idx –
bit_extension –
closure_prefix –
- Returns
>>> table = [[0, 1, 0, 1], ... [1, 1, 1, 0], ... [1, 0, 1, 0], ... [0, 1, 0, 1]] >>> ctx = Context.from_tab(table) >>> clo = bitarray('1000') >>> ctx.complete_closure(0, bitarray('0110'), clo) 2 >>> clo bitarray('1010') >>> clo = bitarray('1110') >>> ctx.complete_closure(1, bitarray('0100'), clo) 4 >>> clo bitarray('1110')
-
extension
(intent)¶ - Parameters
intent – attributes describing a set of objects
- Returns
indices of objects that have all attributes in intent in common
-
find_small_crit_index
(gen_idx, bit_extension, part_closure)¶ >>> table = [[0, 1, 0, 1], ... [1, 1, 1, 0], ... [1, 0, 1, 0], ... [0, 1, 0, 1]] >>> ctx = Context.from_tab(table) >>> root = Node([], bitarray('0000'), array([0,1,2,3]), bitarray('1111'), -1, -4, 1, inf) >>> ctx.find_small_crit_index(0, bitarray('0110'), bitarray('1000')) 4 >>> ctx.find_small_crit_index(1, bitarray('1101'), bitarray('0100')) 4 >>> ctx.find_small_crit_index(2, bitarray('0110'), bitarray('0010')) 0 >>> ctx.find_small_crit_index(3, bitarray('1001'), bitarray('0001')) 1 >>> ctx.find_small_crit_index(3, bitarray('1001'), bitarray('0101')) 4
-
static
from_df
(df, without=None, max_col_attr=10, sort_attributes=True, discretization=<function qcut>, **kwargs)¶ Generates formal context from pandas dataframe by applying inter-ordinal scaling to numerical data columns and for object columns creating one attribute per value.
For inter-ordinal scaling a maximum number of attributes per column can be specified. If required, threshold values are then selected by the provided discretization function (per default quantile-based).
The restriction should also be implemented for object columns in the future (by merging small categories into disjunctive propositions).
The generated attributes correspond to pandas-compatible query strings. For example:
>>> titanic_df = pd.read_csv("../datasets/titanic/train.csv") >>> titanic_df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin'], inplace=True) >>> titanic_ctx = Context.from_df(titanic_df, max_col_attr=6, sort_attributes=False) >>> titanic_ctx.m 891 >>> titanic_ctx.attributes [Survived<=0, Survived>=1, Pclass<=1, Pclass<=2, Pclass>=2, Pclass>=3, Sex==male, Sex==female, Age<=23.0, Age>=23.0, Age<=34.0, Age>=34.0, Age<=80.0, Age>=80.0, SibSp<=8.0, SibSp>=8.0, Parch<=6.0, Parch>=6.0, Fare<=8.6625, Fare>=8.6625, Fare<=26.0, Fare>=26.0, Fare<=512.3292, Fare>=512.3292, Embarked==S, Embarked==C, Embarked==Q, Embarked==nan] >>> titanic_ctx.n 28 >>> titanic_df.query('Survived>=1 & Pclass>=3 & Sex=="male" & Age>=34') Survived Pclass Sex Age SibSp Parch Fare Embarked 338 1 3 male 45.0 0 0 8.050 S 400 1 3 male 39.0 0 0 7.925 S 414 1 3 male 44.0 0 0 7.925 S >>> titanic_ctx.extension([1, 5, 6, 11]) array([338, 400, 414])
>>> titanic_ctx = Context.from_df(titanic_df, max_col_attr=defaultdict(lambda: None, Age=6, Fare=6), ... sort_attributes=False) >>> titanic_ctx.attributes [Survived<=0, Survived>=1, Pclass<=1, Pclass<=2, Pclass>=2, Pclass>=3, Sex==male, Sex==female, Age<=23.0, Age>=23.0, Age<=34.0, Age>=34.0, Age<=80.0, Age>=80.0, SibSp<=0, SibSp<=1, SibSp>=1, SibSp<=2, SibSp>=2, SibSp<=3, SibSp>=3, SibSp<=4, SibSp>=4, SibSp<=5, SibSp>=5, SibSp>=8, Parch<=0, Parch<=1, Parch>=1, Parch<=2, Parch>=2, Parch<=3, Parch>=3, Parch<=4, Parch>=4, Parch<=5, Parch>=5, Parch>=6, Fare<=8.6625, Fare>=8.6625, Fare<=26.0, Fare>=26.0, Fare<=512.3292, Fare>=512.3292, Embarked==S, Embarked==C, Embarked==Q, Embarked==nan]
- Parameters
df (DataFrame) – pandas dataframe to be converted to formal context
max_col_attr (int) – maximum number of attributes generated per column; or None if an arbitrary number of attributes is permitted; or dict (usually defaultdict) with keys being columns ids of df and values being the maximum number of attributes for the corresponding column (again using None if no bound for a specific column); Note: use defaultdict(lambda: None) instead of defaultdict(None) to specify no maximum per default
discretization (callable) – the discretization function to be used when number of thresholds has to be reduced to a specificed maximum (function has to have identical signature to pandas.qcut, which is the default)
without (Iterable[str]) – columns to ommit
- Returns
Context
representing dataframe
-
static
from_tab
(table, sort_attributes=False)¶ Converts an input table where each row represents an object into a formal context (which uses column-based representation).
Uses Boolean interpretation of table values to determine attribute presence for an object.
For instance:
>>> table = [[0, 1, 0, 1], ... [1, 1, 1, 0], ... [1, 0, 1, 0], ... [0, 1, 0, 1]] >>> ctx = Context.from_tab(table) >>> list(ctx.extension([0, 2])) [1, 2]
- Parameters
table – table that explicitly encodes object/attribute relation
- Returns
context over table row indices as objects and one tabulated feature per column index
-
-
class
realkd.search.
CoreQueryTreeSearch
(ctx, obj, bnd, order='bestboundfirst', apx=1.0, max_depth=10, verbose=False, **kwargs)¶ Searches the prefix-tree of core queries given a certain
Context
.The idea of core queries is that there is only one core query per potential query extensions. Thus using them as solution candidates results in a strongly condensed search space compared to naively searching all conjunctions.
Core queries have been originally introduced for closed itemset mining [UAUA04]. The following is a simple recursive definition that does not require the notion of closures:
The trivial query \(q = \top\) is a core query. Moreover, the tail augmentation \(qp_i\) of a core query \(q\) is also a core query if \(q \not\rightarrow p_i\) and
\begin{equation} \text{for all } j < i, \text{ if } q' \rightarrow p_j \text{ then } q \rightarrow p_j \enspace . \end{equation}- Parameters
ctx (Context) – the context defining the search space
obj (callable) – objective function
bnd (callable) – bounding function satisfying that
bnd(q) >= max{obj(r) for r in successors(q)}
order (str) – traversal order (
'breadthfirst'
,'depthfirst'
,'bestboundfirst'
, or'bestvaluefirst'
; seetraversal_orders
)apx (float) – approximation factor that determines guarantee of what fraction of search space will be traversed, i.e., all nodes q are visited with
obj(q) >= apx * opt
(default1.0
, i.e., optimum will be visited; smaller values means less traversal elementsmax_depth (int) – maximum depth of explored search nodes
verbose (int) – level of verbosity
-
run
()¶ Runs the configured search.
- Returns
Conjunction
that (approximately) maximizes objective
-
traversal
()¶ A first example with trivial objective and bounding function is as follows. In this example the optimal extension is the empty extension, which is generated via the the non-lexicographically smallest generator [0, 3].
>>> table = [[0, 1, 0, 1], ... [1, 1, 1, 0], ... [1, 0, 1, 0], ... [0, 1, 0, 1]] >>> ctx = Context.from_tab(table) >>> search = CoreQueryTreeSearch(ctx, lambda e: -len(e), lambda e: 0, order='breadthfirst') >>> for n in search.traversal(): ... print(n) N([], [], -4, inf, [0 1 2 3]) N([0], [0 2], -2, 0, [1 2]) N([1], [1], -3, 0, [0 1 3]) N([2], [0 2], -2, 0, [1 2]) N([3], [1 3], -2, 0, [0 3]) N([0, 1], [0 1 2], -1, 0, [1]) N([0, 3], [0 1 2 3], 0, 0, [])
>>> search.verbose = True >>> search.run() Found optimum after inspecting 7 nodes: [0, 3] Completing closure Greedy simplification: [0, 3] c0 & c3
>>> CoreQueryTreeSearch(ctx, lambda e: 5-len(e), lambda e: 4+(len(e)>=2), apx=0.7).run() c0 & c1
Let’s use more realistic objective and bounding functions based on values associated with each object (row in the table).
>>> values = [-1, 1, 1, -1] >>> f = lambda e: sum((values[i] for i in e))/4 >>> g = lambda e: sum((values[i] for i in e if values[i]==1))/4 >>> search = CoreQueryTreeSearch(ctx, f, g) >>> for n in search.traversal(): ... print(n) N([], [], 0, inf, [0 1 2 3]) N([0], [0 2], 0.5, 0.5, [1 2]) N([2], [0 2], 0.5, 0.5, [1 2])
Finally, here is a complex example taken from the UdS seminar on subgroup discovery.
>>> table = [[1, 1, 1, 1, 0], ... [1, 1, 0, 0, 0], ... [1, 0, 1, 0, 0], ... [0, 1, 1, 1, 1], ... [0, 0, 1, 1, 1], ... [1, 1, 0, 0, 1]] >>> ctx = Context.from_tab(table) >>> labels = [1, 0, 1, 0, 0, 0] >>> from realkd.legacy import impact, cov_incr_mean_bound, impact_count_mean >>> f = impact(labels) >>> g = cov_incr_mean_bound(labels, impact_count_mean(labels)) >>> search = CoreQueryTreeSearch(ctx, f, g) >>> for n in search.traversal(): ... print(n) N([], [], 0, inf, [0 1 2 3 4 5]) N([0], [0], 0.11111, 0.22222, [0 1 2 5]) N([1], [1], -0.055556, 0.11111, [0 1 3 5]) N([2], [2], 0.11111, 0.22222, [0 2 3 4]) N([3], [2 3], 0, 0.11111, [0 3 4]) N([0, 2], [0 2], 0.22222, 0.22222, [0 2]) >>> search.run() c0 & c2
>>> labels = [1, 0, 0, 1, 1, 0] >>> f = impact(labels) >>> g = cov_incr_mean_bound(labels, impact_count_mean(labels)) >>> search = CoreQueryTreeSearch(ctx, f, g) >>> for n in search.traversal(): ... print(n) N([], [], 0, inf, [0 1 2 3 4 5]) N([0], [0], -0.16667, 0.083333, [0 1 2 5]) N([1], [1], 0, 0.16667, [0 1 3 5]) N([2], [2], 0.16667, 0.25, [0 2 3 4]) N([3], [2 3], 0.25, 0.25, [0 3 4])
-
traversal_orders
= {'bestboundfirst': <class 'realkd.search.BestBoundFirstBoundary'>, 'bestvaluefirst': <class 'realkd.search.BestValueFirstBoundary'>, 'breadthfirst': <class 'realkd.search.BreadthFirstBoundary'>, 'depthfirst': <class 'realkd.search.DepthFirstBoundary'>}¶ dictionary of available traversal orders for core query search
-
class
realkd.search.
GreedySearch
(ctx, obj, bdn=None, verbose=False, **kwargs)¶ Simple best-in greedy search for conjunctive query.
Search starts from the trivial (empty) query and adds an objective-maximizing conditions until no further improvement with a single augmentation is possible.
>>> table = [[1, 1, 1, 1, 0], ... [1, 1, 0, 0, 0], ... [1, 0, 1, 0, 0], ... [0, 1, 1, 1, 1], ... [0, 0, 1, 1, 1], ... [1, 1, 0, 0, 1]] >>> ctx = Context.from_tab(table) >>> labels = [1, 0, 1, 0, 0, 0] >>> from realkd.legacy import impact >>> f = impact(labels) >>> GreedySearch(ctx, f).run() c0 & c2
- Parameters
ctx (Context) – the context defining the search space
obj (callable) – objective function
bnd (callable) – bounding function satisfying that
bnd(q) >= max{obj(r) for r in successors(q)}
(for signature compatibility only, not currently used)verbose (int) – level of verbosity
-
run
()¶ Runs the configured search.
- Returns
Conjunction
that (approximately) maximizes objective
-
realkd.search.
search_methods
= {'exhaustive': <class 'realkd.search.CoreQueryTreeSearch'>, 'greedy': <class 'realkd.search.GreedySearch'>}¶ Dictionary of available search methods.
References¶
- UAUA04
Takeaki Uno, Tatsuya Asai, Yuzo Uchida, and Hiroki Arimura. An efficient algorithm for enumerating closed patterns in transaction databases. In Disc Science, 16–31. Springer, 2004.