@@ -32,10 +32,17 @@ head(RLdata500)
```

Next we specify the model parameters for the entity attributes.
- For simplicity, we use a uniform prior for the distortion probability
- associated with each attribute.
+ We define a beta prior for the distortion probabilities that favors low
+ distortion (positively skewed).

``` {r, eval=TRUE, message=FALSE, warning=FALSE}
- unif_prior <- BetaRV(1, 1)
+ beta_prior <- BetaRV(1, 4)
+ ```
+ We define a flexible Dirichlet Process prior (with a vague gamma hyperprior on
+ the concentration) for the distortion distribution.
+ The distortion distribution is used to pick an alternative attribute value for
+ a record if the entity attribute value is distorted.
+ ``` {r, eval=TRUE, message=FALSE, warning=FALSE}
+ dp_prior <- DirichletProcess(alpha = GammaRV(2, 1e-4))
```

We model the distortion for the name attributes (`fname_c1` and `lname_c1`)
@@ -54,20 +61,25 @@ with a constant distance function.
``` {r, eval=TRUE, message=FALSE}
attr_params <- list(
  fname_c1 = Attribute(transform_dist_fn(Levenshtein(), threshold = 3.0),
-                        distort_prob_prior = unif_prior),
+                        distort_prob_prior = beta_prior,
+                        distort_dist_prior = dp_prior),
  lname_c1 = Attribute(transform_dist_fn(Levenshtein(), threshold = 3.0),
-                        distort_prob_prior = unif_prior),
-   by = CategoricalAttribute(distort_prob_prior = unif_prior),
-   bm = CategoricalAttribute(distort_prob_prior = unif_prior),
-   bd = CategoricalAttribute(distort_prob_prior = unif_prior)
+                        distort_prob_prior = beta_prior,
+                        distort_dist_prior = dp_prior),
+   by = CategoricalAttribute(distort_prob_prior = beta_prior,
+                             distort_dist_prior = dp_prior),
+   bm = CategoricalAttribute(distort_prob_prior = beta_prior,
+                             distort_dist_prior = dp_prior),
+   bd = CategoricalAttribute(distort_prob_prior = beta_prior,
+                             distort_dist_prior = dp_prior)
)
```

Finally we specify the prior over the linkage structure (clustering). Here we
use a Pitman-Yor random partition for the prior, with hyperpriors on the
concentration and discount parameters.
``` {r, eval=TRUE, message=FALSE}
- clust_prior <- PitmanYorRP(alpha = GammaRV(1, 1), d = BetaRV(1, 1))
+ clust_prior <- PitmanYorRP(alpha = GammaRV(1, 1e-2), d = BetaRV(1, 1))
```

All that remains is to initialize the model and run inference.
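The model construction and inference call sit between this hunk and the next and are not shown in this diff. As a rough sketch of that step (assuming the package's model constructor is `exchanger()` and sampling is run via `run_inference()`; the exact names and arguments may differ from the vignette), it could look like:

``` {r, eval=FALSE}
## Sketch only: constructor and sampler names are assumptions, not taken from this diff.
## Combine the data with the attribute specification and clustering prior,
## then draw thinned posterior samples after a burn-in period.
model <- exchanger(RLdata500, attr_params, clust_prior)
result <- run_inference(model, n_samples = 100, thin_interval = 10, burnin_interval = 1000)
```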
@@ -83,9 +95,9 @@ We recommend inspecting trace plots to verify that the Markov chain has
reached equilibrium and is mixing well. The results below seem acceptable given
the small number of samples.
``` {r}
- n_linked_ents <- extract(result, "n_linked_ents")
- distort_probs <- extract(result, "distort_probs")
- distort_counts <- extract(result, "distort_counts")
+ n_linked_ents <- exchanger::extract(result, "n_linked_ents")
+ distort_probs <- exchanger::extract(result, "distort_probs")
+ distort_counts <- exchanger::extract(result, "distort_counts")
plot(n_linked_ents)
plot(distort_probs)
plot(distort_counts)