Commit 2d54385

Vignette: update parameter settings based on latest recommendations in the article
1 parent f35a51e commit 2d54385

1 file changed, +24 -12 lines changed


vignettes/RLdata500.Rmd

Lines changed: 24 additions & 12 deletions
@@ -32,10 +32,17 @@ head(RLdata500)
 ```
 
 Next we specify the model parameters for the entity attributes.
-For simplicity, we use a uniform prior for the distortion probability
-associated with each attribute.
+We define a beta prior for the distortion probabilities that favors low
+distortion (positively skewed).
 ```{r, eval=TRUE, message=FALSE, warning=FALSE}
-unif_prior <- BetaRV(1, 1)
+beta_prior <- BetaRV(1, 4)
+```
+We define a flexible Dirichlet Process prior (with a vague gamma hyperprior on
+the concentration) for the distortion distribution.
+The distortion distribution is used to pick an alternative attribute value for
+a record if the entity attribute value is distorted.
+```{r, eval=TRUE, message=FALSE, warning=FALSE}
+dp_prior <- DirichletProcess(alpha = GammaRV(2, 1e-4))
 ```
 
 We model the distortion for the name attributes (`fname_c1` and `lname_c1`)
We model the distortion for the name attributes (`fname_c1` and `lname_c1`)
@@ -54,20 +61,25 @@ with a constant distance function.
 ```{r, eval=TRUE, message=FALSE}
 attr_params <- list(
   fname_c1 = Attribute(transform_dist_fn(Levenshtein(), threshold = 3.0),
-                       distort_prob_prior = unif_prior),
+                       distort_prob_prior = beta_prior,
+                       distort_dist_prior = dp_prior),
   lname_c1 = Attribute(transform_dist_fn(Levenshtein(), threshold = 3.0),
-                       distort_prob_prior = unif_prior),
-  by = CategoricalAttribute(distort_prob_prior = unif_prior),
-  bm = CategoricalAttribute(distort_prob_prior = unif_prior),
-  bd = CategoricalAttribute(distort_prob_prior = unif_prior)
+                       distort_prob_prior = beta_prior,
+                       distort_dist_prior = dp_prior),
+  by = CategoricalAttribute(distort_prob_prior = beta_prior,
+                            distort_dist_prior = dp_prior),
+  bm = CategoricalAttribute(distort_prob_prior = beta_prior,
+                            distort_dist_prior = dp_prior),
+  bd = CategoricalAttribute(distort_prob_prior = beta_prior,
+                            distort_dist_prior = dp_prior)
 )
 ```
 
 Finally we specify the prior over the linkage structure (clustering). Here we
 use a Pitman-Yor random partition for the prior, with hyperpriors on the
 concentration and discount parameters.
 ```{r, eval=TRUE, message=FALSE}
-clust_prior <- PitmanYorRP(alpha = GammaRV(1, 1), d = BetaRV(1, 1))
+clust_prior <- PitmanYorRP(alpha = GammaRV(1, 1e-2), d = BetaRV(1, 1))
 ```
 
 All that remains is to initialize the model and run inference.
All that remains is to initialize the model and run inference.
@@ -83,9 +95,9 @@ We recommend inspecting trace plots to verify that the Markov chain has
 reached equilibrium and is mixing well. The results below seem acceptable given
 the small number of samples.
 ```{r}
-n_linked_ents <- extract(result, "n_linked_ents")
-distort_probs <- extract(result, "distort_probs")
-distort_counts <- extract(result, "distort_counts")
+n_linked_ents <- exchanger::extract(result, "n_linked_ents")
+distort_probs <- exchanger::extract(result, "distort_probs")
+distort_counts <- exchanger::extract(result, "distort_counts")
 plot(n_linked_ents)
 plot(distort_probs)
 plot(distort_counts)
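
Note on the new hyperparameters: the commit replaces the uniform distortion prior `BetaRV(1, 1)` with `BetaRV(1, 4)` and loosens the gamma hyperpriors on the concentration parameters. The sketch below uses base R only (no exchanger calls) to show roughly what those settings imply. It assumes that `BetaRV(a, b)` denotes a Beta distribution with shape parameters `a` and `b`, and that `GammaRV(a, b)` is parameterized by shape `a` and rate `b`. Neither parameterization is stated in the diff, so check the package documentation before relying on these numbers. The helpers `beta_mean` and `gamma_mean` are hypothetical names introduced only for this illustration.

```r
# Base R only. Assumed parameterizations (not confirmed by the diff):
#   BetaRV(a, b)  ~ Beta(shape1 = a, shape2 = b)
#   GammaRV(a, b) ~ Gamma(shape = a, rate = b)

# Hypothetical helpers for the prior means.
beta_mean <- function(a, b) a / (a + b)
gamma_mean <- function(shape, rate) shape / rate

# Distortion probability prior: uniform (old) versus Beta(1, 4) (new).
old_distort_mean <- beta_mean(1, 1)
new_distort_mean <- beta_mean(1, 4)

# Prior mass placed on distortion probabilities below 10 percent.
old_mass_below_10pct <- pbeta(0.1, shape1 = 1, shape2 = 1)
new_mass_below_10pct <- pbeta(0.1, shape1 = 1, shape2 = 4)

# Pitman-Yor concentration hyperprior: Gamma(1, 1) (old) versus Gamma(1, 1e-2) (new).
old_alpha_mean <- gamma_mean(1, 1)
new_alpha_mean <- gamma_mean(1, 1e-2)

# Dirichlet Process concentration hyperprior introduced by this commit.
dp_alpha_mean <- gamma_mean(2, 1e-4)

print(c(old = old_distort_mean, new = new_distort_mean))
print(c(old = old_mass_below_10pct, new = new_mass_below_10pct))
print(c(old = old_alpha_mean, new = new_alpha_mean, dp = dp_alpha_mean))
```

Under these assumptions the prior mean distortion probability drops from 0.5 to 0.2, which matches the "favors low distortion" wording in the vignette text, while the larger gamma means make both concentration hyperpriors much more diffuse.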
