Some ambiguous and some likely wrong translations #53

@GruberMarkus

Hi @nager,

thank you very much for creating and sharing this data! Vielen Dank!

When experimenting with it, I wrote a simple search script based on string normalization and simplification. My tests showed that a single normalized country name can refer to several countries at once. In general this is no big surprise, as one word can mean totally different things in different languages. But since I think such collisions are unlikely for country names, I had a look at the data.

CD.json, CG.json

The "Democratic Republic of the Congo" and the "Republic of the Congo" are often mistaken for each other. This also seems to be the case in CD.json and CG.json, where even the critical distinguishing word "democratic" is mixed up between the two files.

In both files, the normalized string "kongo" appears multiple times in translations that are probably out of date. I think that updating these translations will make "kongo" unique to only one country.

GF.json

The PT translation is very likely not "Guiana", but "Guiana Francesa". Changing this will make the normalized string "guiana" unique to only one country.

AS.json

The ID translation is very likely not "Amerika Serikat", but "Samoa Amerika". Changing this will make the normalized string "amerikaserikat" unique to only one country.

PowerShell Code used

Here is the quick and dirty PowerShell code I used for my experiments:

$TempDir = New-Item -Path (Join-Path $env:TEMP -ChildPath "CountryData-$(Get-Date -Format 'yyyyMMddHHmmss')") -ItemType Directory
$FilePath = Join-Path $TempDir -ChildPath 'countries.zip'

Invoke-WebRequest -Uri 'https://github.com/nager/Nager.Country/releases/latest/download/countries.zip' -OutFile $FilePath
Add-Type -AssemblyName 'System.IO.Compression.FileSystem'
[System.IO.Compression.ZipFile]::ExtractToDirectory($FilePath, $TempDir)
Remove-Item $FilePath -Force


$Countries = @{}
$Countries['Countries'] = [System.Collections.ArrayList]::new()
$Countries['NormalizedNamesToCommonName'] = @{}

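# -Filter is used instead of -Include, because -Include is only reliable together with -Recurse or a wildcard path
foreach ($inputFile in @(Get-ChildItem -LiteralPath $TempDir -Filter '*.json' -File)) {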
  $inputFileContent = Get-Content $inputFile.FullName -Raw -Encoding UTF8
  $inputFileContent = ConvertFrom-Json $inputFileContent | Select-Object *, sourceFile, normalizedNames

  $inputFileContent.sourceFile = $inputFile.Name

  $normalizedNames = New-Object System.Collections.Generic.HashSet[string]

  foreach ($property in @('commonName', 'officialName', 'nativeName', 'alpha2Code', 'alpha3Code', 'translations.name')) {
    if ($property -match '\.') {
      $temp = $inputFileContent.$(($property -split '\.')[0])

      ($property -split '\.' | Select-Object -Skip 1) | ForEach-Object {
        $temp = $temp.$_
      }
    } else {
      $temp = $inputFileContent.$property
    }

    $temp | Where-Object { $_ } | ForEach-Object {
      $tempB = ($_.Normalize('FormKD') -replace '[\p{M}\p{P}\p{S}\p{C}\p{Z}\s]').ToLower()
      if (-not [string]::IsNullOrWhiteSpace($tempB)) {
        [void]$normalizedNames.Add($tempB)
      }
    }
  }

  $normalizedNames | ForEach-Object {
    if (-not ($Countries['NormalizedNamesToCommonName']).ContainsKey("$($_)")) {
      $Countries['NormalizedNamesToCommonName']["$($_)"] = [System.Collections.ArrayList]::new()
    }
    [void]($Countries['NormalizedNamesToCommonName']["$($_)"]).Add($inputFileContent.commonName)
  }

  $normalizedNames = '|' + @($normalizedNames -join '|') + '|'

  $inputFileContent.normalizedNames = $normalizedNames

  [void]($Countries.Countries).add($inputFileContent)
}

$Countries | Add-Member -MemberType ScriptMethod -Name GetCountryByName -Force -Value {
  param([string]$SearchString)

  $SearchString = ($SearchString.Normalize('FormKD') -replace '[\p{M}\p{C}\p{Z}\s]').ToLower()

  if ($SearchString.StartsWith('.')) {
    $tempName = ".$($SearchString -replace '[\p{P}\p{S}]')"

    if ($this.Countries.tld -contains $tempName) {
      return $this.Countries | Where-Object { $_.tld -contains $tempName }
    }
  }

  if ($SearchString.StartsWith('+')) {
    # TrimStart treats its argument as a set of characters, so '0' strips all leading zeros
    $tempName = ($SearchString -replace '[\p{P}\p{S}]').TrimStart('0')

    if ($this.Countries.callingCodes -contains $tempName) {
      return $this.Countries | Where-Object { $_.callingCodes -contains $tempName }
    }
  }

  # Apply the remaining normalization logic to the input search string
  $SearchString = $SearchString -replace '[\p{P}\p{S}]'


  if ($SearchString -match '^\d+$') {
    return $this.Countries | Where-Object { $_.numericCode -eq $SearchString }
  }

  return $this.Countries | Where-Object { [System.Globalization.CultureInfo]::InvariantCulture.CompareInfo.IndexOf(
      $_.normalizedNames,
      "|$($SearchString)|",
      [System.Globalization.CompareOptions]::IgnoreCase -bor [System.Globalization.CompareOptions]::IgnoreNonSpace
    ) -ge 0 }
}

$Countries['NormalizedNamesToCommonName'].getenumerator() | ForEach-Object {
  if (($_.name[0] -notin @('.', '+')) -and ($_.value.count -gt 1)) {
    Write-Host "'$($_.name)' refers to $($_.value.count) countries"

    $_.value | ForEach-Object {
      Write-Host "  $($Countries.GetCountryByName($_).alpha2Code)"
    }
  }
}
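
To make the normalization step easier to try out in isolation, here is a minimal sketch. The function name ConvertTo-NormalizedName is hypothetical and not part of the script above; it only wraps the same Normalize / -replace / ToLower chain that the loop applies to each name. The sample inputs are taken from the examples in this issue, plus one accented name to show the effect of the decomposition.

```powershell
# Hypothetical helper (not part of the original script): wraps the same
# normalization chain that the loop above applies to every name.
function ConvertTo-NormalizedName {
  param([Parameter(Mandatory)][string]$Name)

  # FormKD decomposes accented letters into base letter + combining mark,
  # so that the \p{M} class can strip the marks afterwards.
  $decomposed = $Name.Normalize([System.Text.NormalizationForm]::FormKD)

  # Remove marks, punctuation, symbols, control characters and whitespace,
  # then lower-case the remainder.
  ($decomposed -replace '[\p{M}\p{P}\p{S}\p{C}\p{Z}\s]').ToLower()
}

ConvertTo-NormalizedName 'Amerika Serikat'   # amerikaserikat
ConvertTo-NormalizedName 'Guiana Francesa'   # guianafrancesa

# The umlaut is written as a code point so that the sketch does not depend
# on the encoding of the script file.
ConvertTo-NormalizedName "$([char]0x00D6)sterreich"   # osterreich
```

Because spaces and diacritics are dropped, differently written variants of one name collapse to a single key, which is also why a wrong translation such as the ones listed above makes one key point to two countries.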

Common question regarding alternate country names

Country names can be written in many different styles and wordings. Take our home country as an example: "Austria", "Republic of Austria", "Austrian Republic" and "Austria, Republic of" may not all be correct, but they all refer to the same country.

In many IT systems, the country is a free-text field, which makes it hard to tie that free text to an actual country. Do you have an idea how this could be solved with lightweight code?
