Decision history

OSdatascanner is fairly well documented in a technical sense. That is, the code is mostly written in a self-explanatory way. However, there is a gap between understanding how something works and why it should work.

This chapter documents the decisions made along the way. Consult it before removing code or altering functionality.

Rule validation

Some rules have additional validation logic. This logic is intended to weed out false positives, and often stems from real-world cases.

CPRRule

The CPR rule matches 10-digit numbers through a regular expression. These numbers may only be separated at a specific place by specific symbols. The first six digits must not be separated. Digit 6 and digit 7 may be separated by a space, a tabulation, or one of the symbols "-" (with spaces on either side), "/" or ".". The last four digits must not be separated.
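
The matching described above can be sketched as a small Python pattern. This is an illustration only, not the expression shipped with OSdatascanner: the names CPR_LIKE and find_cpr_like are hypothetical, the separator is assumed to be optional, and the hyphen is assumed to be accepted both with and without surrounding spaces.

```python
import re

# Hypothetical pattern (not the one used by OSdatascanner):
# six unbroken digits, one optional separator, four unbroken digits.
# Assumption: the hyphen may appear with or without a space on either side.
CPR_LIKE = re.compile(r"\b(\d{6})(?:[ \t/.]| ?- ?)?(\d{4})\b")


def find_cpr_like(text):
    """Return each CPR-like number in text as a bare 10-digit string."""
    return [m.group(1) + m.group(2) for m in CPR_LIKE.finditer(text)]
```

A separator anywhere else, as in "0707 614285", breaks the match, and so does an 11-digit number, because the word boundaries require the ten digits to stand alone.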

In addition, CPRRule has a few validation options.

Modulus-11 check

For Danish CPR-numbers issued before the 1st of October 2007, the last digit is calculated by the "Modulus-11" method. CPR-numbers from some dates are exempt from this check. OSdatascanner's modulus-11 check takes this exception into account, and exempted numbers are considered valid CPR-numbers.

This validation serves as an initial method to ensure that any 10-digit number identified is indeed a legitimate and valid Danish CPR-number.
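
As a minimal sketch of the modulus-11 method: each of the ten digits is multiplied by a fixed weight, and the number passes when the weighted sum is divisible by 11. The function name and the exempt_dates parameter are hypothetical, and representing exempt dates as six-digit DDMMYY strings is an assumption. The actual list of exempt dates is not reproduced here.

```python
# Standard modulus-11 weights for the ten digits of a CPR-number.
WEIGHTS = (4, 3, 2, 7, 6, 5, 4, 3, 2, 1)


def passes_modulus_11(cpr, exempt_dates=frozenset()):
    """Check a bare 10-digit string against the modulus-11 rule.

    exempt_dates is a hypothetical set of DDMMYY strings for which
    the check is skipped and the number is accepted as valid.
    """
    if len(cpr) != 10 or not cpr.isdigit():
        return False
    if cpr[:6] in exempt_dates:
        return True
    total = sum(int(digit) * weight for digit, weight in zip(cpr, WEIGHTS))
    return total % 11 == 0
```

For the number 0707614285 the weighted sum is 154, which is 14 * 11, so the number passes.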

Probability check

The CPRRule can also calculate a probability for the CPR-number. The check works by calculating all valid CPR-numbers for the birth date of the checked number, and finding the index of the checked number in the sequence of all valid numbers.

Since numbers are generated sequentially, the later numbers are less likely to be in use than the earlier numbers in the sequence. If the CPR-number is not valid for the given birth date of the number, the probability is zero.

If the birth date of the CPR-number is in the future at the time of checking, the probability will always be zero.

The check was implemented in order for the report module to be able to show the most likely CPR matches in the UI first.
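
The index lookup behind the probability check can be illustrated roughly as follows. This sketch makes several assumptions: the sequence is taken to be all modulus-11-valid numbers for the date in ascending order, century rules and the future-date rule are ignored, and the mapping from index to probability is left out because it is not specified here. Both function names are hypothetical.

```python
WEIGHTS = (4, 3, 2, 7, 6, 5, 4, 3, 2, 1)


def valid_numbers_for_date(date6):
    """All modulus-11-valid numbers for a DDMMYY prefix, in ascending order (assumed order)."""
    candidates = (f"{date6}{suffix:04d}" for suffix in range(10000))
    return [
        c for c in candidates
        if sum(int(d) * w for d, w in zip(c, WEIGHTS)) % 11 == 0
    ]


def position_in_sequence(cpr):
    """Return (index, sequence length), or None if cpr is not valid for its date."""
    sequence = valid_numbers_for_date(cpr[:6])
    if cpr not in sequence:
        return None
    return sequence.index(cpr), len(sequence)
```

A number with a low index relative to the sequence length would be ranked as more likely to be in use, and a result of None corresponds to a probability of zero.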

Context check

The CPRRule context check validation consists of several checks.

The context check is implemented as a single option when adding the CPRRule to a custom rule, and essentially allows OSdatascanner to disregard CPR-like numbers based on the context present around the matched number.

Below, the different checks are presented in order of application.

Bin check

As the rule matches the content with a regular expression, it differentiates between all 10-digit numbers, and the numbers identified as CPR-numbers.

All found numbers are then divided into bins, and each bin can then allow its contained numbers to register as a match, based on a few rules:

Valid CPR-numbers make up at least 25% of all numbers in the bin.
Valid CPR-numbers make up at least 25% of the numbers in at least one neighbouring bin.

If neither of these requirements is met, the valid CPR-numbers in that bin are not considered further.

This check is implemented in consideration of a specific case, where a customer has spreadsheets with a large amount of 10-digit numbers, some of which may coincidentally be valid CPR-numbers.

This check makes sure that, in files with a few CPR-numbers in between large amounts of invalid numbers, we assume that none of the 10-digit numbers are CPR-numbers. If the file contains a local high density of valid numbers, those will still be considered further.

The number of bins is set to N / (3 * log N), where N is the total number of CPR-like numbers in a given object. With a cutoff value of 25%, and assuming that the probability of a random 10-digit number being a valid CPR-number is 0.4% and that the numbers are equally distributed between the bins, this ensures that the probability of getting any false positive is less than 0.1%.

The cutoff value of 25% is set to be considerably higher than the random chance that a 10-digit number will be a valid CPR-number: 3.72%. This number is calculated purely from the restrictions on the first 4 digits (day and month), and does not take into account modulus-11 or similar.
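
The bin check can be sketched roughly as below. Several details are assumptions rather than the actual implementation: the logarithm is taken to be the natural logarithm, numbers are assigned to bins as consecutive chunks in document order, and a bin is taken to pass when either its own share or a neighbouring bin's share of valid numbers reaches the cutoff. The function name is hypothetical.

```python
import math


def surviving_indices(flags, cutoff=0.25):
    """flags[i] is True when the i-th CPR-like number (in document order)
    is a valid CPR-number. Returns the indices that survive the bin check."""
    n = len(flags)
    if n == 0:
        return []
    # N / (3 * log N) bins; natural log and a minimum of one bin are assumptions.
    n_bins = max(1, int(n / (3 * math.log(n)))) if n > 1 else 1
    size = math.ceil(n / n_bins)
    # Consecutive chunks of equal size (the last one may be shorter).
    bins = [range(start, min(start + size, n)) for start in range(0, n, size)]
    ratios = [sum(flags[i] for i in b) / len(b) for b in bins]

    survivors = []
    for k, b in enumerate(bins):
        neighbours = [ratios[j] for j in (k - 1, k + 1) if 0 <= j < len(bins)]
        if ratios[k] >= cutoff or any(r >= cutoff for r in neighbours):
            survivors.extend(i for i in b if flags[i])
    return survivors
```

With one valid number among 100 CPR-like numbers, no bin reaches the cutoff and nothing survives. With ten valid numbers clustered at the end of the same object, the last bin is dense enough and all ten survive.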

Blacklist

If any of the words from the blacklist are present in the content of the scanned object, no matches are validated. The blacklist contains the following words: p-nr, p.nr, p-nummer, pnr, customer no, customer-no, bilagsnummer, order number, ordrenummer, fakturanummer, faknr, fak-nr, tullstatistisk, tullstatistik, test report no, protocol no., dhk:tx.

The words in the blacklist are unique words that we have only ever seen in certain types of files, and those types of files never contain Danish CPR numbers.

Whitelist

If the abbreviation "cpr" (case insensitive) is present up to 3 words away from the matched number, the match is validated by context, no matter the results of the following checks.

Surrounding words

Similar to the Blacklist context check, this context check invalidates a match based on the presence of specified words. However, in this case, the match is only invalidated if any of these words are found within three words of the matched number.

This approach allows for the exclusion of matches that are not true CPR numbers based on their surrounding context, without excluding all CPR-like numbers in the scanned source.

This feature was implemented in response to a customer request to enable the exclusion of matches based on their surrounding context.
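
The whitelist and surrounding-words checks both look at a window of three words on either side of the match, which can be sketched as below. The function names, the punctuation stripping, and the substring test for "cpr" are assumptions made for illustration. The real word extraction is more involved.

```python
def words_around(words, index, distance=3):
    """Words within `distance` positions of words[index], excluding the match itself."""
    before = words[max(0, index - distance):index]
    after = words[index + 1:index + 1 + distance]
    return before + after


def context_allows(words, index, blocked_words):
    """Hypothetical combination of the whitelist and surrounding-words checks."""
    nearby = [w.lower().strip(".,:;()") for w in words_around(words, index)]
    # Whitelist: "cpr" nearby validates the match regardless of later checks.
    if any("cpr" in w for w in nearby):
        return True
    # Surrounding words: a blocked word nearby invalidates the match.
    if any(w in blocked_words for w in nearby):
        return False
    return True
```

Because the whitelist is applied first, a match with both "cpr" and a blocked word nearby is still validated. A blocked word four or more words away has no effect.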

Historical notes

Up to and including version 3.28.0, OSdatascanner enforced two additional constraints as part of the context check:

any delimiters (regular, square, curly and angle parentheses, as well as various programming language comment markers) found within three words of a CPR number had to be balanced; and
the presence of the symbols +, -, !, #, or % within three words of a CPR number would cause it to be ignored.

These were developed several years ago in response to customer feedback about reducing false positives, but several customers subsequently reported to us that these constraints were too rigid and were causing the scanner engine to overlook real (test) CPR numbers. We have accordingly removed them in newer releases.

--

Up to and including version 3.29.2, OSdatascanner would also skip over a CPR number candidate if it was adjacent to a mixed-case word, on the assumption that mixed-case words might indicate a procedurally generated string. Many ordinary names violate this assumption, however, so we have removed this constraint.

--

Up to and including version 3.30.8, the cutoff value for the bin check was 15%. This was then updated to 25%. While the expected percentage of valid CPR-numbers in a list of random 10-digit numbers is less than 0.4%, this update was made on the basis of a customer-provided file of "P-numre", where many of the pages included between 16% and 20% valid CPR-numbers.

--

Up to and including version 3.32.1, OSdatascanner invalidated a CPR number candidate if the word immediately before or after it was another number that did not itself match the CPR-number regex, on the assumption that a non-CPR number adjacent to the candidate indicated it was really part of a larger number separated by spaces.

This assumption does not hold for tabular data. For example, in spreadsheets, CSV exports and OCR'd forms, a genuine CPR column is commonly followed by other numeric columns (phone numbers, postal codes, amounts) or by stray OCR-read digits. As a result, the check was quite prone to false negatives.

The spreadsheet false-positive case it probably was meant to address (large amounts of 10-digit numbers, some coincidentally valid) is handled more robustly by the bin check, which reasons about the density of valid numbers across the whole object rather than looking at a single neighbour.

This check has been removed. Removing it may result in more false positives if we're dealing with coincidentally valid numbers in random data such as log files, but until further field-testing has been conducted, the potential for false negatives outweighs that risk.

--

The above check was in fact only made toggleable (CPRRULE_CONTEXT_NUMBER_CHECK), not removed outright: on by default, off in dev./test, pending field-testing.

Investigating a related tabular-data false negative (#60766) found the actual cause: the surrounding-word extraction split times like "13:20" on the colon, so a trailing minute number could look like an unrelated number next to a CPR candidate in the next CSV row. A colon between two digits is now kept as part of the word, like "-", "." and "/". The colon rule is narrower than the rule for those three, which glue onto a word regardless of what is on either side: a colon only merges when there is a digit on each side of it (so "8:30am" keeps the "am" too, since it is still attached to the second digit), while a label's colon (e.g. "Account:") still splits off as before. This keeps the check's original purpose intact: catching a lone CPR-shaped fragment of a longer reference number split by a stray space, which the bin check cannot catch on its own.
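
The colon rule can be illustrated with a small tokenizer. This is a sketch under assumptions, not the actual word extraction: the pattern and the name split_words are hypothetical, and a "word" is assumed to be a run of word characters plus the glued symbols.

```python
import re

# Hypothetical word pattern: "-", "." and "/" always glue onto a word,
# while ":" only glues when there is a digit directly on both sides.
WORD = re.compile(r"(?:[\w\-./]|(?<=\d):(?=\d))+")


def split_words(text):
    """Split text into words using the hypothetical pattern above."""
    return WORD.findall(text)
```

Here split_words("09:15 13:20") keeps both times whole, so the "20" no longer appears as a separate number, while split_words("Account: 0707614285") still separates the label from the number.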

Exceptions

When adding a new CPRRule to a CustomRule, it is possible to define a list of 10-digit numbers, which should not be matched. All numbers present in this list are invalidated, even if they are valid Danish CPR-numbers.

This is implemented to answer a concern from some clients, after they looked up some false positives in the Danish CPR register, and confirmed that a lot of the numbers validated with the CPRRule are not in use. We considered consulting the register during a scan, but decided against it due to performance concerns.

In the future, we may implement a call to the CPR register that saves a temporary dump when a scan is run, which we can then refer to during the scan.

NameRule

OSdatascanner's name rule is made to match names belonging to people. The basic regular expression looks for at least two and up to five instances of individual names, each of which consists of up to two "simple names" connected by a hyphen. A simple name is a word consisting of an upper case letter followed by either nothing more, a period, or only lower case letters.

The first of these individual names is identified as the first name, the last is identified as the last name, and every instance in between is a middle name.
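
The structure described above can be sketched as a composed pattern. This is an illustration, not the expression used by the rule: the pattern names are hypothetical, names are assumed to be separated by single spaces, and the letter classes are limited to a-z plus the Danish letters æ, ø and å.

```python
import re

# A "simple name": an upper case letter followed by nothing, a period,
# or only lower case letters.
SIMPLE = r"[A-ZÆØÅ](?:\.|[a-zæøå]+)?"
# An individual name: one or two simple names connected by a hyphen.
NAME = rf"{SIMPLE}(?:-{SIMPLE})?"
# A full name: first name, up to three middle names, last name (two to five in total).
FULL = re.compile(rf"(?<!\w)({NAME})((?: {NAME}){{0,3}}) ({NAME})(?!\w)")


def find_names(text):
    """Return (first name, [middle names], last name) for each match."""
    return [
        (m.group(1), m.group(2).split(), m.group(3))
        for m in FULL.finditer(text)
    ]
```

A single capitalised word such as the first word of a sentence does not match on its own, because at least two individual names are required.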

Compare to list of names

The rule compares to a list of all first names and last names in Denmark from 2014. If the found first and last names are present in the lists, the match probability is 100%, otherwise, if only one of the found names is present in the lists, the match probability is 50%.

Expansive search

As an optional setting, the rule can expand its search. After identifying full names, the rule will aggressively search for strings that could potentially be part of a name (including individual capital letters). Any matches found in this expansive search have a probability of 10%.

Previously, this was default behaviour, but resulted in so many false positives that the rule was effectively unusable.

AddressRule

OSdatascanner can scan for addresses with the built-in AddressRule. This rule matches Danish addresses by the rules specified here.

Compare to list of addresses

For validation, matched streets are compared to a list of Danish street names. The found street names must be contained in this list for the rule to match.

Whitelist

This is the broadest kind of whitelisting currently available to this rule. It whitelists any house number on a given street name.

Whitelist Address

It's quite common for business addresses to be present in, for example, a footer on every page of a website. By using whitelist address, you can whitelist specific street name + house number combinations.

Pre-commit hooks

The OSdatascanner repository contains a number of pre-commit hooks, which are intended to be used for linting and testing.

Linting

We use flake8 for linting of Python. We have tried to use ruff as an alternative, but as it lacks parity with Pylint, it cannot fully substitute flake8.
Pylint rules which have been implemented: https://github.com/astral-sh/ruff/issues/970

Duplicate File Detection

OSdatascanner can optionally detect duplicate files during a scan and record statistics about them for later analysis. This is intended as a tool to help identify how much scanning work is being spent on identical copies of the same file.

Note that this feature does not affect which files are scanned. Every file is processed normally regardless of whether it has been seen before. Duplicate detection only collects statistics.

When CHECK_DUPLICATION is enabled, the worker by default computes a BLAKE2b hash of the full contents of each file after processing it. This is done through the content_identifier property now present on Resources, which should be overridden whenever hashing is not desired, such as for emails. BLAKE2b was chosen over alternatives such as SHA-256 for performance, while still providing strong collision resistance for non-adversarial use cases.

Full-file hashing is used rather than partial hashing (e.g. hashing only the first and last N bytes) in order to ensure that only identical files are flagged as such. If hashing times out or fails for any reason, the file is still scanned normally and the hash will not be stored.
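
Full-file BLAKE2b hashing is available in the Python standard library through hashlib.blake2b. The sketch below reads a stream in chunks so that large files do not need to fit in memory. The function name and chunk size are assumptions, and the sketch does not reproduce the content_identifier property or the timeout handling.

```python
import hashlib


def content_hash(stream, chunk_size=65536):
    """Return the BLAKE2b hex digest of everything readable from a binary stream."""
    digest = hashlib.blake2b()
    # Read until the stream returns an empty bytes object.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()
```

Two files produce the same digest only if every byte matches, which is the property that partial hashing would give up.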

The admin application's status collector then maintains a HashCache table for all active scans. When a status message arrives with a file hash:

The hash is inserted into HashCache. If this succeeds, the file is new for this scan and nothing further happens.
If the insert fails due to a uniqueness conflict, the hash has been seen before. A DuplicationStat record is created or updated to track the number of times this hash has been encountered.
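
The insert-then-count flow above can be illustrated with an in-memory SQLite database. This is only a sketch of the idea: the table and column names are hypothetical stand-ins for HashCache and DuplicationStat, uniqueness is assumed to be per scan, and the counter is assumed to count repeat sightings.

```python
import sqlite3


def make_db():
    """Create hypothetical stand-ins for the HashCache and DuplicationStat tables."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE hash_cache (scan_id INTEGER, hash TEXT, UNIQUE (scan_id, hash))")
    db.execute("CREATE TABLE duplication_stat (scan_id INTEGER, hash TEXT, times_seen INTEGER)")
    return db


def record_hash(db, scan_id, file_hash):
    """Return True if file_hash was already seen in this scan, False if it is new."""
    try:
        db.execute("INSERT INTO hash_cache VALUES (?, ?)", (scan_id, file_hash))
        return False  # insert succeeded: first sighting, nothing further happens
    except sqlite3.IntegrityError:
        # Uniqueness conflict: create or update the duplicate counter.
        cur = db.execute(
            "UPDATE duplication_stat SET times_seen = times_seen + 1 "
            "WHERE scan_id = ? AND hash = ?",
            (scan_id, file_hash),
        )
        if cur.rowcount == 0:
            db.execute("INSERT INTO duplication_stat VALUES (?, ?, 1)", (scan_id, file_hash))
        return True
```

Relying on the uniqueness constraint means the check and the insert happen in one statement, so no separate lookup is needed before inserting.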

The HashCache is cleaned up automatically when a scan completes or is cancelled, to avoid accumulating stale rows.

Per-scan conversion queues

Earlier versions of OSdatascanner sent conversion messages through shared queues - os2ds_conversions by default, or conversions_full and conversions_delta when the QUEUE_PRIORITY configuration was used to keep delta scans out of the way of full scans.

Cancelling a scan in that model meant either draining the shared queue while filtering out messages tagged for the cancelled scan - which is slow, and keeps workers doing wasted work for as long as the queue holds cancelled messages - or purging the queue entirely, which would also discard any other scans currently in flight. Neither was acceptable.

We now give each scan its own conversion queue, named after the scanner PK and the scan start time (osds_conversions.{pk}_{YYYYMMDDTHHMMSS}). Cancellation collapses to a single server-side queue_delete: the broker drops every undelivered message for the cancelled scan in one operation, and the workers' channels close cleanly because they receive a delete_queue broadcast first.
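
The naming scheme can be reproduced with a format string. This is a sketch only: the function name is hypothetical, and time zone handling of the scan start time is not covered.

```python
from datetime import datetime


def conversion_queue_name(scanner_pk, scan_start):
    """Build a queue name of the form osds_conversions.{pk}_{YYYYMMDDTHHMMSS}."""
    return f"osds_conversions.{scanner_pk}_{scan_start:%Y%m%dT%H%M%S}"
```

For example, scanner 42 with a scan started at 09:30:05 on 17 May 2024 yields osds_conversions.42_20240517T093005. Two scans of the same scanner get different queues because their start times differ.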

The trade-off is more queues and a slightly more involved discovery flow: workers subscribe to per-scan queues dynamically via broadcast CommandMessages (new_queue, delete_queue), and a worker_hello round-trip on worker startup makes sure restarted workers learn about ongoing scans they would otherwise have missed. The cost is small compared to actually being able to cancel a scan in milliseconds.

For configuration details, see RabbitMQ.