Data sources
OS2datascanner supports many data sources, and only a thin API layer is needed to connect more to the system. This document gives a brief overview of them.
Dropbox
Exchange Web Services
OS2datascanner can connect to a Microsoft Exchange installation, either locally
or in the cloud, using the
Exchange Web Services
API. (OS2datascanner uses the exchangelib
package as its implementation of the API.)
Try it out
(To the best of the project's knowledge, there exist no independent implementations of the Exchange Web Services API, so you'll need access to a functioning Exchange installation to follow these steps.)
Under the Exchangescanner tab in the administration system, choose Add scannerjob.
In the URL field, provide the domain whose emails should be scanned, with or
without a leading @
sign (for example, @company.example
).
In the Service endpoint field, provide the URL to the Exchange Web Services API instance. If you don't fill it in, then the EWS autodiscovery protocol will be used instead.
Under Brugernavn og Adgangskode, you'll need to provide the details of a
service account with the special role ApplicationImpersonation
. Having this
role lets a service account act on behalf of any other user in the same
management scope. (Naturally, OS2datascanner only uses this to read messages.)
Contact your system administrator if you don't have access to such an account.
(Note that Brugernavn should typically resemble an email address -- that is,
service-account@company.example
, not service-account
.)
Note that Exchange Web Services for Office 365 does not support this use of a service account as of December 31st, 2022. This means that EWS can only presently be used to communicate with on-premises installations.
EWS doesn't offer any way of discovering the users present in an Exchange installation, so you'll need to get that from Active Directory or from Azure AD. Choose the Organisatoriske enheder field to use OS2datascanner's LDAP support to automatically scan all of the users detected in your organisational hierarchy, or upload a UTF-8 text file with one account name on each line with Upload fil.
[comment:] # ## Filesystem scans
Google Workspace
OS2datascanner has initial support for scanning organisational Gmail and Google Drive accounts. Google Workspace does not support the OAuth2 client credentials flow used by the Microsoft Graph sources, so the use of these data sources requires a lot of manual configuration.
HTTP
OS2datascanner can scan web sites in two ways:
-
as a traditional crawler, that traverses all of the links and references found in a web site and scans these recursively; or
-
as a simple linear scan of all of the resources enumerated in a sitemap XML file (either one present on the site or one uploaded to the administration system).
As a policy decision, OS2datascanner does not honour the robots.txt
file, so
you should normally only run it on sites under your control.
Note that OS2datascanner's user agent advertises both its own version and that
of the underlying python-requests
library:
OS2datascanner 3.17.7 (python-requests/2.28.1) (+https://os2datascanner.dk/agent)
Be aware of this if you need to whitelist the user agent; in particular, make
sure that a blacklist rule for python-requests
doesn't take priority.
Notes on the crawler
The crawler implements a simple depth-first search of a website. Given a website with the following links:
index.html |
a.html |
b.html |
---|---|---|
a.html b.html c.html |
a1.html a2.png |
b1.html b2.jpg b3.html |
... a crawl starting at index.html
would emit links in the following order:
index.html
;- the links from
index.html
:a.html
,b.html
andc.html
; - the links from
a.html
:a1.html
anda2.png
; - the links from
b.html
:b1.html
,b2.jpg
andb3.html
; - the links from
a1.html
, if there were any; - etc.
Only links of the form <a href="" />
(where rel="nofollow"
is not set) and
<img src="" />
are treated as candidates for crawling. To avoid infinite
recursion, links are only crawled to a certain depth, configurable for each
installation.
Links are only crawled when they're to a "similar enough" domain. Links to
other domains will be emitted, to enable dead link detection, but not otherwise
explored or processed by the rest of the pipeline. The precise definition of
"similar enough" may vary between releases, but as of version 3.18.7 a scan of
example.com
would be permitted to explore links under all of the following
domains:
www.example.com
www2.example.com
m.example.com
ww1.example.com
ww2.example.com
en.example.com
da.example.com
secure.example.com
Upgrading the security of a connection is treated as "similar enough", but
downgrading it is not. (A scan of http://example.com/
is allowed to explore
links under https://example.com/
, but not vice-versa.)
If the crawler is configured to search a prefix, then links belonging to the
domain but not under that prefix will not be emitted at all. That is, while
scanning https://example.com/subtree/
, no link to
https://example.com/index.html
would be emitted, even if one was found.
Notes on sitemaps
When using a sitemap, OS2datascanner will emit the specified root page and the files enumerated in the sitemap, and nothing else. Crawling is disabled when using a sitemap, which can provide better performance.
Starting with release 3.22.2, OS2datascanner also supports
Google's image extensions to the sitemap schema. Earlier releases
do not support the extensions: only those links present in a <loc />
tag are
emitted.
OS2datascanner trusts the hints provided by a sitemap over the information
provided by HTTP headers: if the <lastmod />
element contains a last
modification date for a URL, then its Last-Modified
header value won't even
be fetched. (This header is often overridden by a proxy server or web cache, so
its value can be less reliable.)
OS2datascanner also implements a sitemap extension, the <hints />
element,
that can be used to give the same behaviour for the Content-Type
header:
<url>
<loc>
https://www.example.com/resources/2023/STD-2023-0001.PDF
</loc>
<lastmod>
2023-01-19
</lastmod>
<hints xmlns="https://ns.magenta.dk/schemas/sitemap-hints/0.1"
content-type="application/pdf" />
</url>
Using these two elements properly can greatly reduce the number of HTTP requests OS2datascanner must make.
Note that hints are only valid for the scan in which they were found: if
OS2datascanner finds a match in a file whose MIME type was specified by the
sitemap, then subsequent checkups for that file will retrieve the
Content-Type
header.
Try it out
The development environment includes a web server with a few conspicuous files.
Under the Webscanner tab in the administration system, choose _Add
scannerjob, and specify the URL http://nginx/
.
The web server can be scanned both with a sitemap (http://nginx/sitemap.xml
)
and without.
Microsoft Graph
OS2datascanner has support for scanning resources present in Microsoft Graph, and can participate in the normal OAuth2 client credentials flow to allow administrators to revocably delegate permissions to an OS2datascanner instance. Microsoft Graph can also be used as a source of organisational information.
Office 365 mails, OneDrive and SharePoint files, and calendar invitations are the only resources presently supported. (Microsoft restricts API access to Microsoft Teams, so this feature remains under internal test.)
Try it out
Log in to your Microsoft Graph tenant as a global administrator. Under the App registrations blade, choose New registration.
Choose a name for the application (OS2datascanner dev test, for example),
specify that it's a single tenant app, and give
http://localhost:8040/grants/msgraph/receive/
as a redirect URL (of type
Web).
Under the resulting Overview blade, copy the application ID and provide it to
the OS2datascanner administration system as the setting MSGRAPH_APP_ID
. Then
open the Certificates & secrets blade and create a new client secret. Copy
its value and provide it to the administration system as the setting
MSGRAPH_CLIENT_SECRET
.
Open the API permissions blade and give the application the following application permissions:
Calendars.Read
Directory.Read.All
Files.Read.All
Mail.Read
Sites.Read.All
(Because OS2datascanner doesn't operate in the context of a specific user, but rather of the organisation as a whole, these must be application permissions rather than delegated ones.)
Once you've done that, return to your OS2datascanner instance and choose one of the Office 365 scanner types. The first time you set one of these up, you'll be redirected to Microsoft and asked to confirm that you want your OS2datascanner instance to have access to your tenant; after this has been done once, OS2datascanner will remember the delegation and reuse it for future scanner jobs.
SMB
Using the libsmbclient
and pysmbc
packages, OS2datascanner can scan SMB
servers, better known as Windows network drives. These packages also give
OS2datascanner the ability to perform ad hoc authentication using normal
Windows login credentials, so there's no need to permanently enroll the scanner
engine's server into the Windows domain.
(OS2datascanner can also use the SMB support built into the operating system kernel, but this is deprecated, as it requires that certain scanner components be given higher privilege levels.)
Try it out
The development environment includes a Samba server which you can use to test SMB scans. Under the Filescanner tab in the administration system, choose Add scannerjob.
The URL field is used to specify the UNC path to the network drive you'd like
to scan. UNC paths are of the form //server-name/path/to/folder
(with either
forward or backward slashes). Fill in //samba/e2test
here.
(If your Windows environment maps the given UNC path to a drive letter, you can optionally provide that in the Drevbogstav field. This is only used for display purposes.)
Under the Brugeroplysninger section, leave the Brugerdomæne field empty,
and specify the username os2
and the password swordfish
.