Ingest

Ingest takes an AFF image and adds it to the Drives Database.

Images start on the Imaging System.

  • A task sends the images to the image archive. (Currently this is /usr/affs/sync.sh on the acquisition machine, which sends them to the appropriate directory on T and DOMEX.)
  • On the archive server, the image is encrypted if it is not already encrypted.
  • Get (Drive SN, YearOfImaging) from the AFF file. This is the drive identity tuple (DIT).
  • Check whether the DIT is already in the database; if so, report a failure.
  • Create a new database entry.
  • Using afxml, import each of the metadata fields into the drives table. Right now this is done with the domex/ingest_xml.py program.
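
The DIT check and the creation of the new entry could look roughly like the sketch below. It is only an illustration: the table layout, the column names, and the use of SQLite are assumptions, since this page does not describe the real drives table, and the helper names (open_db, ingest_drive, DuplicateDrive) are hypothetical rather than taken from domex/ingest_xml.py.

```python
import sqlite3

# Hypothetical schema: the real drives table is not described on this page.
# The UNIQUE constraint backs up the explicit DIT check below.
SCHEMA = """
CREATE TABLE IF NOT EXISTS drives (
    driveid INTEGER PRIMARY KEY,
    drive_sn TEXT NOT NULL,
    year_of_imaging INTEGER NOT NULL,
    UNIQUE (drive_sn, year_of_imaging)
)
"""


class DuplicateDrive(Exception):
    """Raised when the drive identity tuple (DIT) is already in the database."""


def open_db(path=":memory:"):
    """Open the database and make sure the (assumed) drives table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def ingest_drive(conn, drive_sn, year_of_imaging):
    """Insert a new drives row for the DIT and return its DriveID."""
    dit = (drive_sn, year_of_imaging)
    row = conn.execute(
        "SELECT driveid FROM drives WHERE drive_sn = ? AND year_of_imaging = ?",
        dit,
    ).fetchone()
    if row is not None:
        # The DIT is already known: report a failure instead of re-ingesting.
        raise DuplicateDrive(f"DIT {dit} already ingested as DriveID {row[0]}")
    cur = conn.execute(
        "INSERT INTO drives (drive_sn, year_of_imaging) VALUES (?, ?)", dit
    )
    conn.commit()
    return cur.lastrowid
```

The remaining metadata fields from afxml would then be written into the same row; that step is left out here because the field names are not listed on this page.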


Server Side

  • The script ingest/ingest.py reads each AFF file that's been added.
    • For each file that's not in the database, it extracts the metadata from the AFF file and adds it to the project database.
  • A background task then:
    • Figures out which files in the database have no matching WALK file.
    • Locks one such file (the locking mechanism is still to be decided).
    • Starts walking the file.
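
One possible way to do the locking step is a lock file created with an exclusive open. This is a sketch of one option, not the mechanism the project uses: the ".lock" suffix and the function names are hypothetical, and it assumes a filesystem on which exclusive file creation is atomic.

```python
import os
import socket


def try_lock(image_path):
    """Try to claim image_path by creating image_path + '.lock'.

    Returns True if this process created the lock file, and False if the
    lock file already existed (someone else is walking the image).
    """
    lock_path = image_path + ".lock"
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so only one
        # caller can succeed in creating it.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY, 0o644)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as f:
        # Record who holds the lock, to help spot stale locks later.
        f.write(f"{socket.gethostname()} {os.getpid()}\n")
    return True


def unlock(image_path):
    """Release a lock previously taken with try_lock."""
    os.remove(image_path + ".lock")
```

A walker would call try_lock before starting and unlock when the walk finishes. A lock left behind by a crashed walker would still need separate cleanup.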

walk_new.py

This is a script which is designed to be run on a single machine or a cluster. It:

  • Gets a list of all the AFF files.
  • Gets a list of all the XML files and each file's version number (the version of fiwalk that made it).
  • Detects any dead files (where fiwalk crashed).
  • Finds an AFF file for which there is no matching XML file.
    • We might want to prioritize, so it first walks the unwalked files.
    • If all are walked, it re-walks those that were walked with older versions of the walk program.
  • Locks it (the locking mechanism is still to be decided).
  • Starts walking.
  • We need a work-queue database table. Each entry should be locked to say "this is in process" and should record which machine is currently processing it.
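
The prioritization and the work-queue table described above might be sketched as follows. Everything specific here is an assumption made for illustration: the work_queue table and its columns, the use of SQLite, the dotted numeric form of the fiwalk version string, and the function names.

```python
import sqlite3

# Hypothetical table. walked_version is NULL for images never walked;
# in_process marks a claimed image and machine records which host claimed it.
SCHEMA = """
CREATE TABLE IF NOT EXISTS work_queue (
    aff_path TEXT PRIMARY KEY,
    walked_version TEXT,
    in_process INTEGER NOT NULL DEFAULT 0,
    machine TEXT
)
"""


def version_key(version):
    """Turn '0.5.10' into (0, 5, 10) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def pick_next(rows, current_version):
    """Choose the next image: unwalked first, then the oldest walk version.

    rows is a list of (aff_path, walked_version) for images not in process.
    Returns an aff_path, or None if everything is up to date.
    """
    unwalked = [path for path, ver in rows if ver is None]
    if unwalked:
        return sorted(unwalked)[0]
    stale = [(version_key(ver), path) for path, ver in rows
             if version_key(ver) < version_key(current_version)]
    return min(stale)[1] if stale else None


def claim_next(conn, machine, current_version):
    """Mark one image as in process for this machine and return its path."""
    while True:
        rows = conn.execute(
            "SELECT aff_path, walked_version FROM work_queue WHERE in_process = 0"
        ).fetchall()
        path = pick_next(rows, current_version)
        if path is None:
            return None
        # The WHERE clause makes the claim conditional: if another worker
        # claimed the row first, rowcount is 0 and we pick again.
        cur = conn.execute(
            "UPDATE work_queue SET in_process = 1, machine = ? "
            "WHERE aff_path = ? AND in_process = 0",
            (machine, path),
        )
        conn.commit()
        if cur.rowcount == 1:
            return path
```

Each walker would call claim_next with its own host name. When a walk finishes, it would store the new version in walked_version and clear in_process. A row that stays in process for too long would be one way to notice a dead walk.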

Currently missing

  • Copyright/license restrictions need to be noted on ingest.
  • It would be nice to have default rules: for example, ingest on this machine during this time frame is covered by this copyright.


DriveID

Every drive has a DriveID. This is an integer that we use to track the drive in the database. Early AFF files were named in the form driveid.AFF, but this is not a requirement.