Introduction

     NIST is looking at methods to improve automatic filtering, and we want to know how to prioritize our efforts. One of our projects is approximate matching (aka fuzzy hashing), which many tools, such as triage tools, already use. We need your input to help focus our efforts on the classes of approximate matching that will help you the most. For example, do you need to efficiently automate the filtering of image or text files? Where would a reduction in the amount of data be most beneficial? We can develop tests that address your needs.

     The results of this survey will help us determine which approximate matching algorithms are most needed. For example, files with similar content may have entirely different structures and would appear completely different if an inappropriate algorithm were applied; the similarities would go unnoticed. Color and grayscale versions of the same image would appear completely different under most existing schemes.

     These questions will help us test and report on the underlying capabilities of the tools and how those capabilities relate to real-world requirements. This will allow researchers to improve the tools and align them with practitioners' actual needs.

* 1. How many years of experience (even informal) do you have in digital forensics?

* 2. Do you know if any of your tools use approximate matching (aka fuzzy hashing)?

* 3. On a scale of 0 (not at all important) to 5 (very important), please rate the importance of each of the following uses of fuzzy hashing. (If you do not have an opinion about any of the following options, you can leave that particular option unchecked.)

  Scale: 0 (Not at all important), 1, 2, 3, 4, 5 (Very important)
Related Document Detection (identification of similar documents)
Embedded Object Identification (e.g., a JPEG within a Word document)
Identification of Code Versions (identification of patched or upgraded versions of software)
Fragment Detection (identification of the original document based on a fragment)
Network Correlation (reconstruction of data from the packets of files fragmented over the network)

* 4. What are the various kinds of digital objects that you need to filter out?

  Response options for each item: Yes, often / Yes, sometimes / No
Text
Images
Executable/Program files
Multimedia

* 5. On a scale of 0 to 3, please rate how important each of the following characteristics is for classifying files.
(If you have no opinion about any of the following options, you can leave that particular option unchecked.)

  Scale: 0 (Not at all important), 1, 2, 3 (Very important)
Content (keep all textual files in one category and non-textual files in other categories)
File Structure (e.g., text and PDF files both contain textual data but have different file structures; similarly JPEG and BMP)
File Type (keep text, PDF, DOC, and DOCX files all in separate categories)

* 6. On a scale of 0 to 3, please rate the importance of each of the following fundamental similarity measures.
(If you don't have an opinion about any of the following options, you can leave that particular option unchecked.)

  Scale: 0 (Not at all important), 1, 2, 3 (Very important)
Edit distance (the minimum number of operations required to transform one string into the other)
Length of longest common substring (a common contiguous substring of maximal length)
Length of longest common subsequence (a common subsequence of maximal length, whose characters need not be contiguous in either string but must preserve their order)
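
For respondents who want a concrete picture of the three measures above, the following is a minimal Python sketch. It is an illustration only, not part of the survey or of any particular tool: the function names are hypothetical, and the edit distance shown assumes unit-cost insertions, deletions, and substitutions (the Levenshtein variant).

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn a into b (unit costs assumed)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def longest_common_substring_length(a, b):
    """Length of the longest contiguous run of characters shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            # A matching pair extends the run ending at the previous pair;
            # a mismatch resets the run to zero.
            run = prev[j - 1] + 1 if ca == cb else 0
            curr.append(run)
            best = max(best, run)
        prev = curr
    return best


def longest_common_subsequence_length(a, b):
    """Length of the longest sequence of characters that appears in both
    a and b in the same order, not necessarily contiguously."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(edit_distance("kitten", "sitting"))
    print(longest_common_substring_length("kitten", "sitting"))
    print(longest_common_subsequence_length("kitten", "sitting"))
```

For the pair "kitten" and "sitting", the longest common substring is "itt" while the longest common subsequence is "ittn", which shows how the two measures can differ on the same input.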

* 7. When would you consider executable files/program files as similar? (Please select all that apply)

* 8. Rank the options below by their importance for approximate matching (aka fuzzy hashing). (1 = least important, 6 = most important)

* 9. Has the amount of your workload that includes file system information (e.g., an MFT or inodes are present) increased?

* 10. If you have any other comments or suggestions, please share them below.
