# Teedy File and Document Processing

# Searching and Tags

## Tags

- Tags can be nested. For example, you can create an "Insurance" tag and place the "Signal Iduna" and "Ammerländer" tags below it as child elements. If you search for "Ammerländer", you will only find documents that are tagged with "Ammerländer". If you search for "Insurance", you will find documents that are tagged with "Insurance", "Ammerländer" or "Signal Iduna".
- **Unfortunately, tags can be created twice! Attention!**

## Search operators

| Operator | Values | Explanation |
| --- | --- | --- |
| `by:` | String | The creator of the document |
| `tag:` | String | Documents with the given tag |
| `!tag:` | String | Documents without the given tag |
| `before:` | Date (allowed formats: `yyyy`, `yyyy-MM` or `yyyy-MM-dd`) | Created before the date |
| `ubefore:` | Date (allowed formats: `yyyy`, `yyyy-MM` or `yyyy-MM-dd`) | Edited before the date |
| `after:` | Date (allowed formats: `yyyy`, `yyyy-MM` or `yyyy-MM-dd`) | Created after the date |
| `uafter:` | Date (allowed formats: `yyyy`, `yyyy-MM` or `yyyy-MM-dd`) | Edited after the date |
| `at:` | Date (allowed formats: `yyyy`, `yyyy-MM` or `yyyy-MM-dd`) | Created at the date |
| `uat:` | Date (allowed formats: `yyyy`, `yyyy-MM` or `yyyy-MM-dd`) | Edited at the date |
| `lang:` | `eng`, `fra`, `ita`, `deu`, `spa`, `por`, `pol`, `rus`, `ukr`, `ara`, `hin`, `chi_sim`, `chi_tra`, `jpn`, `tha`, `kor`, `nld`, `tur`, `heb` | Language |
| `mime:` | `image/jpeg`, `application/zip`, `application/pdf`, `image/png`, `text/csv`, `text/plain`, `application/vnd.openxmlformats-officedocument.presentationml.presentation`, `application/vnd.openxmlformats-officedocument.wordprocessingml.document`, `application/octet-stream` | Does not work yet! |
| `shared:` | `yes`, `no` | |
| `workflow:` | `me`, String | |
| `full:` | String | Uses OCR full-text search (files must have been processed with Tesseract!). Full search is the default since Teedy 1.9 |
| `simple:` | String | Performs a simple search instead of a full search (ignores OCR) |
| `*` | | Wildcard. Only possible at the end of the search input string; not allowed before or inside a word |
| `\|` | | Pipe operator. Use it to combine terms with "or". Example: `green\|duck` finds documents which have "green" or "duck" in the title |
| `"<string>"` | | Phrases can be put in quotes, which returns a more exact result. For example, `"a green duck"` returns documents with the exact title "a green duck", whereas `a green duck` returns documents which contain "a", "green" or "duck" |

Anything that cannot be expressed with the search operators above can instead be queried with SQL directly against the H2 or PostgreSQL database. You need appropriate database access to do this.
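
The search operators from the table can be combined into a single search string. The following is a minimal sketch of that idea, not Teedy's own code: `build_search` is a hypothetical helper, the host `teedy.example.org` is a placeholder, and the `/api/document/list` endpoint with its `search` parameter is an assumption you should verify against your instance. The sketch also assumes that operators and free-text terms may be mixed in one string, and it puts the free-text terms last so that a trailing wildcard stays at the end of the input.

```python
import re
from urllib.parse import urlencode

# Date formats listed in the operator table: yyyy, yyyy-MM, yyyy-MM-dd
DATE_PATTERN = re.compile(r"\d{4}(-\d{2}(-\d{2})?)?")
DATE_OPERATORS = {"before", "ubefore", "after", "uafter", "at", "uat"}


def build_search(terms, filters):
    """Combine (operator, value) pairs and free-text terms into one search string."""
    parts = []
    for operator, value in filters:
        # Reject dates that do not match one of the documented formats.
        if operator in DATE_OPERATORS and not DATE_PATTERN.fullmatch(value):
            raise ValueError(
                f"{operator}: expects yyyy, yyyy-MM or yyyy-MM-dd, got {value!r}"
            )
        parts.append(f"{operator}:{value}")
    # Free-text terms go last so a wildcard ends the search input.
    parts.extend(terms)
    return " ".join(parts)


query = build_search(
    terms=["invoice*"],
    filters=[("tag", "Insurance"), ("after", "2023-01"), ("lang", "deu")],
)
print(query)

# Hypothetical host; endpoint and parameter name are assumptions.
url = "https://teedy.example.org/api/document/list?" + urlencode({"search": query})
print(url)
```

The same string can be pasted into the search field of the web interface, which is the quickest way to check whether a combination behaves as expected.
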
You can find many useful SQL statements for querying your DMS in our Grafana dashboard → [Grafana Monitoring / Statistics](https://wiki.stadtfabrikanten.org/books/inventar-und-handbucher/page/grafana-monitoring-statistics "Grafana Monitoring / Statistics")

# Example scheme for document title for things like invoices

Check your `%PATH%` variable. This should contain the following executables

Note that the screenshot shows an older directory name.

After you enter the connection data, it is persisted in `%userprofile%\config\preferences\com.sismics.docs.importer.pref`.

### Start as daemon and test upload

The program can be started with the switch `-d`. It then polls the specified folder every 30 seconds and uploads any documents it finds to the DMS. The files are then deleted locally.

[Screenshot](https://wiki.stadtfabrikanten.org/uploads/images/gallery/2025-05/23K0K88bcQdw2Kkc-grafik.png)

### Install as Windows Service

Create the file `C:\Teedy\teedy-service.ps1`:

```powershell
Start-Process -WindowStyle hidden -FilePath C:\Teedy\docs-importer-win.exe -ArgumentList "-d"
```

Create a new task in Task Scheduler. Apologies for the German screenshots. Please replace "SismicsDocs" with "Teedy" everywhere.

[Screenshot](https://wiki.stadtfabrikanten.org/uploads/images/gallery/2025-05/kHu6S2fRo8qjHB6z-grafik.png)

[Screenshot](https://wiki.stadtfabrikanten.org/uploads/images/gallery/2025-05/jesZEsrhn0D4BNqq-grafik.png)

[Screenshot](https://wiki.stadtfabrikanten.org/uploads/images/gallery/2025-05/5yJ6I3DeAqIgLH5P-grafik.png)

[Screenshot](https://wiki.stadtfabrikanten.org/uploads/images/gallery/2025-05/7NXRNQ3u8O6iYScE-grafik.png)

[Screenshot](https://wiki.stadtfabrikanten.org/uploads/images/gallery/2025-05/u5qBGUwde7SEcnuV-grafik.png)

[Screenshot](https://wiki.stadtfabrikanten.org/uploads/images/gallery/2025-05/gLwsNX6JDHUxf3pr-grafik.png)

[Screenshot](https://wiki.stadtfabrikanten.org/uploads/images/gallery/2025-05/gmXChlQkS5FPuMHI-grafik.png)

Check whether the service is running. Look for `docs-importer-win.exe`.

[Screenshot](https://wiki.stadtfabrikanten.org/uploads/images/gallery/2025-05/8GqOEZBqrYWsBntr-grafik.png)

Create a new desktop shortcut for your share directory.

[Screenshot](https://wiki.stadtfabrikanten.org/uploads/images/gallery/2025-05/mHtwqOb80bL21cir-grafik.png)

# Optical Character Recognition (OCR) and Scanning

## Handling

OCR data is stored in the Teedy database table `t_file`, which contains the string column `fil_content_c`. In H2, the data is stored as a plain-text string. In PostgreSQL, the column has the data type `text`, and a normal SELECT returns only a number. The unencrypted OCR text can be read from the large object with an SQL statement such as:

```sql
SELECT fil_name_c,
       convert_from(loread(lo_open(fil_content_c::int, 131072), 999999999), 'UTF8')
FROM t_file
WHERE fil_deletedate_d IS NULL
  AND fil_content_c IS NOT NULL
LIMIT 1;
```

Teedy uses a built-in process runner to start the binary `tesseract` with a language parameter. This works if `tesseract` can be found via the `$PATH` (Linux) or `%PATH%` (Windows) environment variable.
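
The PATH requirement can be checked with a short script before you start debugging OCR problems. This is a minimal sketch, assuming it runs as the same user and with the same environment as the Teedy process. The helper names are illustrative and not part of Teedy; `--list-langs` is the tesseract option that prints the installed language packs.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract binary found via PATH, or None."""
    return shutil.which("tesseract")


def parse_languages(output):
    """Extract language codes from the output of `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines()]
    # Skip empty lines and the "List of available languages ..." header line.
    return [line for line in lines if line and not line.startswith("List of")]


def installed_languages(binary):
    """Ask the tesseract binary which language packs are installed."""
    result = subprocess.run([binary, "--list-langs"], capture_output=True, text=True)
    # Some releases print the list to stderr instead of stdout.
    return parse_languages(result.stdout or result.stderr)


binary = find_tesseract()
if binary is None:
    print("tesseract was not found in PATH")
else:
    print("tesseract found at", binary)
    print("installed languages:", installed_languages(binary))
```

If the binary is found but the language you need is missing from the list, install the matching language pack before re-running OCR.
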

## Fixing faulty fil\_content\_c data the easy way

A quick fix for the issue described in [https://github.com/sismics/docs/issues/451](https://github.com/sismics/docs/issues/451):

```sql
SELECT fil_content_c FROM t_file WHERE LENGTH(fil_content_c) > 6 ORDER BY fil_createdate_d DESC;

UPDATE t_file SET fil_content_c = NULL WHERE LENGTH(fil_content_c) > 6;
```

Converting LOB data to plain text (this was required at some point when updating from Teedy 1.8 to Teedy 1.9):

```sql
/*
Show items which end with useless linefeeds. We need to correct those because otherwise
we cannot continue with the following statements (casting "fil_content_c::int" will fail, among other issues).
The result may be empty.
*/
SELECT fil_id_c, fil_name_c, fil_content_c
FROM t_file
WHERE fil_content_c LIKE E'%\n';

/* Trim the linefeeds away (TRIM without LEADING or TRAILING strips both ends) */
UPDATE t_file
SET fil_content_c = TRIM(E'\n' FROM fil_content_c)
WHERE fil_content_c LIKE E'%\n';

/*
Show faulty data which would return
"invalid byte sequence for encoding "UTF8": 0x00" or similar.
First we build a function to check for valid UTF-8 bytea, because sometimes there is faulty data inside the DB.
The result may be empty.
*/
CREATE FUNCTION is_valid_utf8(bytea) RETURNS boolean
LANGUAGE plpgsql AS
$$BEGIN
    PERFORM convert_from($1, 'UTF8');
    RETURN TRUE;
EXCEPTION
    WHEN character_not_in_repertoire THEN
        RAISE WARNING '%', SQLERRM;
        RETURN FALSE;
END;$$;

SELECT fil_id_c,
       fil_name_c,
       loread(lo_open(fil_content_c::int, CAST(x'20000' AS integer)), 999999999) AS BYTE_DATA,
       LENGTH(loread(lo_open(fil_content_c::int, CAST(x'20000' AS integer)), 999999999)) AS LEN
FROM t_file
WHERE fil_content_c IS NOT NULL
  AND fil_content_c != ''
  AND LENGTH(fil_content_c) <= 6
  AND is_valid_utf8(fil_content_c::bytea) IS FALSE;

/* Set all items with faulty UTF-8 encoding to NULL (if the previous statement returned any) */
UPDATE t_file
SET fil_content_c = NULL
WHERE fil_content_c IS NOT NULL
  AND fil_content_c != ''
  AND LENGTH(fil_content_c) <= 6
  AND is_valid_utf8(fil_content_c::bytea) IS FALSE;

/* Select OCR content which is in LOB format (Large Object) and valid UTF-8 */
SELECT fil_id_c,
       fil_name_c,
       fil_content_c,
       fil_content_c::bytea, /* shows "invisible" data which does not trigger NULL or '' */
       loread(lo_open(fil_content_c::int, CAST(x'20000' AS integer)), 999999999) AS BYTE_DATA,
       LENGTH(loread(lo_open(fil_content_c::int, CAST(x'20000' AS integer)), 999999999)) AS LEN,
       /* We use the encoding we used to create the database. See setup instructions.
          Usually this is "UNICODE" or "UTF8" */
       convert_from(loread(lo_open(fil_content_c::int, CAST(x'20000' AS integer)), 999999999), 'UNICODE') AS "fil_content_c"
FROM t_file
WHERE fil_content_c IS NOT NULL
  AND fil_content_c != ''
  AND LENGTH(fil_content_c) <= 6
  AND is_valid_utf8(fil_content_c::bytea) IS TRUE
ORDER BY LEN ASC;

/*
Convert LOB data into plain text.
First we do it for a single, manually selected file identified by fil_id_c.
*/
UPDATE t_file
SET fil_content_c = convert_from(loread(lo_open(fil_content_c::int, CAST(x'20000' AS integer)), 999999999), 'UNICODE')::TEXT
WHERE fil_id_c = '13411bb0-12fd-4e25-b483-2e2d18b344ed';

/* Check the converted value */
SELECT fil_id_c, fil_name_c, fil_content_c
FROM t_file
WHERE fil_id_c = '13411bb0-12fd-4e25-b483-2e2d18b344ed';

/*
Now we do the mass processing from LOB to plain text.
DO NOT CONTINUE WITH THE OTHER STATEMENTS IF THIS ONE FAILS; CHECK THE STATEMENTS ABOVE AGAIN.
*/
UPDATE t_file
SET fil_content_c = convert_from(loread(lo_open(fil_content_c::int, CAST(x'20000' AS integer)), 999999999), 'UNICODE')::TEXT
WHERE fil_content_c IS NOT NULL
  AND fil_content_c != ''
  AND LENGTH(fil_content_c) <= 6
  AND is_valid_utf8(fil_content_c::bytea) IS TRUE;

/* Trim useless linefeeds again */
UPDATE t_file
SET fil_content_c = TRIM(E'\n' FROM fil_content_c)
WHERE fil_content_c LIKE E'%\n';

/*
Now that all LOB data is converted, we mass-process the remaining values with a length of
6 characters or fewer, because those OCR values are useless.
WARNING: DO NOT RUN THIS BEFORE CONVERTING, BECAUSE YOU WILL OVERWRITE THE LOB REFERENCES.
IF YOU DID, YOU WILL NEED TO REPROCESS ALL DOCUMENTS!
*/
UPDATE t_file
SET fil_content_c = NULL
WHERE fil_content_c IS NOT NULL
  AND fil_content_c != ''
  AND LENGTH(fil_content_c) <= 6;

/* Finally, check the values again visually */
SELECT fil_id_c, fil_name_c, fil_content_c
FROM t_file;

/* Finally, re-run the indexing from the web interface or the API to rebuild the search index */
```

## Tesseract OCR command line binary

Installing tesseract is simple. Note that different operating system versions ship different tesseract versions, and these versions differ in speed and quality. We found that tesseract 3 on Ubuntu 16 works much faster than tesseract 4 on Ubuntu 18.

https://github.com/tesseract-ocr/tesseract/wiki

### Installation

For Linux users:

```bash
# Install the regular version. This installs the most recent version packaged for your OS,
# so on an older system you may get an older tesseract.
sudo apt install tesseract-ocr tesseract-ocr-deu

# Install the devel version. See https://launchpad.net/~alex-p/+archive/ubuntu/tesseract-ocr-devel
sudo add-apt-repository ppa:alex-p/tesseract-ocr-devel
sudo apt-get update
sudo apt install tesseract-ocr tesseract-ocr-deu # add your desired languages here
```

For Windows users: [https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#4x-for-windows](https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#4x-for-windows)

### Critical optimization

[https://github.com/tesseract-ocr/tesseract/issues/2611](https://github.com/tesseract-ocr/tesseract/issues/2611)

Some users report that disabling multiprocessing in tesseract fixes speed problems. To do so, set the following environment variable using `export`. See also [Environment Configuration](https://wiki.stadtfabrikanten.org/books/inventar-und-handbucher/page/environment-configuration "Environment Configuration").

```bash
export OMP_THREAD_LIMIT=1
```

## Scanner Apps for Smartphones

There are a LOT of scanner apps in the Play Store, and most of them have nearly identical names. The following list is only a minimal overview of what is available on the web. We are mainly looking for open source applications.

- [Genius Scan](https://play.google.com/store/apps/details?id=com.thegrizzlylabs.geniusscan.free&hl=de)
- [CamScanner](https://play.google.com/store/apps/details?id=com.intsig.camscanner&hl=de&gl=US)
- [Notebloc](https://play.google.com/store/apps/details?id=com.notebloc.app&hl=en_IN)
- [OpenNoteScanner](https://github.com/ctodobom/OpenNoteScanner)
- [SwiftScan](https://swiftscan.app/de/index.html)

Wishes

- automatic upload to the DMS or sending by mail
- problem: what if you use multiple DMS instances? Then you will need multiple upload locations. None of the known apps support this. With app cloning, the scanner app could be duplicated so that each scanner app instance has its own configuration. Each instance can then send to the correct inbox of its DMS instance.