All newspapers digitized for the Texas Digital Newspaper Program will be created according to the national standards set by the Library of Congress for the National Digital Newspaper Program (NDNP). See the Library of Congress’s Technical Guidelines for the full specification. We feel it is vital to meet the national standards for digitizing newspapers, both for preservation purposes and to provide a high level of functionality. The intention is to create a good-quality product that will stand the test of time. Here’s a very basic list of the requirements and file types involved in digitizing to this standard:

  • Create digital images from a preservation copy of microfilm, a clean second-generation duplicate silver negative.
  • Scan at 8-bit grayscale with a resolution of 400 dpi, if possible; otherwise between 300 and 400 dpi (relative to the size of the original newspaper).
  • Create the image output file as an uncompressed TIFF 6.0, from which JPEG2000, PDF, and text derivatives with the same file name will be made.
  • Capture a standards-based target film strip at the start of each session, to monitor equipment performance.
  • Split dual images into individual newspaper images as necessary.
  • Deskew images with more than 3% skew.
  • Crop page image files to the edge of the newspaper, retaining the original edge and up to a quarter inch beyond.
  • Incorporate tagged metadata relating to the creation of the images into the headers for all image deliverables (TIFF 6.0, JPEG2000, and PDF).
  • Produce grayscale images that have exactly the same dimensions, spatial resolution, skew, and cropping as the images used for OCR.
  • To support the goals of the NDNP, both structural and technical metadata will be created. The role of structural metadata is to relate pages to title, date, and edition; to sequence pages within an issue or section; and to identify image and OCR files according to the specifications.
  • Create OCR text conversion with the following specifications:
    • Deliver one OCR text file per page image, with a file name that corresponds to the appropriate page image.
    • Ensure that page images delivered to LC correspond exactly in dimensions, orientation, and skew to those used for the OCR.
    • Create text output in UTF-8 character set.
    • Ensure no graphic elements are embedded in the OCR text.
    • Order OCR text column-by-column.
    • Create OCR text file with bounding-box coordinate data at the word level.
    • Produce OCR text files that conform to the ALTO XML schema, version 1-1-041 or greater, with additional specifications as stated in Appendix B – File Format Profiles of the NDNP Technical Guidelines for Applicants.
    • Create a PDF image with hidden text, specifications below.
    • Provide, if possible, the confidence level data at the page, line, character, and/or word level.
    • Provide, if possible, the point size and font data at the character or word level.
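
To make the word-level OCR requirements above more concrete, here is a minimal Python sketch of reading word text, bounding-box coordinates, and word confidence out of an ALTO-style file. The sample document, its namespace URI, and the function names are assumptions made for illustration only; they are not taken from the NDNP guidelines, and real deliverables must follow the profile in Appendix B. The sketch relies on the standard ALTO attributes `CONTENT`, `HPOS`, `VPOS`, `WIDTH`, `HEIGHT`, and `WC` on `String` elements.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily trimmed ALTO-style page used only for illustration.
# The namespace URI is an assumption; the parser below ignores namespaces.
SAMPLE_ALTO = """<alto xmlns="http://schema.ccs-gmbh.com/ALTO">
  <Layout>
    <Page ID="P1" HEIGHT="6000" WIDTH="4000">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Texas" HPOS="100" VPOS="200" WIDTH="310" HEIGHT="60" WC="0.97"/>
            <SP WIDTH="20"/>
            <String CONTENT="News" HPOS="430" VPOS="200" WIDTH="280" HEIGHT="60" WC="0.88"/>
          </TextLine>
        </TextBlock>
      </PrintSpace>
    </Page>
  </Layout>
</alto>
"""


def local_name(tag):
    """Strip any '{namespace}' prefix so the code works across ALTO versions."""
    return tag.rsplit("}", 1)[-1]


def extract_words(alto_xml):
    """Return one dict per word with its text, bounding box, and confidence."""
    root = ET.fromstring(alto_xml.encode("utf-8"))
    words = []
    for elem in root.iter():
        if local_name(elem.tag) != "String":
            continue
        wc = elem.get("WC")  # word confidence is optional in the file
        words.append({
            "text": elem.get("CONTENT", ""),
            "x": float(elem.get("HPOS", 0)),
            "y": float(elem.get("VPOS", 0)),
            "width": float(elem.get("WIDTH", 0)),
            "height": float(elem.get("HEIGHT", 0)),
            "confidence": float(wc) if wc is not None else None,
        })
    return words


if __name__ == "__main__":
    for word in extract_words(SAMPLE_ALTO):
        print(word)
```

Because the coordinates refer to the page image used for OCR, this kind of lookup only works if the delivered images match the OCR images exactly in dimensions, orientation, and skew, which is why the list above insists on that correspondence.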