XLinq XPS Parser

Cristian Merighi

I've started practicing with .NET Framework 3.5, so I tried using LINQ to XML to solve a tiny problem of mine: retrieving the text contained in an XPS document.
This article is obsolete. Some functionality might not work anymore. Comments are disabled.

Alright! Now that I have my brand new Visual Studio 2008 I can finally dive deep into .NET Framework 3.x!
Where to start? I'd say D/X/Linq and... <random mode="on">XPS documents!</random>

What's the goal? To parse an XPS package (document) and retrieve the text nested in it as a simple string object.

Well, you surely know that an XPS file is nothing more than a ZIP-compressed archive containing a precise structure of folders and files: XML files plus embedded resources such as images and fonts.
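To make the "folders and files" claim concrete, here is a minimal sketch that lists the parts of a package through System.IO.Packaging. The class and method names (XpsPartLister, ListParts) are made up for illustration and are not part of the article's code; the sketch assumes the XPS file is handed over as a readable, seekable stream.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Packaging; // lives in WindowsBase.dll on .NET Framework 3.x

// Hypothetical helper, not part of the original article.
static class XpsPartLister
{
    // Returns one "part URI [content type]" line for every part in the package.
    public static List<string> ListParts(Stream packageStream)
    {
        var result = new List<string>();
        using (Package package = Package.Open(packageStream, FileMode.Open, FileAccess.Read))
        {
            foreach (PackagePart part in package.GetParts())
            {
                result.Add(part.Uri + " [" + part.ContentType + "]");
            }
        }
        return result;
    }
}
```

Called with something like File.OpenRead(@"c:\MyPath\MyDoc.xps"), this should show the page, document and resource parts that the rest of the article navigates.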

Let's concentrate on the XML files and the structure: the XPS hierarchy basically provides a document sequence (a list of documents), each document being composed of a list of pages. The pages hold the content that I want to retrieve.
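The hierarchy is visible in the markup itself: as far as I know, a document sequence part lists its documents as DocumentReference elements, and a document part lists its pages as PageContent elements, both pointing at the next level through a Source attribute. The following sketch reads those references with LINQ to XML; XpsHierarchy and ListSources are hypothetical names, and the markup is passed in as a string to keep the example self-contained.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the original article.
static class XpsHierarchy
{
    static readonly XNamespace Xps = "http://schemas.microsoft.com/xps/2005/06";

    // Returns the Source attribute of every element named elementName,
    // e.g. "DocumentReference" in a sequence part or "PageContent" in a document part.
    public static List<string> ListSources(string markup, string elementName)
    {
        return XDocument.Parse(markup)
            .Descendants(Xps + elementName)
            .Select(e => (string)e.Attribute("Source"))
            .Where(s => s != null)
            .ToList();
    }
}
```

ListSources(sequenceMarkup, "DocumentReference") would give the documents, and ListSources(documentMarkup, "PageContent") the pages of one document. The reader interfaces used in the article expose the same two levels without touching the markup.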

Let's see if my new buddy C# 3.0 can help me...

// among the other referenced namespaces
using System;
using System.Windows.Xps.Packaging;
using System.IO.Packaging;
using System.Linq;
using System.Xml;
using System.Xml.Linq;

// ...[OMITTED CODE]
string path = @"c:\MyPath\MyDoc.xps";

// Open the package once and hand it to XpsDocument.
// (PackageStore.GetPackage returns null unless the package was
// registered beforehand with PackageStore.AddPackage.)
Package xpsPackage = Package.Open(path, System.IO.FileMode.Open, System.IO.FileAccess.Read);
XpsDocument xpsDocument = new XpsDocument(xpsPackage);

IXpsFixedDocumentSequenceReader ixps = xpsDocument.FixedDocumentSequenceReader;

// hard-coded index!
// in this sample code I assume there is at least one document in the package
// and I do not intend to iterate through the others.
IXpsFixedDocumentReader ireader = ixps.FixedDocuments[0];
// hard-coded index (2)!
// in this sample code I assume there is at least one page in this document
// and I do not intend to iterate through the others.
Uri pageUri = ireader.FixedPages[0].Uri;
PackagePart pagePkg = xpsPackage.GetPart(pageUri);

XmlReader xread = XmlReader.Create(pagePkg.GetStream());
// XLINQ on the stage...
XDocument xdoc = XDocument.Load(xread);
xread.Close();
// Take care of the page's default XML namespace!
XNamespace xmlns = "http://schemas.microsoft.com/xps/2005/06";
XName x = xmlns + "Glyphs"; // x expands to
                            // {http://schemas.microsoft.com/xps/2005/06}Glyphs
                            // this is the LINQ to XML way to combine a namespace and a local name
XName ox = "OriginX";
XName oy = "OriginY";
XName u = "UnicodeString";

// XLINQ query
// beware the weak sorting used in this query: it orders by OriginX first
// and OriginY second, which is not reading order!
// it's basically for academic purposes!
var glyphs = from n in xdoc.Descendants(x)
             where
             n.Attribute(ox) != null && n.Attribute(oy) != null
             orderby (double)n.Attribute(ox),
                 (double)n.Attribute(oy)
             select (string)n.Attribute(u);
             // what I get is an IEnumerable of string, that's what the query yields

// finally my page content as a single string
string content = string.Join(string.Empty, glyphs.ToArray<string>());
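The query above warns about its own sorting. As a possible refinement, here is a sketch that orders the glyph runs top to bottom first (OriginY) and left to right second (OriginX), which is closer to reading order for simple single-column pages. XpsPageText and Extract are hypothetical names, the space separator is my own assumption (the original joins with string.Empty), and the sketch ignores any RenderTransform on the glyphs or their containers.

```csharp
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the original article.
static class XpsPageText
{
    static readonly XNamespace Xps = "http://schemas.microsoft.com/xps/2005/06";

    // Joins the UnicodeString of every positioned Glyphs element,
    // sorted by OriginY first and OriginX second.
    public static string Extract(XDocument page)
    {
        var runs = from g in page.Descendants(Xps + "Glyphs")
                   where g.Attribute("OriginX") != null
                      && g.Attribute("OriginY") != null
                      && g.Attribute("UnicodeString") != null
                   orderby (double)g.Attribute("OriginY"),
                           (double)g.Attribute("OriginX")
                   select (string)g.Attribute("UnicodeString");

        return string.Join(" ", runs.ToArray());
    }
}
```

With the xdoc loaded earlier, XpsPageText.Extract(xdoc) would replace the query and the final string.Join. Wrapping the call in two loops over ixps.FixedDocuments and each document's FixedPages would remove the hard-coded indexes as well.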

Well... that's all. As usual an article pretty bare of words, but - I hope - with enough code to make it useful.

Take care. Bye.

Feedbacks

  • Re: XLinq XPS Parser

    JLeRogue, Wednesday, July 01, 2009

    Hi, could you link to the whole code? I've tried implementing this, but no luck. You say in the header you use 3.5, but in the body text, you say you use 3.0. Which is it? If I can get this to work, you are my hero, man. I haven't found anywhere else on the web where people are doing this.
