SgmlReader is an SGML markup language parser derived from System.Xml.XmlReader in the .NET CLR. (Portable class library version)

kekyo/CenterCLR.SgmlReader

SgmlReader for Portable Class Library

Status

  • NuGet Package: NuGet SgmlReader

  • NuGet Package (Older PCLs): NuGet SgmlReader (Older PCLs)

  • Continuous integration: AppVeyor SgmlReader

  • CI tests are currently broken. (The tests pass when run manually.)

What is this?

  • SgmlReader is an SGML markup language parser derived from System.Xml.XmlReader in the .NET CLR.
  • Its most popular use, however, is as an HTML parser. (It's a scraper!!)

Easy usage: parse HTML and get a true XDocument.

using System.IO;
using System.Xml.Linq;
using CenterCLR.Sgml;

// Open from stream
using (var stream = new FileStream("target.html", FileMode.Open, FileAccess.Read, FileShare.Read))
{
    // Parse in HTML mode (easy usage)
    XDocument document = SgmlReader.Parse(stream);

    // Manipulate the XDocument as needed...
}
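
As an illustration of what "manipulating the XDocument" can look like, here is a minimal sketch that collects the link targets from a parsed document using LINQ to XML. The names `LinkExtractor` and `ExtractLinks` are hypothetical and not part of SgmlReader. The sketch compares element and attribute names case-insensitively, on the assumption that the case of names in the parsed result may vary with the reader's case folding setting.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of SgmlReader.
public static class LinkExtractor
{
    // Returns the href value of every <a> element that has one.
    public static List<string> ExtractLinks(XDocument document)
    {
        return document
            .Descendants()
            // Match <a> regardless of case or namespace.
            .Where(e => string.Equals(e.Name.LocalName, "a", StringComparison.OrdinalIgnoreCase))
            // Pick the href attribute, if any, also ignoring case.
            .Select(e => e.Attributes().FirstOrDefault(
                a => string.Equals(a.Name.LocalName, "href", StringComparison.OrdinalIgnoreCase)))
            .Where(a => a != null)
            .Select(a => a.Value)
            .ToList();
    }
}
```

With the document from the example above, this could be called as `var links = LinkExtractor.ExtractLinks(document);`.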

Details

Usage of the external reference capability (SGML handling, not tested :-)

// Open stream
using (var stream = new FileStream("target.sgml", FileMode.Open, FileAccess.Read, FileShare.Read))
{
    // Note: this StreamReader is not used below; the stream itself is passed to SgmlReader
    var tr = new StreamReader(stream, Encoding.UTF8);

    // Define base uri
    var baseUri = new Uri("http://www.example.com/");

    // Setup SgmlReader
    var sgmlReader = new SgmlReader(stream, baseUri,

        // Stream opener delegate (Separate physical resource access)
        uri => new StreamInformation
            {
                Stream = WebRequest.Create(uri).GetResponse().GetResponseStream(),
                DefaultEncoding = Encoding.UTF8
            })

        {
            WhitespaceHandling = true,
            CaseFolding = CaseFolding.ToLower
        };

    // create document
    var document = new XmlDocument();
    document.PreserveWhitespace = true;
    document.XmlResolver = null;
    document.Load(sgmlReader);
}
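
Since SgmlReader derives from System.Xml.XmlReader, it should also be usable with other APIs that accept an XmlReader. The following is a minimal sketch, under that assumption, of loading a reader into an XDocument instead of an XmlDocument. The names `ReaderLoading` and `LoadAsXDocument` are hypothetical; `XDocument.Load(XmlReader, LoadOptions)` is the standard LINQ to XML method.

```csharp
using System.Xml;
using System.Xml.Linq;

// Hypothetical helper for illustration; not part of SgmlReader.
public static class ReaderLoading
{
    // Loads any XmlReader (for example, a configured SgmlReader)
    // into an XDocument, keeping whitespace nodes reported by the reader.
    public static XDocument LoadAsXDocument(XmlReader reader)
    {
        return XDocument.Load(reader, LoadOptions.PreserveWhitespace);
    }
}
```

In the example above, this would replace the XmlDocument lines with `var document = ReaderLoading.LoadAsXDocument(sgmlReader);`.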

Versions

  • 2018.8.31:
    • Fixed a failure to load the embedded Html.dtd resource.
  • 2018.8.30:
    • Support for .NET Framework 4.0, .NET Standard 1.0/2.0 and .NET Core 2.0.
    • Obsoleted all PCLs and net40-Client. If you use these platforms, pin the NuGet package to an older version.
    • Obsoleted key signing.
    • Switched to new MSBuild format.
  • 2017.6.12:
    • Support .NET Standard 1.0.
  • 2016.3.27.2:
    • Add .NET 3.5-Client/4.0-Client assembly (with serializable exception type).
  • 2016.3.27.1:
    • Refactor by target platforms.
    • Add PCL3 (dnxcore50).
  • 2014.12.7.3:
    • Add a one-line parse method.
  • 2014.12.7.2:
    • Handle the Stream class directly.
    • Initial parameters are set for HTML parse mode.
  • 2014.12.7.1:
    • Namespace changed to "CenterCLR.Sgml".
    • Easier usage; HTML parsing is the default mode.
    • Native store app library included.
  • 1.8.11.2014:
    • Initial release.

Derived original copyrights

/*
 * 
 * An XmlReader implementation for loading SGML (including HTML) converting it
 * to well formed XML, by adding missing quotes, empty attribute values, ignoring
 * duplicate attributes, case folding on tag names, adding missing closing tags
 * based on SGML DTD information, and so on.
 *
 * Copyright (c) 2002 Microsoft Corporation. All rights reserved. (Chris Lovett)
 *
 */

/*
 * 
 * Copyright (c) 2007-2013 MindTouch. All rights reserved.
 * www.mindtouch.com  oss@mindtouch.com
 *
 * For community documentation and downloads visit wiki.developer.mindtouch.com;
 * please review the licensing section.
 *
 * Licensed under the Apache License, Version 2.0 (the "License");
 * you may not use this file except in compliance with the License.
 * You may obtain a copy of the License at
 * 
 *     http://www.apache.org/licenses/LICENSE-2.0
 * 
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 *
 */
