Dangl.TextConverter
Compatibility
This project targets netstandard1.3, netstandard2.0 and net45. Because .NET 4.5.2 is currently the oldest version supported by both Microsoft and the xUnit test suite, no tests are run for net45 and net451. No tests are run for .NET Core below the 2.0 release.
Project Configuration
If this project is consumed in a project that uses the full .NET Framework with a newer version of Antlr4.Runtime, the necessary assembly binding redirects are not generated automatically by the current dotnet CLI tooling. This is scheduled to be fixed with the 2.0 release. In the meantime, the following should be added to the consumer's csproj:
<PropertyGroup Condition=" '$(TargetFramework)' == 'net461' ">
    <AutoGenerateBindingRedirects>true</AutoGenerateBindingRedirects>
    <GenerateBindingRedirectsOutputType>true</GenerateBindingRedirectsOutputType>
</PropertyGroup>
The Condition=" '$(TargetFramework)' == 'net461' " attribute may be changed as necessary or removed.
Split Html by Class
The Dangl.TextConverter.Html.HtmlToText.ConvertHtmlToPlaintextAndSplitByClassname(string html, string[] classNamesToSplit) method transforms Html to plain text and additionally splits it by class names. It returns a list of objects that look like this:
{
    public string Text { get; }
    public HtmlNode HtmlNode { get; }
}
You can access the complete HtmlNode a segment was split on and get all the classes, attributes and other data you might need. HtmlNode is null for segments that were not split on a class name.
Example
The following Html:
<p>
    Intro
    <div id="077b8e46-31e6-45f5-b1b0-a8210e48259b" class="text-addition text-addition-owner" text-addition-label="12">
        <div class="text-addition-body">
            <span>Body</span>
        </div>
    </div>
    Outro
</p>
split on text-addition returns the following:
[
    {
        "Text": "Intro",
        "HtmlNode": null
    },
    {
        "Text": "Body",
        "HtmlNode": { "ClassNames": [ "text-addition", "text-addition-owner" ] }
    },
    {
        "Text": "Outro",
        "HtmlNode": null
    }
]
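As a rough illustration of how the method above might be called, the following sketch splits a shortened version of the example Html and prints each segment. It assumes that HtmlNode is the HtmlAgilityPack type (its GetAttributeValue method is used to read the class attribute); the class name SplitByClassExample and the shortened input string are made up for this example.

```csharp
using System;
using Dangl.TextConverter.Html;

public static class SplitByClassExample
{
    public static void Main()
    {
        // Shortened variant of the Html shown above (hypothetical input)
        var html = "<p>Intro<div class=\"text-addition text-addition-owner\">"
            + "<div class=\"text-addition-body\"><span>Body</span></div></div>Outro</p>";

        // Signature as documented above: (string html, string[] classNamesToSplit)
        var segments = HtmlToText.ConvertHtmlToPlaintextAndSplitByClassname(html, new[] { "text-addition" });

        foreach (var segment in segments)
        {
            // HtmlNode is null for segments that were not split on a class name.
            // Assumption: HtmlNode is HtmlAgilityPack.HtmlNode, which offers GetAttributeValue.
            var classes = segment.HtmlNode?.GetAttributeValue("class", string.Empty) ?? "(no node)";
            Console.WriteLine($"{segment.Text} [{classes}]");
        }
    }
}
```

Checking HtmlNode for null is the simplest way to tell split segments apart from the surrounding plain text.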
Transform Rtf Text to Segmented PlainText
By using the Dangl.TextConverter.Rtf.RtfToText.ConvertRtfToSegmentedText(string rtfInput) method, Rtf text is converted to plain text and segmented by Rtf bookmarks. This returns text segments that contain plain text representations of the texts as well as tags that indicate the opening and closing of bookmarks.
This is used, for example, in the GAEB & AVA .Net Libraries by DanglIT to work with text additions in GAEB 2000 files.
The following Rtf text:
{\rtf1The value is {\bkmkstart TA31}to be entered{\bkmkend TA31}}
will return the following segments (simplified example for demonstration):
[
    {
        "ClassName": "RtfTextSegment",
        "Text": "The value is "
    },
    {
        "ClassName": "RtfBookmarkStartSegment",
        "Identifier": "TA31"
    },
    {
        "ClassName": "RtfTextSegment",
        "Text": "to be entered"
    },
    {
        "ClassName": "RtfBookmarkEndSegment",
        "Identifier": "TA31"
    }
]
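A minimal sketch of calling the conversion on the Rtf text above could look like this. It assumes that ConvertRtfToSegmentedText can be called statically on RtfToText, which the text does not state explicitly, and it only prints the runtime type name of each segment so that it does not rely on any member names beyond those shown. The class name RtfSegmentsExample is made up for this example.

```csharp
using System;
using Dangl.TextConverter.Rtf;

public static class RtfSegmentsExample
{
    public static void Main()
    {
        // Verbatim string, so the Rtf backslashes and braces need no escaping
        var rtf = @"{\rtf1The value is {\bkmkstart TA31}to be entered{\bkmkend TA31}}";

        // Assumption: the method is static, like the HtmlToText methods
        var segments = RtfToText.ConvertRtfToSegmentedText(rtf);

        foreach (var segment in segments)
        {
            // Prints the segment kind, e.g. the class names listed in the example above
            Console.WriteLine(segment.GetType().Name);
        }
    }
}
```

Bookmark start and end segments carry the same identifier, so a consumer can pair them up to find the text that belongs to a bookmark.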
Rtf Line Start Normalization
The extension public static string StringLineStartNormalizationExtensions.NormalizeLineStarts(this string source) can be used to fix strings that are indented from the second line onward.
For example, the German GAEB 2000 standard uses data formats similar to this:
#begin[Field]This string starts here
             But on the second line it's indented!
             That should be normalized!
#end[Field]
If you extract the string between the #begin[Field] and the #end[Field] tags, you get something like this:
This string starts here
             But on the second line it's indented!
             That should be normalized!
All lines but the first are indented. To fix such strings, the extension method can be used.
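A short usage sketch for the extension method follows. The namespace in the using directive is an assumption, since the text only names the class StringLineStartNormalizationExtensions; adjust it to wherever the class actually lives. The class name NormalizeLineStartsExample and the exact amount of indentation in the input are made up for this example.

```csharp
using System;
using Dangl.TextConverter.Rtf; // Assumption: namespace of StringLineStartNormalizationExtensions

public static class NormalizeLineStartsExample
{
    public static void Main()
    {
        // Text as extracted between #begin[Field] and #end[Field]:
        // every line after the first carries leading indentation
        var extracted = "This string starts here\n"
            + "             But on the second line it's indented!\n"
            + "             That should be normalized!";

        // Extension method with the signature documented above
        var normalized = extracted.NormalizeLineStarts();

        Console.WriteLine(normalized);
    }
}
```

Based on the description above, the continuation lines should come back without their extra leading indentation.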
HtmlAgilityPack Compatibility
With version 1.10.0, some breaking API changes were introduced to the HtmlAgilityPack. The handling of self-closing or not-closed <p> tags was changed. This setting is controlled by the static HtmlDocument.DisableBehavaiorTagP property, which defaults to true. Internally, the HtmlToText class sets this property to false in its static method calls to maintain compatibility with downstream packages by DanglIT. If you independently rely on the new behavior, please make sure to wrap calls to Dangl.TextConverter's HtmlToText class and restore the option to its previous state afterwards.
private static void EnsureLegacyBehavior()
{
    if (HtmlDocument.DisableBehavaiorTagP)
    {
        HtmlDocument.DisableBehavaiorTagP = false;
    }
}
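One way to wrap calls and restore the option afterwards is sketched below. The helper class HtmlToTextWrapper and its method WithRestoredTagPBehavior are hypothetical names and not part of Dangl.TextConverter. The property name is copied verbatim from the text above, so please check its spelling against the HtmlAgilityPack version you use.

```csharp
using System;
using HtmlAgilityPack;

// Hypothetical helper, not part of Dangl.TextConverter
public static class HtmlToTextWrapper
{
    public static T WithRestoredTagPBehavior<T>(Func<T> convert)
    {
        // Remember the value the application relies on
        var previous = HtmlDocument.DisableBehavaiorTagP;
        try
        {
            // HtmlToText sets the option to false during this call
            return convert();
        }
        finally
        {
            // Restore the previous value even if the conversion throws
            HtmlDocument.DisableBehavaiorTagP = previous;
        }
    }
}
```

A call could then look like `HtmlToTextWrapper.WithRestoredTagPBehavior(() => HtmlToText.ConvertHtmlToPlaintextAndSplitByClassname(html, classNames))`. Because the option is static and therefore shared across the whole process, this sketch is not safe when other threads parse Html at the same time.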
Assembly Strong Naming & Usage in Signed Applications
This module produces strongly named assemblies when compiled. When consumers of this package require strongly named assemblies, for example because they are signed themselves, the outputs should work as-is.
The key file used to create the strong name is adjacent to the csproj file in the root of the source project. Please note that this does not increase security or provide tamper-proof binaries, as the key is available in the source code, per Microsoft guidelines.