Dangl.TextConverter
Compatibility
This project targets netstandard1.3, netstandard2.0 and net45. Because .Net 4.5.2 is the oldest version still supported by both Microsoft and the xUnit test suite, no tests are run for net45 and net451. No tests are run for .NET Core below the 2.0 release either.
Project Configuration
If this project is consumed in a project using the full .Net framework with a newer version of Antlr4.Runtime, the necessary AssemblyBindingRedirects are not automatically generated by the current dotnet CLI tooling. This is scheduled to be fixed with the 2.0 release. In the meantime, the following should be added to the consumer's csproj:
<PropertyGroup Condition=" '$(TargetFramework)' == 'net461' ">
  <AutoGenerateBindingRedirects>true</AutoGenerateBindingRedirects>
  <GenerateBindingRedirectsOutputType>true</GenerateBindingRedirectsOutputType>
</PropertyGroup>
The Condition=" '$(TargetFramework)' == 'net461' " attribute may be changed as necessary or removed.
Split Html by Class
The Dangl.TextConverter.Html.HtmlToText.ConvertHtmlToPlaintextAndSplitByClassname(string html, string[] classNamesToSplit) method transforms Html to plain text and additionally splits it by class names. It returns a list of objects that look like this:
{
    public string Text { get; }
    public HtmlNode HtmlNode { get; }
}
You can access the complete HtmlNode that a segment was split on and get all the classes, attributes and other data you might need. HtmlNode is null for segments that were not split on a class name.
Example
The following Html:
<p>
    Intro
    <div id="077b8e46-31e6-45f5-b1b0-a8210e48259b" class="text-addition text-addition-owner" text-addition-label="12">
        <div class="text-addition-body">
            <span>Body</span>
        </div>
    </div>
    Outro
</p>
split on text-addition returns the following:
[
  {
    "Text": "Intro",
    "HtmlNode": null
  },
  {
    "Text": "Body",
    "HtmlNode": { "ClassNames": [ "text-addition", "text-addition-owner" ] }
  },
  {
    "Text": "Outro",
    "HtmlNode": null
  }
]
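A call producing such a list might look like the following sketch. It assumes that the method is static and that its result can be enumerated with foreach; neither is stated above, so adjust the call if the actual API differs. Only the Text and HtmlNode properties described above are used.

```csharp
using System;
using Dangl.TextConverter.Html;

public static class SplitByClassExample
{
    public static void Main()
    {
        // Condensed version of the Html from the example above
        var html = "<p>Intro<div class=\"text-addition text-addition-owner\">"
                 + "<div class=\"text-addition-body\"><span>Body</span></div>"
                 + "</div>Outro</p>";

        // Assumption: the method is static and returns an enumerable list
        var segments = HtmlToText.ConvertHtmlToPlaintextAndSplitByClassname(
            html, new[] { "text-addition" });

        foreach (var segment in segments)
        {
            // HtmlNode is null for text that was not split on a class name
            var origin = segment.HtmlNode == null ? "plain text" : "split element";
            Console.WriteLine($"{origin}: {segment.Text}");
        }
    }
}
```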
Transform Rtf Text to Segmented PlainText
By using the Dangl.TextConverter.Rtf.RtfToText.ConvertRtfToSegmentedText(string rtfInput) method, Rtf text is converted to plain text and segmented by Rtf bookmarks. This returns text segments that contain plain text representations of the texts as well as tags that indicate the opening and closing of bookmarks.
This is used, for example, in the GAEB & AVA .Net Libraries by DanglIT to work with text additions in GAEB 2000 files.
The following Rtf text:
{\rtf1The value is {\bkmkstart TA31}to be entered{\bkmkend TA31}}
will return the following segments (simplified example for demonstration):
[
  {
    "ClassName": "RtfTextSegment",
    "Text": "The value is "
  },
  {
    "ClassName": "RtfBookmarkStartSegment",
    "Identifier": "TA31"
  },
  {
    "ClassName": "RtfTextSegment",
    "Text": "to be entered"
  },
  {
    "ClassName": "RtfBookmarkEndSegment",
    "Identifier": "TA31"
  }
]
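As a rough sketch of how the segments could be inspected, the snippet below converts the Rtf text from above and prints the type name of every segment. It assumes that the method is static and that its result can be enumerated with foreach. It deliberately reads nothing but the runtime type name, so it makes no assumption about the members of the segment classes.

```csharp
using System;
using Dangl.TextConverter.Rtf;

public static class RtfSegmentsExample
{
    public static void Main()
    {
        // Verbatim string, so the Rtf backslashes need no escaping
        var rtf = @"{\rtf1The value is {\bkmkstart TA31}to be entered{\bkmkend TA31}}";

        // Assumption: the method is static and returns an enumerable of segments
        var segments = RtfToText.ConvertRtfToSegmentedText(rtf);

        foreach (var segment in segments)
        {
            // Distinguishes text segments from bookmark start and end segments
            Console.WriteLine(segment.GetType().Name);
        }
    }
}
```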
Rtf Line Start Normalization
The extension method public static string StringLineStartNormalizationExtensions.NormalizeLineStarts(this string source) can be used to fix strings in which every line after the first is indented.
For example, the German GAEB 2000 standard uses data formats similar to this:
#begin[Field]This string starts here
             But on the second line it's indented!
             That should be normalized!
#end[Field]
If you extract the string between the #begin[Field] and the #end[Field] tags, you get something like this:
This string starts here
             But on the second line it's indented!
             That should be normalized!
All lines but the first are indented. To fix such strings, the extension method can be used.
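To illustrate the idea, here is a minimal, self-contained sketch of one way such a normalization could work. It is not the library's implementation: the class LineStartSketch and the method RemoveCommonIndentation are hypothetical names, and the rule used here (strip the smallest indentation shared by all non-empty lines after the first) is an assumption about the intended behavior.

```csharp
using System;
using System.Linq;

public static class LineStartSketch
{
    // Hypothetical helper, not part of Dangl.TextConverter.
    // Removes the indentation shared by all lines after the first one.
    public static string RemoveCommonIndentation(string source)
    {
        if (string.IsNullOrEmpty(source)) return source;

        // Line endings are normalized to \n to keep the sketch simple
        var lines = source.Replace("\r\n", "\n").Split('\n');

        // Only non-empty lines after the first define the common indentation
        var indented = lines.Skip(1).Where(l => l.Trim().Length > 0).ToList();
        if (indented.Count == 0) return source;

        var common = indented.Min(l => l.Length - l.TrimStart().Length);

        for (var i = 1; i < lines.Length; i++)
        {
            var line = lines[i];
            // Whitespace-only lines may be shorter than the common indentation
            lines[i] = line.Length >= common ? line.Substring(common) : line.TrimStart();
        }

        return string.Join("\n", lines);
    }
}

public static class Program
{
    public static void Main()
    {
        var field = "This string starts here\n"
                  + "             But on the second line it's indented!\n"
                  + "             That should be normalized!";

        // Prints the three lines without the leading indentation
        Console.WriteLine(LineStartSketch.RemoveCommonIndentation(field));
    }
}
```

Indentation beyond the shared amount is kept, so nested indentation inside a field survives the normalization. The library method may treat such cases differently.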