[plug] New PLUG news server

Thu Sep 30 11:26:27 WST 2004

James Devenish <devenish at guild.uwa.edu.au> writes:

>In message <j0gq22xsne.ln2 at innovative.iinet.net.au>
>on Wed, Sep 29, 2004 at 06:39:47PM +0800, Bernd Felsche wrote:
>> You're asking lots of questions that are answered in the leafnode2
>> documentation.
>[...]
>> Maybe you should go back and review what I've written today.

>What questions, Bernd? My question was "Is someone going to pre-populate
>the groups with the existing articles, or is it intended that the
>newsgroups will only contain new threads?"

Re-constructing threads between before and after will require some
manual intervention.

On September the 11th this year I wrote:

	I've already pointed out that the newsgroups could be used
	as the archive; so no "duplicate" storage is required. It
	would require a local CGI to serve "archived" articles
	through a web interface.

It was not the intent to populate the main list newsgroup with
archive articles. The intent, which is to provide another, more
efficient means of searching the archive "newsgroup", should be
obvious.

The purpose of having archives is not simply to have them, but to be
able to search them effectively. Experience tells me that trying to
search a million messages by NNTP operating through a damp piece of
string is folly.

You put the search engine on the same server to maximise the
bandwidth to the data repository, and pass the search parameters to
the search engine. (Such search parameters can include scanning the
article body as well as the headers. NNTP/NNRP search abilities are
limited to what's in the overview database.) The user can then
utilise the search engine results in whatever way they choose:
extracting the whole thread by normal NNTP if they so desire, or
only a specific article.

Precisely *how* an archive would inter-operate with the active
newsgroup is as yet undetermined.

Personally, I'm looking at the leafnode code to see how it can cope
with "thread revival" and "thread expiry to archive" concepts. The
former could occur if somebody posts in response to an archive
article; bringing the entire thread back into the current list
context and hence preserving threading for relevant newsreaders.
Thread expiry would perform the reverse: instead of deleting threads
from the active newsgroup when they become stale (as is current
expire functionality), it would "retire" them to a specific archive
newsgroup.
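
As a rough illustration of the two concepts (not leafnode code; the
group names, the Thread record and the 90-day limit are all
assumptions made up for this sketch):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

ACTIVE_GROUP = "plug.list"            # hypothetical active newsgroup
ARCHIVE_GROUP = "plug.list.archive"   # hypothetical archive newsgroup


@dataclass
class Thread:
    root_id: str                 # Message-ID of the thread's first article
    last_post: datetime          # time of the newest article in the thread
    group: str = ACTIVE_GROUP    # where the thread currently lives


def retire_stale(threads, now, max_age=timedelta(days=90)):
    """Thread expiry to archive: move stale threads, don't delete them."""
    for t in threads:
        if t.group == ACTIVE_GROUP and now - t.last_post > max_age:
            t.group = ARCHIVE_GROUP
    return threads


def revive(thread, reply_time):
    """Thread revival: a reply to an archived article brings the whole
    thread back into the active newsgroup."""
    thread.group = ACTIVE_GROUP
    thread.last_post = reply_time
    return thread
```

The point of moving whole threads rather than single articles is
that newsreaders keep a complete References chain in one group.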

The combination of the above is primarily aimed at keeping the
active newsgroup size within reasonable bounds for performance
reasons. See below for a description (*) of some of them.

Such an archival scheme wouldn't impact on anybody's ability to read
the archive newsgroups directly. Posting to the archive would of
course be prohibited; the Followup-To: in the archives should be set
to the original active newsgroup(s).
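
A minimal sketch of that header rewrite, using Python's standard
email package. The group names are hypothetical, and how the
article actually gets into the archive spool is left out:

```python
from email import message_from_string

ACTIVE_GROUP = "plug.list"            # hypothetical active newsgroup
ARCHIVE_GROUP = "plug.list.archive"   # hypothetical archive newsgroup


def to_archive_copy(raw_article: str) -> str:
    """Return the article re-headed for the archive newsgroup.

    Followup-To is pointed back at the original newsgroup(s), so a
    reply to an archived article lands in the active group.
    """
    msg = message_from_string(raw_article)
    original_groups = msg.get("Newsgroups", ACTIVE_GROUP)
    # Deleting a header that is absent is harmless.
    del msg["Newsgroups"]
    del msg["Followup-To"]
    msg["Newsgroups"] = ARCHIVE_GROUP
    msg["Followup-To"] = original_groups
    return msg.as_string()
```

The Message-ID is deliberately left untouched so that References
in later replies still resolve.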

>All my other questions arise from sideways statements and
>distractions, like your attitudes like "it'd just be a pain for
>those who connect to the server for the first time" and "archived
>articles posted before the news server existed cannot logically be
>found in the newsgroup". 

The technical answer to the latter should be evident from the
boundary problems of "broken" threads and bogus message-IDs that are
not permissible in news. Manual intervention will be required to
*build* appropriate threads. It's not a trivial exercise.
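
To show where the "broken" threads come from, here is a simplified
sketch of threading by the References header. The input format (a
dict of Message-ID to its References list) is an assumption for
illustration; real articles would need their headers parsed first:

```python
def build_threads(articles):
    """articles: dict mapping Message-ID -> list of References,
    oldest first, as in the article's References header.

    Returns (children, orphans). children maps a Message-ID to the
    replies directly beneath it. orphans lists articles whose parent
    is not in the spool; those are the broken threads that need
    manual attention.
    """
    children = {mid: [] for mid in articles}
    orphans = []
    for mid, refs in articles.items():
        if not refs:
            continue              # no References: a thread root
        parent = refs[-1]         # last reference is the direct parent
        if parent in articles:
            children[parent].append(mid)
        else:
            orphans.append(mid)
    return children, orphans
```

An article imported from the old archive with a missing or bogus
message-ID ends up as the absent parent, and every reply to it
shows up in the orphans list.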

On September the 11th this year I also wrote:

	With news, I simply don't poll the server while I'm away.
	When I return, I can catch up on the last few hundred
	messages or the last week's messages in each newsgroup.

You probably saw that message on the PLUG list, so why ask about
somebody who goes away for 2 weeks missing out?

>Since I imagine most people use a mailreader to read their mail, I
>don't know why you now persist in referring generally to "the
>leafnode2 documentation".

Because you're talking about things that are best answered by
reading the freely-available leafnode2 documentation.

(*)
Leafnode and other small-system news servers have performance issues
when newsgroups get large. This has significant impact when the
newsgroups are volatile, i.e. subject to frequent change, as active
newsgroups are.

Leafnode maintains a separate .overview database of header
information for articles within the newsgroup. That database is
unsophisticated and essentially a flat file that's accessed
linearly. One solution is to replace that overview method with a
"proper" database backend that has the capacity to index and to
hold additional information, such as the newsgroup names for each
article. (NNTP overview doesn't provide the newsgroup names for
cross-posted articles.)
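
A minimal sketch of such a backend, using SQLite purely as an
example. The table and column names are invented here, and nothing
below is part of leafnode:

```python
import sqlite3


def open_overview(path=":memory:"):
    """Create (if needed) an indexed overview store and return it."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS article (
            message_id TEXT PRIMARY KEY,
            subject    TEXT,
            author     TEXT,
            posted     TEXT
        );
        -- One row per newsgroup an article was posted to, so that
        -- cross-posts can be resolved.
        CREATE TABLE IF NOT EXISTS posting (
            message_id TEXT REFERENCES article(message_id),
            newsgroup  TEXT,
            PRIMARY KEY (message_id, newsgroup)
        );
        CREATE INDEX IF NOT EXISTS posting_by_group
            ON posting(newsgroup);
    """)
    return db


def add_article(db, message_id, subject, author, posted, newsgroups):
    db.execute("INSERT INTO article VALUES (?, ?, ?, ?)",
               (message_id, subject, author, posted))
    db.executemany("INSERT INTO posting VALUES (?, ?)",
                   [(message_id, g) for g in newsgroups])
    db.commit()


def groups_for(db, message_id):
    """Newsgroup names for one article, sorted."""
    rows = db.execute(
        "SELECT newsgroup FROM posting WHERE message_id = ? "
        "ORDER BY newsgroup", (message_id,))
    return [r[0] for r in rows]
```

Lookups by message-ID or by newsgroup then go through an index
instead of a linear scan of a flat file.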

Such a method would allow for better scaling of the server, and
provide the potential for rapid search by external queries such as
the search engine discussed above, given additional external
processes to build, for example, word indexes based on article body
content.
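
By way of example, a word index over article bodies can be as
simple as the sketch below. It is a toy inverted index held in
memory; the function names and the tokenising rule are assumptions,
and a real indexer would persist its results:

```python
import re
from collections import defaultdict


def build_word_index(bodies):
    """bodies: dict mapping Message-ID -> article body text.

    Returns a mapping from each lower-cased word to the set of
    Message-IDs whose body contains it.
    """
    index = defaultdict(set)
    for mid, text in bodies.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(mid)
    return index


def search(index, query):
    """Return Message-IDs whose body contains every query word."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for w in words[1:]:
        result &= index.get(w, set())
    return result
```

The search returns message-IDs only; fetching the articles
themselves can still be done by ordinary NNTP.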

There are also fundamental issues relating to many files in the same
directory. Filesystems such as Reiserfs hash directory entries so
tens of thousands of files in the same directory (news spool) are a
minor inconvenience. But when you approach millions of files,
standard utilities hit the wall: the shell takes forever to glob,
even if it doesn't scream "arg list too long", and the prospect of
manual intervention becomes onerous. Files are still rapidly
addressable if you know the pathname, so the storage isn't broken.
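
One way around the globbing problem is to walk the directory one
entry at a time rather than expanding the whole listing first. A
sketch, assuming a spool directory where each article is a file
named by its article number (the function name is made up):

```python
import os


def count_articles(spool_dir):
    """Count article files without building the full listing.

    os.scandir yields entries lazily, so there is no argument list
    to overflow and no giant list held in memory.
    """
    count = 0
    with os.scandir(spool_dir) as entries:
        for entry in entries:
            # Assumption: articles are files with all-digit names.
            if entry.is_file() and entry.name.isdigit():
                count += 1
    return count
```

The same loop can do per-file work (moving, indexing) in place of
the counter.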
-- 
/"\ Bernd Felsche - Innovative Reckoning, Perth, Western Australia
\ /  ASCII ribbon campaign | I'm a .signature virus!
 X   against HTML mail     | Copy me into your ~/.signature
/ \  and postings          | to help me spread!