donderdag 3 augustus 2017

(Elastic) Searching?

No matter how much effort you put into bringing your information together in a repository, it remains dead and worthless unless your users can find what they are looking for.

By default Content Server comes with a number of options to search the content. 3 options to be exact: metadata only, which basically means just do regular SQL queries, Database Fulltext, the older indexing in oracle DB and OracleText.
Most people will use OracleText when fulltext searching is needed.

There are a number of other search engines around and I have started to take an interest in the NoSQL engines that are freely available on the web.

First one in Apache Lucene. You can get it for free and it does what is says it does: index content.

Second one, and the one that has the most buzz at the moment is ElasticSearch. ES is largely based on Lucene, both it offers a lot of additional features. ES is free to download and use, but it's the add-ons that will cost you, like the x-pack that will allow to implement security in your index.

Now I know that there are a number of integration's out there that already offer ES for WCC but as far as I know most are based on a crawler principle.

I started a project for myself to make my own integration. I wanted to hook ES into WCC the same way as OracleText. So basically the indexing of the document is part of the release cycle, and OTS is not used at all. In the config.cfg file SearchIndexerName will be set to ELASTIC. Searching is done from within Content Server, and it is Content Server that remains in control of the security on the documents.

Second point I wanted to achieve, is make sure that I could take advantage of the clustering capabilities of ElasticSearch. So I started of to create an integration that uses the Elastic TransportClient API.

There are 2 sides to this approach:
- it surely is very flexible, while being completely invisible to the users
but
- queries get executed on different servers via network connections, so there could be a possible overhead there.....

I am interested to see the feedback of the community.
  • What is your take on this? Sound like a good approach? 
  • Also, would it be worth while to have the possibility to combine fulltext with special fields like the DocTags fields for the Siebel integration?
  • Would you be interested in using this?
 Let me know....

So you have a performance issue............. Database

Content Server and the database. That is a very tight marriage indeed.

Everything you do, every single service that is called, uses the database, and most of them will issue several queries. If those queries do not execute fast enough, things will go south very fast.

So how do you get an indication that your database could be a bottleneck?

First place to look is the System Audit Page. This page as you may know has a number of sections that by default come up closed. It is worth while to open up the Database section. if this section lists a number of queries that are reported to have taken a long time, you should really start to look at your database.

Get in touch with your DBA's! It could be something simple as system settings of the database. Typically things like SGA, RAM of the server (always avoid swapping!).
Next look at the CSDS database connection provider. By default WCC will create a provider at install that will allow only 15 connections. If you system has a lot of concurrent users you will need to increase this. For people who have been around since before 11g: increasing this vaule in the config.cfg file of WCC itself does not work anymore. The connection pool is now manaed by WLS.

Interesting to note is that when you increase the connectionpool, this will not automatically increase the number of connections that are used for searching. WCC will limit the number of connections from the pool that it will use for OracleText work. If you have increased your connectionpool and you also want to give some extra air to OracleText make sure to add the following to your config.cfg:
MaxSearchConnections=15

Sometimes things can be a bit deeper though. I recently saw a case where the DB appeared to be doing virtually nothing: RAM was OK, CPU has very little usage, and yet the DB was slow.
It took us some time to find the cause. What we observed was that when processing started things went well for a couple of minutes and after that performance degraded. Looking at the performance graphs from grid control we noticed a lot of light-blue....

What happened was that the disk IO was too slow. Once the SGA fills up, it needs to 'commit'. In other words, the DB starts to write the changes it holds in it memory to the actual storage. Now in this case the storage was performing really bad. Turns out the storage was on virtualization. And believe me that is not a good idea. Keep the DB as bare-metal as possible.

Another thing that is often overlooked is OracleText. OTS will get fragmented.
Now what does that mean? Well, when OTS indexes documents it analyzes the text and then stores what words it found where. by the nature of WCC: documents get added by users 1 at a time, there will be multiple records for the same word. And that will slow down performance. The solution to this is to run a regular optimization of the index. Basically this takes all the records for a specific word and merges them together. This can have a serious impact on the performance of OTS. See the following article for details: DocID 1087777.1.
And if you can, update you database to version 12c. There is a major difference between 11g and 12c with regards to optimizing the index. 11g will limit the number of occurrences of a word to 2000 per record. So words that occur more than 2000 times in your repository will still have a fragmented index. 12c does not have this limit.
And by the way, if the standard way of getting fragmentation information takes to long (using ctx_report.index_stats) there is a faster and more direct way to at least get an estimate of the fragmentation:

select avg(tfrag) from
 ( select /*+ ORDERED USE_NL(i) INDEX(i DR$FT_IDCTEXT1$X) */ 
           i.token_text,
           (1-(least(round((sum(dbms_lob.getlength(i.token_info))/3800)+(0.50 - (1/3800))),count(*))/count(*)))*100 tfrag
   from (  select token_text, token_type 
           from text_user.
FT_IDCTEXT1$i sample(0.149)
           where rownum <= 100 ) 
         t, text_user.
FT_IDCTEXT1$i i
   where i.token_text = t.token_text
     and i.token_type = t.token_type
    group by i.token_text, i.token_type
);

This example applies when WCC reports that OT1 is the active index.

And don't forget: doing a rebuild of the index from within WCC will not solve fragmentation.

So you have a performance issue.............

When you have been around long enough you tend to get called in whenever things are not running smoothly.
Often I guess people expect that there will be some magic config setting that solves everything. Guess what, if this did exist, I surely don't know it.

No, solving performance issues is like peeling an onion down to the core. You have to look at all the different layers, not just the application.
What is often forgotten is that an application like Webcenter Content depends on the underlying architecture. This is probably my favorite analogy that a few people have already heard. WCC is a high speed train, but is the rails are bad, it won't move.

Over the years I have done my share of performance review and I have seen several things everybody should look at. Honestly, this would be too long for one single post, so I am going to split this up into several posts about this same topic. Yeah I know, that sounds like a cheap trick to up the article count...

These are the topics I will cover:
  • Database
  • Storage
  • Application itself
  • Network
Stay tuned.