10-05-07
Belgacom Skynet launches Skynet Exist
Skynet Exist (French part | Dutch part) is a social networking site aimed primarily at Belgians (most interactions are with people who live and work nearby) who want to gather in one place not only the digital content they create or are interested in, but also all the digital information they have compiled over time.
Its raison d’être is twofold.
The first goal was to build a Web 2.0-oriented application supporting the latest technologies.
The effort therefore went into implementing the most commonly used techniques:
- Aggregation of data in RSS/Atom
- Extensive use of folksonomies in the form of tags or tag clouds
- Clean and meaningful URLs
- Rich Internet application (Ajax)
The second goal was to join the social networking trend, which is now seen as a strategic component.
The main features of the site are searching by tags to find interesting profiles and aggregating content to share your centres of interest.
by Wang Olivier
16:01
Posted by Barbieri Sébastien
in General
24-04-07
Thesaurus multilingual: third part
And now, how to use it in a search?
The context (the term inside []) cannot be used in a search. For example:
=> fuels <-> fuels
NT fuels [motors] <-> NT carburant
NT fuels [heating] <-> NT combustible
Suppose a document contains: … “fuels” …
When a French user searches for “carburant”, the query is translated to:
“carburant” <or> 0.70 “fuels [motors]”.
The document will not be retrieved.
We have to search for:
“carburant” <or> 0.70 “fuels”.
So “[motors]” cannot simply be appended to a keyword and become part of it; it must be stored separately as a ‘context’. The context is displayed in the thesaurus administration interface, and in a drop-down list if you implement one in your search form, but it is not used in the search itself.
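As a minimal sketch of that rule, the snippet below keeps the context apart from the keyword and only sends the keyword to the search. The function names `split_context` and `search_term` are made up for this illustration; they are not part of any thesaurus product.

```python
import re

def split_context(label):
    """Split 'fuels [motors]' into ('fuels', 'motors').

    The context is None when the label has no trailing [..] part.
    """
    match = re.search(r"\[([^\]]*)\]\s*$", label)
    if match is None:
        return label.strip(), None
    return label[:match.start()].strip(), match.group(1)

def search_term(label):
    """Return the text that is actually searched: the keyword without its context."""
    return split_context(label)[0]
```

With this split, the administration interface or a drop-down list can still show the full label, while the query only ever contains `search_term(label)`.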
Search
In the monolingual thesaurus, the relationships get the following default weights:
Synonym (USE) = 80%
Narrower term (NT) = 50%
Related term (RT) = 10%
For a language relationship, we will use:
Language = 70%
A language relationship will be represented by the ISO language code, e.g. FR, NL, DE, EN, …
An expanded search will find documents with translated keywords, for example:
Exact equivalence:
‘Bike’ –> {NL}’fiets’
A document contains: ‘Een rode fiets’.
The user searches for ‘Bike’; the search will be on: ‘bike’ <or> 70% ‘fiets’.
Inexact equivalence
See exact equivalence.
Partial equivalence
See exact equivalence.
Single to multiple equivalence
Fuels <-> carburant + combustible
=> ‘fuels’ <-> ‘fuels’
NT ‘fuels’ [motors] <-> NT ‘carburant’
NT ‘fuels’ [heating] <-> NT ‘combustible’
A document contains: ‘le charbon est un combustible fossile’.
The user searches for ‘fuels’ (in English); the search will be:
‘Fuels’ <or> 70 % ‘fuels’
<or> 50 % (‘fuels’ <or> 70 % ‘carburant’)
<or> 50 % (‘fuels’ <or> 70 % ‘combustible’)
The document will be found with a weight of 50% x 70% = 35%
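The expansion above can be sketched in a few lines. This is only an illustration under assumptions: the `en:`/`fr:` prefixes, the `RELATIONS` table and the function names are invented here, and the weights are the default values listed earlier in this post. Weights are multiplied along the path, which is how 50% x 70% gives 35%.

```python
# Default weights from this post (USE = synonym, NT = narrower term,
# RT = related term, LANG = language relationship).
WEIGHTS = {"USE": 0.80, "NT": 0.50, "RT": 0.10, "LANG": 0.70}

# Hypothetical storage of the 'fuels' example: term -> [(relation, target)].
RELATIONS = {
    "en:fuels": [
        ("LANG", "fr:fuels"),
        ("NT", "en:fuels [motors]"),
        ("NT", "en:fuels [heating]"),
    ],
    "en:fuels [motors]": [("LANG", "fr:carburant")],
    "en:fuels [heating]": [("LANG", "fr:combustible")],
}

def expand(term, weight=1.0, found=None):
    """Follow the relations from term, multiplying the weights along the way."""
    if found is None:
        found = {}
    if found.get(term, 0.0) >= weight:
        return found  # already reached with an equal or better weight
    found[term] = weight
    for relation, target in RELATIONS.get(term, []):
        expand(target, weight * WEIGHTS[relation], found)
    return found

def search_weights(term):
    """Drop language prefix and [context]; keep the best weight per search word."""
    weights = {}
    for key, weight in expand(term).items():
        word = key.split(":", 1)[1].split(" [")[0]
        weights[word] = max(weight, weights.get(word, 0.0))
    return weights

# 'fuels' keeps weight 1.0; 'carburant' and 'combustible' get 0.5 * 0.7.
print(search_weights("en:fuels"))
```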
Different meaning
Sometimes one term has several meanings in the other language, for instance:
The French word ‘TOUR’ means towers, lathes, tours and rotation. This is solved by duplicating TOUR as follows:
TOUR [bâtiment] <-> towers
TOUR [outil] <-> lathes
TOUR [voyage] <-> tours
TOUR [rotation] <-> rotation
Conclusion
Implementing a multilingual thesaurus is a relatively light process.
Thesaurus multilingual: second part
The issue
In a country like Belgium, with 2 or 3 languages, there is a new issue.
When someone searches in French for “Belgique”, they may also be interested in documents containing “België” in Dutch or “Belgium” in English.
As explained in the previous post, a thesaurus is a tree of related words. We will suppose there are 2 languages and that both trees are identical. This makes it possible to create a new kind of relationship.
The difficulty is then to make the trees identical. Here are a few tips to reach that goal.
Some definitions
- Dominant language: language in which all terms exist
- Secondary language: language in which not all terms are present
- Loan term: term of the dominant language used within a secondary language to avoid a gap
Introduction
This study is based on the ‘equal status’ approach. When 2 languages have the same status, they have exactly the same tree structure. Only the synonyms (also called non-preferred terms) may be missing.
In practice, there is only one tree: each keyword has a label for each language, and only synonyms may be absent from some languages.
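One possible way to picture "one tree, one label per language" is sketched below. The `Concept` class, its fields and the choice of EN as the dominant language are assumptions made for this illustration only. When a label is missing in a secondary language, the dominant-language label is used as a loan term.

```python
from dataclasses import dataclass, field

@dataclass
class Concept:
    labels: dict                                   # language code -> preferred term
    synonyms: dict = field(default_factory=dict)   # language code -> non-preferred terms
    narrower: list = field(default_factory=list)   # child concepts (NT)

    def label(self, lang, dominant="EN"):
        """Preferred term in lang, or the dominant-language term as a loan term."""
        return self.labels.get(lang, self.labels[dominant])

# A single tree shared by both languages.
motors = Concept(labels={"EN": "fuels [motors]", "FR": "carburant"})
heating = Concept(labels={"EN": "fuels [heating]", "FR": "combustible"})
fuels = Concept(labels={"EN": "fuels"}, narrower=[motors, heating])

print(fuels.label("FR"))                        # loan term from the dominant language
print([c.label("FR") for c in fuels.narrower])
```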
To help build the same tree structure in different languages, the ISO 5964 standard gives some tips.
Exact equivalence
This is an ideal situation.

For instance :
VTT <-> mountain bike
Inexact equivalence
The 2 terms are not exactly equivalent. ISO recommends treating them as equivalent.

Partial equivalence
One term is broader than its translation. ISO recommends treating them as equivalent.

Single to multiple equivalence
One term is equivalent to the sum of 2 or more terms in the other language.

ISO recommends duplicating the first term, for instance:
Fuels <-> carburant + combustible
=> fuels <-> fuels
NT fuels [motors] <-> NT carburant
NT fuels [heating] <-> NT combustible
In the example, 'fuels' is loaned to French and split into carburant and combustible.
Different meaning
Sometimes one term has several meanings in the other language, for instance:
The French word ‘TOUR’ means towers, lathes, tours and rotation. This is solved by duplicating TOUR as follows:
TOUR [bâtiment] <-> towers
TOUR [outil] <-> lathes
TOUR [voyage] <-> tours
TOUR [rotation] <-> rotation
References
ISO 5964 – Guidelines for the establishment and development of multilingual thesauri
In the next post, we will see how to use it in a web search.
23-04-07
Full text search features
There are different kinds of full text search features.
The choice for a solution depends on the features you need.
Basic features
The basic features are:
- WORD: find a word in a text
- BOOLEAN: find ‘(a and b) or c’
- PHRASE: find a sequence of words in the same order, e.g. find ‘hello world’, which is different from ‘hello’ and ‘world’
- WILDCARD: search with truncation, like play*
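To make the four basic features concrete, here is a small sketch on a plain string. It is not how a real full-text engine works (no index, everything is scanned), and the function names are invented for the example.

```python
import fnmatch
import re

def tokenize(text):
    """Lower-case the text and split it into words."""
    return re.findall(r"\w+", text.lower())

def has_word(text, word):
    """WORD: the exact word occurs in the text."""
    return word.lower() in tokenize(text)

def has_phrase(text, phrase):
    """PHRASE: the words occur next to each other, in the same order."""
    words, target = tokenize(text), tokenize(phrase)
    return any(words[i:i + len(target)] == target
               for i in range(len(words) - len(target) + 1))

def has_wildcard(text, pattern):
    """WILDCARD: at least one word matches a truncated pattern like play*."""
    return any(fnmatch.fnmatchcase(word, pattern.lower()) for word in tokenize(text))

def a_and_b_or_c(text, a, b, c):
    """BOOLEAN: '(a and b) or c' built from WORD searches."""
    return (has_word(text, a) and has_word(text, b)) or has_word(text, c)
```

Note that `has_phrase("world hello", "hello world")` is false while both WORD searches succeed, which is exactly the difference between PHRASE and ‘hello’ and ‘world’.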
Advanced features
An FTR engine can be very sophisticated, but even if a feature seems very interesting, the questions are:
- does the customer understand it? Use it? e.g. NEAR
- is it relevant for my application? e.g. ‘stemming’, a feature that lets you find all derived words, such as play – playing – played. This is of course not effective in a multilanguage environment
- what maintenance does it need? e.g. ‘thesaurus’ is a very nice feature, but the knowledge (Brussels is a narrower term of Belgium, …) must be created and maintained, …
The advanced features are:
- NEAR(/n): find some words separated by at most n words.
Probably too complicated for a standard web user
- SENTENCE / PARAGRAPH: find some words in the same sentence / paragraph.
Probably too complicated for a standard web user
- STEMMING: find a word and its derived forms, e.g. go -> going, went, …
It only works in a single-language environment
- TYPO: search with spelling errors, like elipse -> ellipse.
Can be very interesting, as most people make typos
- SOUND LIKE: find a word and its ‘sound like’ terms, like ‘Gorbatchev’ - ‘Gorbatchov’
It is language dependent, so it only works in a single-language environment
- THESAURUS: a more sophisticated feature: find a word and its relations, synonyms, … e.g. Belgium -> Antwerp, Brussels, Nato, ECC, …
Requires a knowledge database and a lot of maintenance; can be implemented in a multilingual environment
- RELEVANCE RANKING: rank the hits by relevance (many matches in one text are more relevant than a few matches in another; the document size also matters)
Maybe interesting
- WEIGHT: related to relevance ranking, this is the possibility to give a weight to each word used in a search, e.g. Belgium 100% or Brussels 50%
Maybe, but already complicated for a standard web user
- MULTILANGUAGE: search in different languages, e.g. Belgium -> Belgique, België. This can be combined with weights
Maybe, but does a user want to find a document in another language?
- FILTER: be able to index documents such as Word, Excel, PowerPoint, PDF, …
Probably
- FUZZY: used to search in OCRed text. The OCR engine does not recognize all characters and some words are incorrect. This kind of search will find something like ‘Skynet’ in a text containing 'Sk@net', where the ‘y’ was not recognized by the OCR engine
Only interesting when you handle OCRed texts
- STOP: a list of words to skip, like ‘a’, ‘the’, …
Probably interesting to reduce the load on your search engine
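Three of these features (NEAR, TYPO and STOP) are simple enough to sketch with the Python standard library. This is a rough illustration, not what a search product does internally: the stop list, the 0.8 similarity cutoff and the function names are arbitrary choices for the example, and the typo lookup relies on `difflib.get_close_matches`.

```python
import difflib
import re

STOP_WORDS = {"a", "an", "the", "of", "in"}   # illustrative stop list

def words_of(text):
    return re.findall(r"\w+", text.lower())

def index_terms(text):
    """STOP: skip the stop words before indexing."""
    return [word for word in words_of(text) if word not in STOP_WORDS]

def near(text, first, second, n):
    """NEAR/n: both words occur with at most n words between them."""
    words = words_of(text)
    pos_first = [i for i, w in enumerate(words) if w == first.lower()]
    pos_second = [i for i, w in enumerate(words) if w == second.lower()]
    return any(i != j and abs(i - j) - 1 <= n
               for i in pos_first for j in pos_second)

def suggest(word, vocabulary):
    """TYPO: closest known word, or an empty list if nothing is similar enough."""
    return difflib.get_close_matches(word.lower(), vocabulary, n=1, cutoff=0.8)
```

For instance, `suggest("elipse", ["ellipse", "circle", "square"])` proposes `ellipse`, and `near("the red bike is fast", "red", "fast", 2)` is true because only two words separate ‘red’ and ‘fast’.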
Depending on your needs, you will have to choose a product (Google, Topic, an Oracle or SQL Server database, ...)
11:10
Posted by Melkebeke Thierry
in Technical
Tags: thesaurus, search, multilanguage
18-04-07
How to Store User Generated Content (part 2)
Light Multimedia Storage
From a technical and architectural point of view
Images/video stored in databases as binary Large Objects (LOB)
This is the ideal case when efficiency is not needed and data consistency is the only thing that matters.
With databases, you don't need to care about how many servers you have: as every server has access to the database, every server has access to the content.
Experience shows that it is ideal while the total size of User Generated Content (UGC) stays below 10 GB. Beyond that limit you will start experiencing performance problems and backup problems => huge restoration times => huge downtime.
With a UGC size > 20 GB, access time and query speed degrade dramatically. You may experience downtime, the maximum number of database connections being reached, deadlocks, excessive concurrent access, excessive IO…
In measurable terms, 20 GB of content means 200 users each uploading 10 pictures a day (1 MB in total) for 100 days.
Then you will need to redesign your solution.
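The 20 GB figure is easy to check. The small helper below is only a back-of-the-envelope calculation; it assumes that the 1 MB is the daily total per user and that 1 GB = 1000 MB, and the function name is ours.

```python
def ugc_size_gb(users, mb_per_user_per_day, days):
    """Total uploaded volume in GB, counting 1 GB as 1000 MB."""
    return users * mb_per_user_per_day * days / 1000

# 200 users x 1 MB per day x 100 days = 20 000 MB
print(ugc_size_gb(users=200, mb_per_user_per_day=1, days=100))
```

With the same upload rate, the 10 GB comfort zone mentioned above is already reached after 50 days.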
Another disadvantage of databases is that no real image or video file is directly available, so every time an image or video is requested you first have to extract it from the database and then give the user the temporary (or permanent) location of the file, which they download afterwards. This approach is known as caching.
Here are our graphs, based on our real results:
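A minimal sketch of such a cache is shown below, using SQLite only because it ships with Python. The `images` table, its columns and the `cached_path` function are hypothetical and are not the schema we used; a real setup would also need cache expiry and cleaning.

```python
import os
import sqlite3

def cached_path(conn, image_id, cache_dir):
    """Return a file path for the image, copying it out of the database on a cache miss."""
    path = os.path.join(cache_dir, f"{image_id}.bin")
    if not os.path.exists(path):
        # Cache miss: read the LOB once and write it to disk.
        row = conn.execute(
            "SELECT data FROM images WHERE id = ?", (image_id,)
        ).fetchone()
        if row is None:
            raise KeyError(image_id)
        with open(path, "wb") as handle:
            handle.write(row[0])
    # Cache hit (or freshly written file): the web server can serve this path.
    return path
```

Only the first request for an image hits the database; later requests are served from the file system, which is what takes the load off the database.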



Advantages / Disadvantages
Advantages:
- Data consistency is maintained by the database
- Easy to maintain
- Implicit cleaning
Disadvantages:
- If content is kept forever, the model is not a long-term design
- A caching mechanism has to be implemented
- The time needed to access the content may increase drastically if the database has a lot of IO (big tables, intensive requests)
Usage
This kind of UGC storage is ideal when the number of images is kept small and they don't last long (bargains, news…): anything which is limited in time and will be removed.
17:28
Posted by Barbieri Sébastien
in General
Tags: user generated content, ugc, lob, nas, light multimedia storage, database
26-03-07
Is Second Life close to reality?
The virtual community is growing fast. During the last quarter of 2006, the number of “residents” more than doubled. SL is now becoming a real phenomenon. Not only are early adopters talking about it, but even people who don’t surf (yes, such people still exist) know about this virtual world.
There are currently more than 4.8 million residents living in SL. Of course the majority of them aren’t active visitors, but nevertheless nobody foresaw this result.
Crazier still is the fact that real money circulates there. More than 1 million € is spent every day in SL. US dollars can be converted into Linden dollars at a rate of 1 US$ for 277 L$. Like in real life, you can earn or spend money. You can become rich or poor.
Residents can buy or sell land, products and clothes, and also pay taxes. Real companies are present too (Coca-Cola, Philips, ING, IBM, Nissan, etc.) in order to promote their brands and communicate about their product ranges.
Sooner or later they will test all their new products in SL before launching them in the real world. It will surely be more powerful and less risky to test the reactions and perceptions of residents on a large scale and across the different communities of SL. After all, the behavior of SL residents also mirrors that of real people, meaning real consumers.
The success of this virtual world is also due to the fact that people are in control. They create or produce what they want: avatar, house, rules, shops, games, movies, DVDs... there is almost no limit. They have full control over their own life. OK, it’s virtual, but for some people it’s already better than reality.
When we say anybody can do anything, it’s really the case.
Former Virginia governor M. Warner, when he intended to run for the presidency, made Second Life's first political whistle-stop as an avatar of himself. Duran Duran and Suzanne Vega performed live concerts, playing into microphones so that the music was transmitted through their avatars.
A few days ago Sweden announced it will be the first country to open a virtual embassy in the game. Online visitors to the "embassy" can learn how to apply for a visa or find general information, which Sweden hopes will help encourage real-world tourists.
Soon we may see the creation of a Second Life NFL with the same existing teams, playing the same matches at the same time, with the same sponsors, but maybe with different results. We can also imagine creating a Second Life Tennis Association with the same players taking part in the same Grand Slam and Masters Series tournaments, with the difference that Roger Federer won’t win almost all of them.
SL is part of the Internet hype: strong growth and a rising number of residents, which means a huge amount of personal data. As often in such cases, success attracts hackers. SL was hacked a few months ago, and the data of more than 650,000 users was stolen. The hackers reached the main database through the company's web servers. As a consequence, the passwords of all residents were changed. Nothing dramatic: just a lot of press coverage and an even faster increase in residents.
The technology used for this platform is not that impressive. Although the GUI is poor and certainly not as good as that of some online games, the code is evolving thanks to its users. Part of the source code is available so that users can modify it and plug software into the site, potentially accelerating Second Life's development.
So, is Second Life close to the real world? Well… of course it’s not that easy to change your life in 5 clicks, and you cannot get everything you want so easily. But the way people try to interact, organize their lives and set up rules, and the way companies communicate with residents, is not that far from the real world.
12:18
Posted by Massin Fabrice
in General
Tags: virtual community
The end of the 30 sec spot?
Most TV viewers hate advertising (the so-called “30-second spot”). The soccer fan doesn’t appreciate being forced to watch a commercial (even when it’s about beer) when his team is dominating the game, the teenager grumbles when his daily soap is cut twice in 30 minutes, and the mother doesn’t understand why her drama series is always interrupted in the middle of an important scene.
Of course advertisers know this. But they also know that people don’t want to miss part of their favorite TV program, and that’s why they choose to impose the 30-second spot. We know they know, and they know we know, but they don’t care about us. They just think about the money they receive.
So when the Digital Video Recorder (DVR) came on the market at a reasonable price, we, the “victims”, thought it was the end of ads. We would never again be obliged to watch something we don’t want when we don’t want to. Total freedom…
The advertisers tried to fight it, but they quickly understood they would never win this battle. This time they have to care about the viewer in order to keep a positive ROI. Because we are able to skip the spot, they are obliged to make it more interesting or funnier, and sometimes even tailored to our needs.
New and creative solutions are appearing, designed more specifically for TV/DVR usage. The purpose is to capture the viewer’s interest by giving him the choice to watch more, and in a different way. The advertising becomes less impersonal and brings a new experience to the viewer.
The deal between TiVo and Lexus is a good example of this kind of solution.
TiVo customers can watch information on the Lexus, locate the nearest dealer and schedule a test drive, and they can also virtually customize a car to their needs. And all this only if they want to.
The relationship between viewers and advertisers is certainly changing.