Object recognition is one of the most researched topics in computer vision; visual categories ranging from handwritten digits and human faces to everyday objects and animals have been extensively studied. Recent progress in deep learning (LeCun et al., 2015) has significantly advanced object recognition performance. From an image and video retrieval perspective, however, accurately recognizing specific objects is only the necessary first step; equally important is how to make use of object recognition systems and their results to help users find the content they are looking for. Researchers and practitioners should therefore always bear in mind that the ultimate purpose is to help index content more efficiently and effectively, and to make retrieval more accurate.
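As a minimal sketch of how this first step can feed retrieval, the following Python example (assuming torchvision with pretrained COCO detection weights; the file names and the 0.7 score threshold are illustrative placeholders, not values from any cited work) turns detector output into an inverted tag index, so that a keyword query maps directly to the images containing the queried object:

    from collections import defaultdict
    import torch
    from torchvision.io import read_image
    from torchvision.models.detection import (
        fasterrcnn_resnet50_fpn,
        FasterRCNN_ResNet50_FPN_Weights,
    )

    weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
    model = fasterrcnn_resnet50_fpn(weights=weights).eval()
    categories = weights.meta["categories"]  # COCO class names
    preprocess = weights.transforms()

    def object_tags(image_path, score_threshold=0.7):
        # Run the detector and keep the class names of confident detections.
        img = read_image(image_path)
        with torch.no_grad():
            pred = model([preprocess(img)])[0]
        return {categories[int(label)]
                for label, score in zip(pred["labels"], pred["scores"])
                if float(score) >= score_threshold}

    # Inverted index: tag -> image ids. Paths are placeholders; in practice
    # these would be video keyframes or archive images.
    index = defaultdict(set)
    for image_id, path in {"v1_frame3": "frame3.jpg",
                           "v1_frame9": "frame9.jpg"}.items():
        for tag in object_tags(path):
            index[tag].add(image_id)

The design point is that recognition output is not an end in itself: once detections are reduced to tags in an inverted index, object recognition plugs directly into standard keyword-based retrieval infrastructure.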
3.2. Scene level

After recognizing the objects in an image, interpreting and understanding the scene, and inferring semantic meaning beyond the objects, is extremely challenging (Xiao et al., 2013). For example, given a scene containing a group of people, being able to detect and recognize the people is not enough to understand what they are doing. Whether they are having a meeting, having dinner, or doing something else must be inferred from the object-level understanding. Whilst humans have this unique ability to reason beyond what can be seen, equipping machines with the same ability is very difficult.
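To make the meeting-versus-dinner example concrete, here is a deliberately simplistic sketch of object-level to scene-level inference. The evidence sets and scene labels are invented for this illustration; a real system would learn such object-scene relationships rather than hard-code them:

    # Toy illustration only: score coarse activity hypotheses by how much of
    # their characteristic object evidence was detected in the image.
    SCENE_EVIDENCE = {
        "meeting": {"person", "laptop", "tv", "chair"},
        "dinner": {"person", "dining table", "wine glass", "fork", "bowl"},
    }

    def infer_scene(detected_tags):
        # Fraction of each hypothesis's evidence set present in the image.
        scores = {scene: len(detected_tags & evidence) / len(evidence)
                  for scene, evidence in SCENE_EVIDENCE.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else "unknown"

    print(infer_scene({"person", "dining table", "fork"}))  # -> "dinner"

Even this toy version shows why the problem is hard: the same detected objects can support several hypotheses, and disambiguation requires context, relations, and world knowledge that object detectors alone do not provide.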
The task of scene understanding generally involves analyzing the 3D structure and layout of the scene and the spatial, functional, and semantic relationships between objects. Again, in the context of video retrieval, how best to harvest the results of scene understanding to make content easily and readily available needs much more research.

3.3. Semantic/language level

A picture may be worth a thousand words, but from an image retrieval perspective it is most useful to be able to find the words that accurately describe a picture. Recent years have seen much progress in image and video tagging and captioning (Hossain et al., 2019), where words and sentences are automatically generated to describe the visual content.
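As a sketch of how such automatically generated captions can serve retrieval, the following example uses one publicly available captioning model; the Hugging Face transformers library and the BLIP checkpoint are assumptions of convenience standing in for any captioner that returns a sentence per image, and the keyframe file name is a placeholder:

    from PIL import Image
    from transformers import BlipProcessor, BlipForConditionalGeneration

    checkpoint = "Salesforce/blip-image-captioning-base"
    processor = BlipProcessor.from_pretrained(checkpoint)
    model = BlipForConditionalGeneration.from_pretrained(checkpoint)

    def caption(image_path):
        # Generate a one-sentence description of the visual content.
        image = Image.open(image_path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=30)
        return processor.decode(out[0], skip_special_tokens=True)

    # The caption text can be indexed by any full-text search engine, so
    # users can query video keyframes with free-form language.
    print(caption("keyframe_001.jpg"))  # file name is a placeholder

The attraction for retrieval is that captions translate visual content into the same medium as user queries: once every keyframe carries a sentence, decades of mature text indexing and ranking techniques apply directly.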